Developing larger programs

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How do I develop a larger program from scratch?

Objectives
  • Identify steps in program development.

  • Use finished code in functions.

NOTE this is a incompletely developed lesson, for example, the output of the different commands is not present.

Developing a program to plot GC Percentage of a bacterial genome

We want to create a plot of the frequency distribution of GC% of a genome in 100 bp windows. Goals:

Reading in the genome

We already know how to a fasta file. This time, we use genomic DNA, not RNA.

filename = 'data/NC_000913_E_coli_K12.fasta'
f = open(filename)
lines = f.readlines()
f.close()

print("Identifier:")
print(lines[0])
genome = ""
for line in lines[1:]:
    genome += line.strip()
print(len(genome))

Creating non-overlapping windows

We want to develop a program that splits a DNA sequence into 100 bp, non overlapping windows. It is always a good idea to start with a simpler example, make that work, and then expand to the real situation you need the program for.

So, let’s use a dummy DNA sequence first.

seq = 'A' * 10 + 'C' * 10 + 'G' * 10 + 'T' * 10
seq = 3 * seq
print(seq)

Now we split it into 10 base windows. We can us the range(start, stop, step) command to construct a loop for the indices:

length = len(seq)
window = 10
for i in range(0, length, window):
    print(i)

This gives us the starting index. The final index is the starting index + the length of the window. Let’s check we now get the right indices:

length = len(seq)
window = 10
for i in range(0, length, window):
    print(i, i + window)

Now we can slice the dna sequence:

length = len(seq)
window = 10
for i in range(0, length, window):
    print(i, i + window, seq[i: i + window])

What if we do 100 bp windows?

length = len(seq)
window = 100
for i in range(0, length, window):
    print(i, i + window, seq[i: i + window])

Creating non-overlapping windows, DNA

We have the basic code, let’s adjust it for our genome. We’ll test for the first 10 windows only here

length = 10000 # would normally use len(genome)
window = 100
for i in range(0, length, window):
    segment = genome[i: i + window]
    print(i, i + 10, segment)

GC percentage

Now we can start developing code for calculating the GC% for each segment. We have seen this before in one of the exercises:

dna = "ACGT"
g = dna.count("G")
c = dna.count("C")
gc = (g + c)*100/len(dna)
print(gc)
print(round(gc))

For simplicity, we will round off the percentage to whole integers.

Wrap this in a function

def perc_gc(dna):
    """Calculates percentage GC for a DNA sequence
       Rounds off to zero decimals"""
    dna = dna.upper()
    g = dna.count("G")
    c = dna.count("C")
    gc = (g + c)*100/len(dna)
    return(round(gc))

dna = "ACGT"
print(perc_gc(dna))

Use our function for the genome

length = len(genome)
window = 100
for i in range(0, length, window):
    segment = genome[i: i + window]
    print(perc_gc(segment))

We want to collect the percentages so we can plot a histogram og their frequencies. We’ll try a list of percentages first.

length = len(genome)
window = 100
GCs = []
for i in range(0, length, window):
    segment = genome[i: i + window]
    gc = perc_gc(segment)
    GCs.append(gc)

Let’s plot this! We use the hist (histogram) function from matplotlib:

import matplotlib.pyplot as plt
plt.hist(GCs)
plt.show()

With more resolution:

plt.hist(GCs, bins = 100)
plt.show()

Use dictionary instead

Here is a way to make a line plot of the frequencies. There are better ways, but this illustrates the principle with fairly easy code. First, we replace the list with a dictionary:

length = len(genome)
window = 100
GCs = {}
for i in range(0, length, window):
    segment = genome[i: i + window]
    gc = perc_gc(segment)
    GCs[gc] = GCs[gc] + 1

Hmm, this does not work as the dictionary is empty and we try to increase a non exitsing key + value combination.

We can set up the dictionary to have all possible keys (gc percentages), with 0 for all values:

GCs = {}
for i in range(0,100):
    GCs[i] = 0
GCs

Now we can fill the dictionary:

for i in range(0, length, window):
    segment = genome[i: i + window]
    gc = perc_gc(segment)
    GCs[gc] += 1

The full program than becomes:

length = len(genome)
window = 100
GCs = {}
for i in range(0,100):
    GCs[i] = 0
for i in range(0, length, window):
    segment = genome[i: i + window]
    gc = perc_gc(segment)
    GCs[gc] += 1
GCs

Plotting

Now we can make a line plot with keys on the x axis and values on the y:

plt.plot(GCs.keys(), GCs.values(), '-')
plt.show()

Adding axis labels:

plt.plot(GCs.keys(), GCs.values(), '-')
plt.xlabel("Percentage GC in 100 bp window")
plt.ylabel("Count")
plt.show()

Wrap all parts in a function

def get_genome(filename):
    """Reads a fasta file and splits into identifier (first line)
       and sequence (concatenation of all other lines)"""
    f = open(filename)
    lines = f.readlines()
    f.close()

    identifier = lines[0]
    genome = ""
    for line in lines[1:]:
        genome += line.strip()
    return(identifier, genome)

filename = 'data/NC_000913_E_coli_K12.fasta'
identifier, genome = get_genome(filename)
print(identifier, len(genome))
def get_gc_freq(genome, window):
    """Creates a dictionary with frequency of %GC for each window in genome"""
    length = len(genome)
    GCs = {}
    for i in range(0,100):
        GCs[i] = 0
    for i in range(0, length, window):
        segment = genome[i: i + window]
        gc = perc_gc(segment)
        GCs[gc] += 1
    return(GCs)

The main program becomes now very short!

# load genome
filename = 'data/NC_000913_E_coli_K12.fasta'
identifier, genome = get_genome(filename)
print(identifier, len(genome))

# get GC frequencies
window = 100
GCs = get_gc_freq(genome, window)

# plot
plt.plot(GCs.keys(), GCs.values(), '-')
plt.xlabel("Percentage GC in 100 bp window")
plt.ylabel("Count")
plt.show()

Let’s try the two other genomes in the data folder:

filename2 = 'data/AKVW01.1_Rhodobacter_sphaeroides.fasta'
id2, genome2 = get_genome(filename2)
print(id2, len(genome2))
GCs2 = get_gc_freq(genome2, window)
plt.plot(GCs2.keys(), GCs2.values(), '-')
plt.xlabel("Percantage GC in 100 bp window")
plt.ylabel("Count")
plt.show()

Can we plot two datasets together?

plt.plot(GCs.keys(), GCs.values(), '-')
plt.plot(GCs2.keys(), GCs2.values(), '-')
plt.xlabel("Percantage GC in 100 bp window")
plt.ylabel("Count")
plt.show()

What about the third dataset?

filename3 = 'data/AL844501.2_Plasmodium_falciparum.fasta'
id3, genome3 = get_genome(filename3)
print(id3, len(genome3))
GCs3 = get_gc_freq(genome3, window)
plt.plot(GCs.keys(), GCs.values(), '-')
plt.plot(GCs2.keys(), GCs2.values(), '-')
plt.plot(GCs3.keys(), GCs3.values(), '-')
plt.xlabel("Percantage GC in 100 bp window")
plt.ylabel("Count")
plt.show()

Scaling the Malaria data

GCs3 = get_gc_freq(genome3[0:7000000], window)
plt.plot(GCs.keys(), GCs.values(), '-')
plt.plot(GCs2.keys(), GCs2.values(), '-')
plt.plot(GCs3.keys(), GCs3.values(), '-')
plt.xlabel("Percantage GC in 100 bp window")
plt.ylabel("Count")
plt.show()

Add legend

GCs3 = get_gc_freq(genome3[0:7000000], window)
plt.plot(GCs.keys(), GCs.values(), '-', label = "E. coli")
plt.plot(GCs2.keys(), GCs2.values(), '-', label = "R. sphaeroides")
plt.plot(GCs3.keys(), GCs3.values(), '-', label = "P. falciparum")
plt.xlabel("Percantage GC in 100 bp window")
plt.ylabel("Count")
plt.legend()
plt.show()

Many styles of plot are available.

plt.style.use('ggplot')
plt.plot(GCs.keys(), GCs.values(), '-', label = "E. coli")
plt.plot(GCs2.keys(), GCs2.values(), '-', label = "R. sphaeroides")
plt.plot(GCs3.keys(), GCs3.values(), '-', label = "P. falciparum")
plt.xlabel("Percantage GC in 100 bp window")
plt.ylabel("Count")
plt.legend()
plt.show()

The data

The references used for alignment were

Key Points

  • Programs are developed in steps

  • Modularise program design using functions