Using BioPython

Overview

Teaching: 0 min
Exercises: 15 min

Questions

What and how can we use the BioPython library?

Objectives

Basic understanding of BioPython usage

What is Biopython

From biopython.org: “Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.”

An example function

from Bio.SeqUtils import GC
GC("ACTGN")

The source code of this function can be found on Github, in the folder Bio, folder SeqUtils, file __init__.py (direct link).

def GC(seq):
    """Calculate G+C content, return percentage (as float between 0 and 100).
    Copes mixed case sequences, and with the ambiguous nucleotide S (G or C)
    when counting the G and C content.  The percentage is calculated against
    the full length, e.g.:
    >>> from Bio.SeqUtils import GC
    >>> GC("ACTGN")
    40.0
    Note that this will return zero for an empty sequence.
    """
    gc = sum(seq.count(x) for x in ['G', 'C', 'g', 'c', 'S', 's'])
    try:
        return gc * 100.0 / len(seq)
    except ZeroDivisionError:
        return 0.0

See also the file test_SeqUtils.py where the GC function is tested.

It is good practice to add some tests, positive and negative controls, for a function. A simpler way to do this is to add asserts:

def test_gc():
    seq = "ACGGGCTACCGTATAGGCAAGAGATGATGCCC"
    assert GC(seq) == 56.25

Run as:

test_gc()

There are programs that will automatically run all tests in a module (package) and report on which ones failed.

Working with sequences

Adapted from the Biopython wiki.

Suppose you have a GenBank file which you want to turn into a Fasta file. For example, let’s consider the file cor6_6.gb which corresponds to GenBank file X55053. This file is included in the data folder. Study the file: there is a lot of information there. How many gene sequences are in this file?

If you want to extract the sequences into a file in fasta format, you’d have to write quite a bit of Python code. Luckily, this has already been done by the Biopython developers. We can use this code as follows:

from Bio import SeqIO

sequences = SeqIO.parse("data/cor6_6.gb", "genbank")
count = SeqIO.write(sequences, "cor6_6.fasta", "fasta")

print("Converted %i records" % count)

An excellent tutorial on using BioPython can be found on Github.

Key Points

BioPython contains many useful functions for bioinformatics.

previous episode

Programming in Python for MBV3070 fall 2019 at UiO

lesson home