CS364 Lab 3: Genome Assembly


Overview

In this lab you will implement your own de Bruijn graph assembler and apply it to a realistic (although synthetic) dataset. The goals of this lab are to develop a deeper understanding of de Bruijn graphs, gain experience going from theory to application, and understand the limitations of our current assembly tools.

Your program should be broken into two files. The first, debruijn.py will contain class definitions for your de Bruijn graph data structure. The second, assembly.py will be your main program that will orchestrate building, traversing, and interpreting the de Bruijn graph.


Getting Started

In your GitHub repo you should see the starter program files, as well as some example datasets. The first step is to construct a de Bruijn graph for a set of reads (in the FASTA file format). After creating the de Bruijn graph, we will traverse it to create contigs, which can then be written to a file. Some examples of the program execution are shown below to help you get an idea of what your program should do. They are explained in more detail below.

Here is some output from test1.fasta with k=3:

$ python3 assembly.py -r data/test1.fasta -k 3 -o results/test1

----------------------------------------
Welcome to the de Bruijn graph assembler
----------------------------------------

Creating de Bruijn graph with k=3...

Nodes: AA,AB,BA,BB

Edges:
AA: ['AA', 'AB']
AB: ['BB']
BB: ['BB', 'BB', 'BA']

Graph Eulerian? True

Starting path 1 / 1 from AA:
AAABBBBA

There should be two output files: results/test1_contigs.fasta and results/test1.png with a visualization of the graph. It should look like this:

Here is some output from test2.fasta with k=5:

$ python3 assembly.py -r data/test2.fasta -k 5 -o results/test2

----------------------------------------
Welcome to the de Bruijn graph assembler
----------------------------------------

Creating de Bruijn graph with k=5...

Nodes: _tim,g_ti,for_,er_a,ng_t,r_ev,long,nd_e,time,r_an,_lon,g_lo,a_lo,ver_,d_ev,ever,_eve,ng_l,_and,ong_,and_,or_e

Edges:
a_lo: ['_lon']
_lon: ['long', 'long', 'long']
long: ['ong_', 'ong_', 'ong_']
ong_: ['ng_l', 'ng_l', 'ng_t']
ng_l: ['g_lo', 'g_lo']
g_lo: ['_lon', '_lon']
ng_t: ['g_ti']
g_ti: ['_tim']
_tim: ['time']
for_: ['or_e']
or_e: ['r_ev']
r_ev: ['_eve']
_eve: ['ever', 'ever', 'ever']
ever: ['ver_', 'ver_']
ver_: ['er_a', 'er_a']
er_a: ['r_an', 'r_an']
r_an: ['_and', '_and']
_and: ['and_', 'and_']
and_: ['nd_e', 'nd_e']
nd_e: ['d_ev', 'd_ev']
d_ev: ['_eve', '_eve']

Graph Eulerian? False

Starting path 1 / 2 from for_:
for_ever_and_ever_and_ever

Starting path 2 / 2 from a_lo:
a_long_long_long_time

There should be two output files: results/test2_contigs.fasta and results/test2.png with a visualization of the graph. It should look like this:

Here is some output from test3.fasta with k=3:

$ python3 assembly.py -r data/test3.fasta -k 3 -o results/test3

----------------------------------------
Welcome to the de Bruijn graph assembler
----------------------------------------

Creating de Bruijn graph with k=3...

Nodes: TA,AZ,BC,ZT,TB,ZB,CD,CF,ZC,RZ,DF,TT,TX,RT,FR,ZR,FZ

Edges:
AZ: ['ZB', 'ZC']
ZB: ['BC']
BC: ['CD', 'CF']
CD: ['DF', 'DF']
DF: ['FZ', 'FR']
FZ: ['ZR']
ZR: ['RT']
RT: ['TT', 'TA']
TT: ['TB', 'TX']
TB: ['BC']
CF: ['FR']
FR: ['RT', 'RZ']
TA: ['AZ']
ZC: ['CD']
RZ: ['ZT']
ZT: ['TT']

Graph Eulerian? True

Starting path 1 / 1 from AZ:
AZCDFRZTTBCFRTAZBCDFZRTTX

There should be two output files: results/test3_contigs.fasta and results/test3.png with a visualization of the graph. It should look like this:


Step 1: Command line arguments

For this program we’ll start using the optparse library library for parsing command line arguments. Below is some boiler-plate code I use for almost every script I write in Python. It allows you to specify command line arguments with flags so the order of arguments does not matter.

import optparse
import sys

parser = optparse.OptionParser(description='program description')

parser.add_option('-a', '--arg1', type='string', help='helpful message 1')
parser.add_option('-b', '--arg2', type='int',    help='helpful message 2')

(opts, args) = parser.parse_args()

# you can include any subset of arguments as mandatory
# (if an argument is not mandatory, it should have a default)
mandatories = ['arg1', 'arg2']
for m in mandatories:
    if not opts.__dict__[m]:
        print('mandatory option ' + m + ' is missing\n')
        parser.print_help()
        sys.exit()

# later on in your code...
print(opts.arg1, opts.arg2)

For this lab you should have three command line arguments:


Step 2: de Bruijn graph (DBG) data strurcture

In the file debruijn.py you will define your de Bruijn graph data structure. At minimum, you should have a class for the de Bruijn graph (i.e. DBG). The methods below provide some guidance, but as long as you’re implementing a de Bruijn graph you’re welcome to make some reasonable modifications.

Note: you must have some testing in the main function in debruijn.py. I would recommend hard-coding test1.fasta for this part. Use if __name__ == "__main__": to prevent this testing from being executed when you import these classes into assembly.py. Make use of assert to ensure you’re getting the results you expect in your tests.

DBG class

Create a DBG class with the following functionality. Note that the reads should be read in from a FASTA file. All the reads in the FASTA file will be part of the same de Bruijn graph, which will (hopefully) be connected across reads.

import graphviz as gv

Here is an example of a small directed graph:

g = gv.Digraph(format='png')
g.node('A')
g.node('B')
g.edge('A', 'B')
g.render('img/graph') # adds .png to file path

I would recommend using ‘png’ for the format. The final image renders to img/g2.png in this example. (Note that for our purposes we need to create nodes and edges from our own data structures.)

More graphviz examples here. For our application, create a directed graph in this style (with (k-1)-mers as the nodes) so the user can visualize the results.


Step 3: Assembly methods

Inside the DBG class you will also need methods for traversing the graph to create assemblies (in the form of contigs). Here are some suggestions (test each method on test1.fasta).


Step 4: Putting it all together: assembly workflow

You will define your main program in assembly.py. At a high level, your main should parse the command line arguments, build the DBG, then traverse it to create contigs, and finally write these contigs to a file.

Here are the specific details, but feel to add more functionality or print additional information along the way (in particular, you might want to print out the number of balanced and semi-balanced nodes).

Practice files

There are 4 test files to practice on: test1.fasta, test2.fasta, test3.fasta, and genomeX_reads.fasta. Make sure the first 3 work, the fourth is optional (details below).

(Optional: genomeX_reads.fasta) For the fourth one, the reference (original) genome is provided as well: genomeX_reference.fasta. If you do run the fourth example, you can see how closely your assembly matches the original using the online tool BLAST and upload the two FASTA files. In the next few weeks we will see how to do this step (called sequence alignment) ourselves. It is okay if your assembly for this last test case has some issues, the coverage is good but not completely uniform. Experiment with k. In your README, explain which value of k produced the best assembly.

Program Requirements

In addition to the requirements listed above, you should ensure your code satisfies these general guidelines (as well as the style guidelines linked on Piazza):

Submitting your work

Make sure to fill out the README file as well as the results for the first three test cases. The files you should have at the end are (at a minimum):


Extensions (optional)

  1. Create a runtime plot

    This extension ask you to analyze the runtime of building a de Bruijn graph. For this I would recommend using the genomeX reads and vary the number of reads you consider. Then plot the runtime as a function of the number of reads, in runtime.py. I will show the best runtime visualizations in class. Here is a rough sketch of how to use matplotlib for this purpose. I am happy to help get this working further.

import matplotlib.pyplot as plt
x = range(10)
y = [val**2 for val in x]
plt.plot(x,y,'bo-') # make sure x and y are the same length
plt.title("My Plot")
plt.xlabel("x-axis label")
plt.ylabel("y-axis label")
plt.legend(["line 1"])
plt.show()