CS68 Lab 2: Genome Assembly

Due: Wednesday, February 7 at 11:59pm


Overview

In this lab you will implement your own de Bruijn graph assembler and apply it to a realistic (although synthetic) dataset. The goals of this lab are to develop a deeper understanding of de Bruijn graphs, gain experience going from theory to application, and understand the limitations of our current assembly tools.

Your program should be broken into two files. The first, debruijn.py will contain class definitions for your de Bruijn graph data structure. The second, assembly.py will be your main program that will orchestrate building, traversing, and interpreting the de Bruijn graph.


Getting Started

Find your partner and clone your lab02 git repo as usual. You should see the starter program files, as well as some example datasets. The first step is to construct a de Bruijn graph for a set of reads (in the FASTA file format, same as Lab 1). After creating the de Bruijn graph, we will traverse it to create contigs, which can then be written to a file. Some examples of the program execution are shown below to help you get an idea of what your program should do. They are explained in more detail below.

Here is some output from test1.fasta with k=3:

$ python3 assembly.py -r data/test1.fasta -k 3
----------------------------------------
Welcome to the de Bruijn graph assembler
----------------------------------------

Creating de Bruijn graph with k=3...DONE

Display text version of graph (y/n): y

Nodes: AA,AB,BB,BA

Edges:
AA: ['AA', 'AB']
AB: ['BB']
BB: ['BB', 'BB', 'BA']

Write graph visualization to file (y/n): y
Enter filename prefix: figs/test1

Determine whether graph is Eulerian (y/n): y
Graph Eulerian? True

Traverse graph and find contigs (y/n): y
Write contigs to file (y/n): y
Enter contig filename: results/test1_contigs.fasta

Starting path 1/1 from AA:
>contig_0
AAABBBBA

In the file figs/test1.png there is a visualization of the graph. It should look like this:

Here is some output from test2.fasta with k=5:

$ python3 assembly.py -r data/test2.fasta -k 5
----------------------------------------
Welcome to the de Bruijn graph assembler
----------------------------------------

Creating de Bruijn graph with k=5...DONE

Display text version of graph (y/n): y

Nodes: a_lo,_lon,long,ong_,ng_l,g_lo,ng_t,g_ti,_tim,time,for_,or_e,r_ev,_eve,ever,ver_,er_a,r_an,_and,and_,nd_e,d_ev

Edges:
a_lo: ['_lon']
_lon: ['long', 'long', 'long']
long: ['ong_', 'ong_', 'ong_']
ong_: ['ng_l', 'ng_l', 'ng_t']
ng_l: ['g_lo', 'g_lo']
g_lo: ['_lon', '_lon']
ng_t: ['g_ti']
g_ti: ['_tim']
_tim: ['time']
for_: ['or_e']
or_e: ['r_ev']
r_ev: ['_eve']
_eve: ['ever', 'ever', 'ever']
ever: ['ver_', 'ver_']
ver_: ['er_a', 'er_a']
er_a: ['r_an', 'r_an']
r_an: ['_and', '_and']
_and: ['and_', 'and_']
and_: ['nd_e', 'nd_e']
nd_e: ['d_ev', 'd_ev']
d_ev: ['_eve', '_eve']

Write graph visualization to file (y/n): y
Enter filename prefix: figs/test2

Determine whether graph is Eulerian (y/n): y
Graph Eulerian? False

Traverse graph and find contigs (y/n): y
Write contigs to file (y/n): y
Enter contig filename: results/test2_contigs.fasta

Starting path 1/2 from a_lo:
>contig_0
a_long_long_long_time

Starting path 2/2 from for_:
>contig_1
for_ever_and_ever_and_ever

In the file figs/test2.png there is a visualization of the graph. It should look like this:


de Bruijn graph (DBG) assembly

In the file debruijn.py you will define your de Bruijn graph data structure. At minimum, you should have a class for the de Bruijn graph (i.e. DBG) and a Node class for the (k-1)-mer nodes. The methods below provide some guidance, but as long as you're implementing a de Bruijn graph you're welcome to make some reasonable modifications.

Note: you must have some testing in the main function in debruijn.py. I would recommend using test1.fasta for this part. Use if __name__ == "__main__": to prevent this testing from being executed when you import these classes into assembly.py. Make use of assert to ensure you're getting the results you expect in your tests.

Node and DBG classes

First create a Node class (this will be used as part of your DBG class). You are welcome to add more to this class, but at a minimum it should include:

In conjunction with your Node class, create a DBG class with the following functionality. Note that the reads should be read in from a FASTA file. All the reads in the FASTA file will be part of the same de Bruijn graph, which will (hopefully) be connected across reads.

import graphviz as gv

Here is an example of a small directed graph:

g2 = gv.Digraph(format='svg')
g2.node('A')
g2.node('B')
g2.edge('A', 'B')
g2.render('img/g2')

I would recommend using 'png' for the format (you don't have to push the graphs though, so whatever you prefer is fine). The final image renders to img/g2.svg in this example.

More graphviz examples here. For our application, create a directed graph in this style (with (k-1)-mers as the nodes) so the user can visualize the results.

Main program

You will define your main program in assembly.py. At a high level, your main should parse the command line arguments and then allow the user to choose the workflow for their DBG assembly. Below are some details about each part.

Parsing Command Line Arguments

For this program we'll start using the optparse library library for parsing command line arguments. Below is some boiler-plate code I use for almost every script I write in Python. It allows you to specify command line arguments with flags so the order of arguments does not matter.

import optparse
import sys

parser = optparse.OptionParser(description='program description')

parser.add_option('-a', '--arg1', type='string', help='helpful message 1')
parser.add_option('-b', '--arg2', type='int',    help='helpful message 2')

(opts, args) = parser.parse_args()

# you can include any subset of arguments as mandatory
# (if an argument is not mandatory, it should have a default)
mandatories = ['arg1', 'arg2']
for m in mandatories:
    if not opts.__dict__[m]:
        print('mandatory option ' + m + ' is missing\n')
        parser.print_help()
        sys.exit()

# later on in your code...
print(opts.arg1, opts.arg2)

For this program you should have at least -r for the FASTA file of reads and -k for the k-mer length. If you include other arguments, make sure to document them. After getting the command line arguments, you can use them with opts.arg1, etc. Make sure to include more helpful names than arg1 and arg2.

Assembly Workflow

After parsing the arguments and building the de Bruijn graph, ask the user several questions about what they would like to do as the graph is analyzed and traversed. For all of these questions, y means yes and anything else means no (so no need to do extensive checking of user input). The reason we don't do all these steps automatically is that for large graphs, it may not be desirable to print or visualize the entire graph. And for small graphs, we may not feel the need to write the contigs to a file explicitly. Note that these steps are only offered once per execution of the program (no need to have a loop unless you want to).

Here are the specific details, but feel to add more functionality or print additional information along the way (in particular, you might want to print out the number of balanced and semi-balanced nodes).


Requirements and Tips

There are 3 test files to practice on: test1.fasta, test2.fasta, and genomeX_reads.fasta. For the third one, the reference (original) genome is provided as well: genomeX_reference.fasta. To see how closely your assembly matches the original, you can use the online tool BLAST and upload the two FASTA files. In the next few weeks we will see how to do this step (called sequence alignment) ourselves.

It is okay if your assembly for the last test case has some issues, the coverage is good but not completely uniform. Experiment with k. In your README, explain which value of k produced the best assembly (there are a few questions to answer about your best assembly). You should also push the contigs for your best k value for this example.

Program Requirements

In addition to the requirements listed above, you should ensure your code satisfies these general guidelines:


Extensions (optional)

  1. Create a runtime plot

This extension ask you to analyze the runtime of building a de Bruijn graph. For this I would recommend using the E.coli reads and vary the number of reads you consider. Then plot the runtime as a function of the number of reads, in runtime.py. I will show the best runtime visualizations in class. Here is a rough sketch of how to use matplotlib for this purpose. I am happy to help get this working further.

import matplotlib.pyplot as plt
x = range(10)
y = [val**2 for val in x]
plt.plot(x,y,'bo-') # make sure x and y are the same length
plt.title("My Plot")
plt.xlabel("x-axis label")
plt.ylabel("y-axis label")
plt.legend(["line 1"])
plt.show()
  1. Assemble the E.coli genome

I have also provided a set of reads from the E.coli genome from last time: /home/smathieson/public/cs68/ecoli_reads.fasta. The goal is to assemble them into the best possible contigs (the reference is available to you so you can check your solution with BLAST as you go). Any algorithm is fine as long as it is something you come up with (i.e. you cannot just run the reads through Velvet!) Feel free to use literature such as the Velvet paper for inspiration, just not any one else's code (that applies for the rest of the lab as well). If you attempt this part, include your solution in challenge.py, which could also make use of data structures (new or ones you had before) in debruijn.py.

For the challenge, make it clear in your README how I should run your code, which parameters, etc. But don't push your contigs since they will likely be quite large. There will be a prize for the group with the best assembly :)

Submitting your work

Make sure to fill out the README file (including some questions about your best choice for k for the third test case). Make sure to also push your best assembly for genomeX. The files you should have at the end are: