CSC 334: Topics in Computational Biology
Homework 2: Genome Assembly
Due: Wednesday, Sept. 30, 11:59pm on Moodle
The goal of this assignment is to practice installing and using
software that others have written. We'll be installing the genome
assembler Velvet, and using it to assemble an example genome. If you
don't have a Mac or a Linux system, I would recommend using the Mac
option on the lab computers in Ford 241.
- Download the current version of Velvet from here and unzip the
file. Put the resulting folder somewhere that makes sense for
you. (Sometimes I create a "Programs" folder and put software there,
or you could put it in a folder related to this class.)
- Install Velvet. Open up the terminal and move into the Velvet
folder (using the "cd" command), e.g.
cd ~ssheehan/Programs/velvet_1.2.10/
Then run "make", which compiles Velvet:
make
You might get some warnings; that's alright. If you get errors, send
me the error message.
- Now we'll run Velvet on the example data in the "data" folder
within the Velvet folder. First create a folder (something like
"assembly1") where the output of Velvet will be stored. Then run the
following two commands:
./velveth assembly1/ 21 -shortPaired data/test_reads.fa
./velvetg assembly1/
- Using the Velvet Manual as a guide,
answer the following questions in a .txt, .pdf, or .doc file. What
does the number "21" represent in the command above? What is the
length of the reads in "test_reads.fa"? What does "velveth" do? What
does "velvetg" do?
- Investigate the output in the assembly1 folder. The
"contigs.fa" file contains the resulting assembly (final output of
the program). In your file, record how many contigs there are and the
N50 of these contigs (the program should output this). In this
case, the "true" genome is in the data folder:
"test_reference.fa".
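
For reference, N50 is commonly defined as the largest length L such that the contigs of length at least L together account for at least half of the total assembled length. The sketch below illustrates that definition on a plain list of lengths. The function name "n50_from_lengths" is made up for this example and is not part of Velvet, and Velvet's reported value may use different length units (see the manual), so treat this as an illustration of the definition rather than a way to reproduce Velvet's number.

```shell
# n50_from_lengths LEN...: print the N50 of the lengths given as arguments.
# Hypothetical helper for illustration; not part of Velvet.
n50_from_lengths() {
    # Sort lengths from longest to shortest, then walk down the list
    # until the running sum reaches half of the total length.
    printf '%s\n' "$@" | sort -nr | awk '
        { len[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (2 * running >= total) { print len[i]; exit }
            }
        }'
}
```

For example, "n50_from_lengths 2 3 4 5 6" prints 5: the total is 20, and the two longest contigs (6 and 5) already cover 11, which is more than half.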
- The last step is to try to get the "best" assembly possible, as
measured by N50. Try varying the "k" parameter, with each
assembly in a different folder. Record the number of contigs and the
N50 of the assembly for each "k" value. Which value of "k" gave the
best N50?
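
If you'd rather not retype the two commands for every "k", one possible approach is a small shell function like the sketch below. The function name "sweep_k" and the "assembly_k<k>" folder names are made up for this example. It assumes you run it from inside the Velvet folder, where "make" leaves the velveth and velvetg programs. Check the Velvet manual for any restrictions on which "k" values are allowed.

```shell
# sweep_k READS K...: run one Velvet assembly per k value, each in its own folder.
# Hypothetical helper; the name and the folder naming scheme are assumptions.
sweep_k() {
    reads=$1
    shift
    for k in "$@"; do
        out="assembly_k${k}"
        mkdir -p "$out"
        # Same two commands as in the earlier step, with k and the folder varied.
        ./velveth "$out/" "$k" -shortPaired "$reads"
        ./velvetg "$out/"
    done
}
```

A call such as "sweep_k data/test_reads.fa 19 21 23 25" would then leave one output folder per "k" value, which you can inspect one by one.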
- Choose one of the following options:
A) Create a procedure for finding the sequence length of the "true" genome
("test_reference.fa"). How does its length compare to the length of
your best assembly?
B) In the "data" folder there is also a file of long reads. Use the
Velvet manual to figure out how to use both long and short reads in
your assembly. Record your results. Can you get a better N50 using both types of reads?
- Save your answers (.txt, .pdf, or .doc file) and submit the file
on Moodle.
OPTIONAL EXTENSION: Create a script to compute N50 from a "stats.txt" file.