CSC 334: Topics in Computational Biology

Homework 2: Genome Assembly

Due: Wednesday, Sept. 30, 11:59pm on Moodle

The goal of this assignment is to practice installing and using software that others have written. We'll install the genome assembler Velvet and use it to assemble an example genome. If you don't have a Mac or a Linux system, I recommend using the Mac option on the lab computers in Ford 241.

  1. Download the current version of Velvet from here and unzip the file. Put the resulting folder somewhere that makes sense for you. (Sometimes I create a "Programs" folder and put software there, or you could put it in a folder related to this class.)
  2. Install Velvet. Open up the terminal and move into the Velvet folder (using the "cd" command), e.g.
     cd ~ssheehan/Programs/velvet_1.2.10/ 
    Then run "make", which will compile Velvet and create the "velveth" and "velvetg" programs inside the Velvet folder.
    You might get some warnings; that's alright. If you get errors, send me the error message.
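  3. Now we'll run Velvet on the example data in the "data" folder within the Velvet folder. First create a folder (something like "assembly1") where the output of Velvet will be stored. Then, from within the Velvet folder, run the following two commands (the leading "./" tells the shell to run the programs in the current folder, since "make" does not put them on your PATH):
     ./velveth assembly1/ 21 -shortPaired data/test_reads.fa 
     ./velvetg assembly1/ 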
  4. Using the Velvet Manual as a guide, answer the following questions in a .txt, .pdf, or .doc file. What does the number "21" represent in the command above? What is the length of the reads in "test_reads.fa"? What does "velveth" do? What does "velvetg" do?
  5. Investigate the output in the assembly1 folder. The "contigs.fa" file contains the resulting assembly (the final output of the program). In your file, record the number of contigs and their N50 (the program should print this when it finishes). In this case, the "true" genome is in the data folder: "test_reference.fa".
  6. The last step is to try to get the "best" assembly possible, as measured by N50. Try varying the "k" parameter, storing each assembly in a different folder. Record the number of contigs and the N50 of the assembly for each "k" value. Which value of "k" gave the highest N50?
  7. Choose one of the following options:

    A) Create a procedure for finding the sequence length of the "true" genome ("test_reference.fa"). How does its length compare to the length of your best assembly?

    B) In the "data" folder there is also a file of long reads. Use the Velvet manual to figure out how to use both long and short reads in your assembly. Record your results. Can you get a better N50 using both types of reads?

  8. Save your answers (.txt, .pdf, or .doc file) and submit the file on Moodle.
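
For step 6, retyping the two commands for every "k" gets tedious, so here is a minimal sketch of how you might automate the runs in Python. The function names (velvet_commands, run_sweep) and the "assembly_k<k>" folder names are made up for illustration, and the sketch assumes you run it from inside the Velvet folder so that "./velveth" and "./velvetg" exist. As far as I know Velvet expects odd values of "k" and a default build caps how large "k" can be, so check the manual before choosing your range.

```python
import subprocess


def velvet_commands(k, reads="data/test_reads.fa", velvet_dir="."):
    """Return the velveth and velvetg command lines for one value of k.

    The output folder name assembly_k<k> is a hypothetical convention,
    chosen so that each assembly lands in its own folder.
    """
    out_dir = f"assembly_k{k}"
    return [
        [f"{velvet_dir}/velveth", out_dir, str(k), "-shortPaired", reads],
        [f"{velvet_dir}/velvetg", out_dir],
    ]


def run_sweep(k_values):
    """Run velveth, then velvetg, once for each k in k_values."""
    for k in k_values:
        for command in velvet_commands(k):
            # check=True stops the sweep if either program reports an error
            subprocess.run(command, check=True)
```

Calling run_sweep([17, 19, 21, 23]) would then produce folders assembly_k17 through assembly_k23. You still need to read the N50 and the number of contigs for each run yourself.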
OPTIONAL EXTENSION: Create a script to compute N50 from a "stats.txt" file.
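
If you try the extension, it helps to pin down the definition first. Below is a minimal sketch of N50 computed from a plain list of contig lengths: sort the contigs from longest to shortest and walk down the list until the running total reaches half of the total assembly length; the length of the contig you stop on is the N50. The function name n50 is my own, and the sketch deliberately does not read "stats.txt". Extracting the lengths from that file is the part left to you, and you should check the manual for the units of its length column, which I believe are k-mers rather than base pairs.

```python
def n50(lengths):
    """Return the N50 of a non-empty list of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contig first
    half_total = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        # Stop at the first contig that brings us to half the total length
        if running_total >= half_total:
            return length


# Total length is 54, so half is 27; 10 + 9 + 8 = 27 reaches it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # prints 8
```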