CS68 Lab 3: Pairwise Sequence Alignment

Due: Wednesday, February 14 at 11:59pm


Overview

The goals of this week's lab:


Local alignment with linear gap penalty

Implement a program that uses dynamic programming to compute a pairwise local alignment using a linear gap penalty function. The program should work on protein sequences (you can generalize it to DNA sequences as well, but this is not required).

Getting started

Find your partner and clone your lab03 git repo as usual. You should see following starter files:

You are welcome to add additional python files as for importing into local_align.py.

Usage

Your program should take in 5 command-line arguments using the optparse library described in Lab 2:

  1. The name of a file containing the first sequence to align -a
  2. The name of a file containing the second sequence to align -b
  3. The name of a file containing the substitution matrix -m
  4. The gap penalty -g
  5. The name of the output file to save the alignment -o

For example:

python3 local_align.py -a data/test1.fasta -b data/test2.fasta -m blosum-62 -g -3 -o test_align.txt

Using the template from last time, make sure the user is informed if they misspecify arguments or do not include required arguments. You should also error check all parameters to ensure that the files exist and the gap penalty is negative. You could also have a default (i.e. -1 or -2) for the gap penalty if the user doesn't specify it.

The steps below do not have to be done in any specific order (see example outputs for an idea of the end goal), but here is a suggestion of how to proceed. Make sure to test and debug each step, as well as keep in mind good function separation and top down design. There is less structure for this assignment and more freedom in how you want to implement your program.

Reading in the substitution matrix

Reading in the substitution matrix will require some careful thought and testing. There are several options:

  1. One method is to simply store the values as a 20 by 20 two-dimensional array of integers and then have a separate function that maps amino acid characters to an index.

  2. Another technique is to use a two-dimensional dictionary. Each row in the file will translate into a dictionary. They key is the character at the beginning of the line. The value it maps to is itself a dictionary containing all of the (column, value) pairs for that row. If I do this in python for just the first two rows and columns and print the resulting dictionary of dictionaries, I get the following value:

{'A': {'A': 4, 'R': -1}, 'R': {'A': -1, 'R': 5}}

The first pair in the dictionary has a key 'A' and value which itself is a dictionary {'A': 4, 'R': -1}. If I want the substitution value for replacing an 'A' with an 'R', I simple call:

>>> print(m['A']['R']
-1

where m is the dictionary structure I stored the values in. It's a bit more complicated to think about, but easier to use since you will not need to map the amino acids to integers.

  1. A third approach is to stick with the two-dimensional array, but place it in a class SubstitutionMatrix such that you can create an object and call a get_score method that does all of the mapping for you.

  2. A final option is to still use a 2D array, but make it a numpy array. We'll be using numpy later on this this course, and this is a good opportunity to get some practice with numpy. Here is an example of creating a 2D array of zeros:

python3
>>> import numpy as np
>>> m = np.zeros([4,5])
>>> m
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
>>> m[2][1] = 7
>>> m
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  7.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
>>> m[:,1] # slice the one-th column of m
array([ 0.,  0.,  7.,  0.])
>>> m[2] # get the two-th row of m
array([ 0.,  7.,  0.,  0.,  0.])

Any of these techniques are acceptable, but your code should be readable and well designed.

Creating the Dynamic Programming (DP) table

For the DP table, there are also several ways to approach the data structure design. You could use a 2D array (list-of-lists), or a numpy array similar to the example above. Make sure to initialize the "base case" (0th row and column) to 0's. (If you're already using a numpy array of zeros though, you don't need to do this part.) Think about the dimensions of this array, and make sure to have x along the rows (left-hand-side), and y along the columns (right-hand-side).

If you are doing the 2D array option (not in numpy), make sure to keep the following issue in mind. The code below will not work for initializing a 2D array!

>>> m = [[0]*5]*4
>>> m
[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
>>> m[2][1] = 7
>>> m
[[0, 7, 0, 0, 0], [0, 7, 0, 0, 0], [0, 7, 0, 0, 0], [0, 7, 0, 0, 0]]

Back-tracing to obtain the alignment(s)

For this program, we will obtain local alignments starting from the highest score. This will include multiple instances of the highest score, but NOT all possible tracebacks from each highest score.

To be consistent, and obtain the same output as I have below, break ties using the "high road" protocol where I will go "up" first, then "diagonal", then "left" (in the event of the maximum being the same for multiple paths). Also make sure to end alignments whenever you see a cell with a zero. To summarize the "high road" protocol:

  1. UP: insert gap in y (i.e., output character in first sequence x, gap in second sequence y)
  2. DIAG: match the characters in x and y
  3. LEFT: insert gap in x (i.e., output character in second sequence y, gap in first sequence x)

If there are multiple substrings with the same maximum score, output them all. You can worry about this requirement after you get your program working for just one maximum.

Your data structure for back-tracing will need to work with your dynamic programming table. It is possible to combine these two data structures or leave them separate. Regardless, after building the table and recording where you "came from" for each cell, you'll need to potentially traceback from several different cells. Think about how to orchestrate these different pieces. Classes may be helpful but are not necesssary.

Input data

I have provided several pairs of sample files in your directory:

Input files will follow a similar format the FASTA format where the first line is a comment describing the sequence, and then each line after that is part of the sequence.

Output

At a minimum, your output should contain:

    2 ST-Y 4
      || |
    3 STXY 6

indicating the first sequence, characters 2 to 4 inclusive, align with the second sequence, characters 3 through 6 inclusive. The alignment should be written to the given output file, with no formatting necessary, 1 sequence per line. For example, your file for the above alignment should just have:

ST-Y
STXY

Do not write scores and indices to the file.

Sample output

Below our the outputs from the provided examples. The standard output of the alignment includes outputting 50 characters per line for each sequence, with the start and ending character indices indicated as well as markers to indicate perfect alignments ('|') and substitutions ('.'). Match this output as closely as possible, but you are welcome to include additional information. Below are the outputs for several of the examples:

Additional requirements

Extensions (optional)

There are several ways to extend this assignment. I would recommend both of the options below since they are not too much work and make connections with both past and future material.

  1. Create a option for aligning DNA vs. Protein. For DNA you can use a simplified mutation matrix (i.e. +1 for match, -1 for mismatch). Then apply your method to genomeX_reference.fasta and your output contigs from Lab 2. See how similar your alignment output is to BLAST and if it is faster or slower :)

  2. Create a visualization of your alignment. The one below uses a heatmap in matplotlib to show areas of high and low alignment scores. It would also be interesting to create a visualization that looks like the tables we've been creating in class (with numbers in each box, arrows, tracing back, etc). One way to do this is to use plt.text(x, y, s) where (x,y) represents the location of the text and s is the string of text. Here is an example of how to create a heatmap:

import matplotlib.pyplot as plt
...
# create dynamic programming table (2D array)
...
heatmap = plt.imshow(my_2D_array, cmap='jet', interpolation='nearest')
plt.colorbar(heatmap)
plt.show()

There are many awesome color schemes to choose from for the cmap parameter. The one above uses "jet".

FOXP2 analysis and submitting your work

The last part of the lab is to run your algorithm on the FOXP2 genes for Species A, Species B, and Species C, which represent Human, Chimp, and Gorilla in some order. FOXP2 is an important gene associated with speech and language, and one amino acid substitution in humans is not shared by any other species. Since right now we have only done pairwise alignment, run your method on all three pairs of these species (with g=-7). Then record your results in the README file and answer a few questions about how to interpret these results biologically.

Make sure to push your code often, only the final version will be submitted. If you create plots, include them as well, but you do not need to push other output files (unless you want to).

Credit: modified from lab by Ameet Soni