CS68 Lab 3: Pairwise Sequence Alignment

Due: Wednesday, February 14 at 11:59pm

Overview

The goals of this week's lab:

Reinforce the general concept of dynamic programming
Apply various algorithms for pairwise sequence alignment on a pair of sample sequences
Connect biology to theory through interpretation of algorithm parameters and results
Implement the Smith-Waterman local alignment algorithm
Analyze new research directions beyond the lecture material

Local alignment with linear gap penalty

Implement a program that uses dynamic programming to compute a pairwise local alignment using a linear gap penalty function. The program should work on protein sequences (you can generalize it to DNA sequences as well, but this is not required).

Getting started

Find your partner and clone your lab03 git repo as usual. You should see following starter files:

README.md - questionnaire to be completed for your final submission
local_align.py - the main program for implementing your sequence alignment algorithm
data/ directory - sample sequences to align including one toy example and 3 gene pairs
blosum-62 - contains values for the BLOSUM-62 matrix to score substitutions

You are welcome to add additional python files as for importing into local_align.py.

Usage

Your program should take in 5 command-line arguments using the optparse library described in Lab 2:

The name of a file containing the first sequence to align -a
The name of a file containing the second sequence to align -b
The name of a file containing the substitution matrix -m
The gap penalty -g
The name of the output file to save the alignment -o

For example:

python3 local_align.py -a data/test1.fasta -b data/test2.fasta -m blosum-62 -g -3 -o test_align.txt

Using the template from last time, make sure the user is informed if they misspecify arguments or do not include required arguments. You should also error check all parameters to ensure that the files exist and the gap penalty is negative. You could also have a default (i.e. -1 or -2) for the gap penalty if the user doesn't specify it.

The steps below do not have to be done in any specific order (see example outputs for an idea of the end goal), but here is a suggestion of how to proceed. Make sure to test and debug each step, as well as keep in mind good function separation and top down design. There is less structure for this assignment and more freedom in how you want to implement your program.

Reading in the substitution matrix

Reading in the substitution matrix will require some careful thought and testing. There are several options:

One method is to simply store the values as a 20 by 20 two-dimensional array of integers and then have a separate function that maps amino acid characters to an index.
Another technique is to use a two-dimensional dictionary. Each row in the file will translate into a dictionary. They key is the character at the beginning of the line. The value it maps to is itself a dictionary containing all of the (column, value) pairs for that row. If I do this in python for just the first two rows and columns and print the resulting dictionary of dictionaries, I get the following value:

{'A': {'A': 4, 'R': -1}, 'R': {'A': -1, 'R': 5}}

The first pair in the dictionary has a key 'A' and value which itself is a dictionary {'A': 4, 'R': -1}. If I want the substitution value for replacing an 'A' with an 'R', I simple call:

>>> print(m['A']['R']
-1

where m is the dictionary structure I stored the values in. It's a bit more complicated to think about, but easier to use since you will not need to map the amino acids to integers.

A third approach is to stick with the two-dimensional array, but place it in a class SubstitutionMatrix such that you can create an object and call a get_score method that does all of the mapping for you.
A final option is to still use a 2D array, but make it a numpy array. We'll be using numpy later on this this course, and this is a good opportunity to get some practice with numpy. Here is an example of creating a 2D array of zeros:

python3
>>> import numpy as np
>>> m = np.zeros([4,5])
>>> m
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
>>> m[2][1] = 7
>>> m
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  7.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
>>> m[:,1] # slice the one-th column of m
array([ 0.,  0.,  7.,  0.])
>>> m[2] # get the two-th row of m
array([ 0.,  7.,  0.,  0.,  0.])

Any of these techniques are acceptable, but your code should be readable and well designed.

Creating the Dynamic Programming (DP) table

For the DP table, there are also several ways to approach the data structure design. You could use a 2D array (list-of-lists), or a numpy array similar to the example above. Make sure to initialize the "base case" (0th row and column) to 0's. (If you're already using a numpy array of zeros though, you don't need to do this part.) Think about the dimensions of this array, and make sure to have x along the rows (left-hand-side), and y along the columns (right-hand-side).

If you are doing the 2D array option (not in numpy), make sure to keep the following issue in mind. The code below will not work for initializing a 2D array!

>>> m = [[0]*5]*4
>>> m
[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
>>> m[2][1] = 7
>>> m
[[0, 7, 0, 0, 0], [0, 7, 0, 0, 0], [0, 7, 0, 0, 0], [0, 7, 0, 0, 0]]

Back-tracing to obtain the alignment(s)

For this program, we will obtain local alignments starting from the highest score. This will include multiple instances of the highest score, but NOT all possible tracebacks from each highest score.

To be consistent, and obtain the same output as I have below, break ties using the "high road" protocol where I will go "up" first, then "diagonal", then "left" (in the event of the maximum being the same for multiple paths). Also make sure to end alignments whenever you see a cell with a zero. To summarize the "high road" protocol:

UP: insert gap in y (i.e., output character in first sequence x, gap in second sequence y)
DIAG: match the characters in x and y
LEFT: insert gap in x (i.e., output character in second sequence y, gap in first sequence x)

If there are multiple substrings with the same maximum score, output them all. You can worry about this requirement after you get your program working for just one maximum.

Your data structure for back-tracing will need to work with your dynamic programming table. It is possible to combine these two data structures or leave them separate. Regardless, after building the table and recording where you "came from" for each cell, you'll need to potentially traceback from several different cells. Think about how to orchestrate these different pieces. Classes may be helpful but are not necesssary.

Input data

I have provided several pairs of sample files in your directory:

test - a pair of small proteins in test1.fasta and test2.fasta
calcium - a pair of related calcium ion proteins - human calmodulin and calpain
kinases - a pair of yeast kinase proteins - STE7 and MKK1
FOXP2 - FOXP2 genes from Human, Chimp, and Gorilla (see end of lab for FOXP2 analysis)

Input files will follow a similar format the FASTA format where the first line is a comment describing the sequence, and then each line after that is part of the sequence.

Output

At a minimum, your output should contain:

The maximal alignment score.
The optimal alignments of one subsequence to the other, including gaps. You should break ties within one alignment as listed above. If there is a tie for optimal score at different locations in the score matrix, you should output them all on separate lines.
The sequence identity of the alignment (number of matching characters over the total length of the alignment), as well as the number of gaps (see output below).
The (1-based, inclusive) index of the substrings making up the alignment from the original sequences. That is, the location of the corresponding amino acids (not counting gaps), for each sequence. As an example, the alignment for MSTYG to AXSTXYA results in:

    2 ST-Y 4
      || |
    3 STXY 6

indicating the first sequence, characters 2 to 4 inclusive, align with the second sequence, characters 3 through 6 inclusive. The alignment should be written to the given output file, with no formatting necessary, 1 sequence per line. For example, your file for the above alignment should just have:

ST-Y
STXY

Do not write scores and indices to the file.

Sample output

Below our the outputs from the provided examples. The standard output of the alignment includes outputting 50 characters per line for each sequence, with the start and ending character indices indicated as well as markers to indicate perfect alignments ('|') and substitutions ('.'). Match this output as closely as possible, but you are welcome to include additional information. Below are the outputs for several of the examples:

Test Example, gap penalty of -3: Printed output, File output
Calcium pair, gap penalty of -7: Printed output, File output
Kinase proteins, gap penalty of -5: Printed output, File output

Additional requirements

Your program should follow good design practices - effective top-down design, modularity, commenting for functions and non-trivial code, etc.
You may submit as many additional files as you wish. For example, any classes you define can be defined in a separate file. But the main program should be in local_align.py
Practice defensive programming; e.g., did the user provide enough arguments on the command line? Do the files exist? You do not need to check if the contents are correct.
Do not interact with the user for your program - if an incorrect filename is given, simply exit the program with a message; do not prompt for a new file.
Clean up any debugging print statements. It will be useful to print out matrices when debugging, but be sure to remove this when handing in.

Extensions (optional)

There are several ways to extend this assignment. I would recommend both of the options below since they are not too much work and make connections with both past and future material.

Create a option for aligning DNA vs. Protein. For DNA you can use a simplified mutation matrix (i.e. +1 for match, -1 for mismatch). Then apply your method to genomeX_reference.fasta and your output contigs from Lab 2. See how similar your alignment output is to BLAST and if it is faster or slower :)
Create a visualization of your alignment. The one below uses a heatmap in matplotlib to show areas of high and low alignment scores. It would also be interesting to create a visualization that looks like the tables we've been creating in class (with numbers in each box, arrows, tracing back, etc). One way to do this is to use plt.text(x, y, s) where (x,y) represents the location of the text and s is the string of text. Here is an example of how to create a heatmap:

import matplotlib.pyplot as plt
...
# create dynamic programming table (2D array)
...
heatmap = plt.imshow(my_2D_array, cmap='jet', interpolation='nearest')
plt.colorbar(heatmap)
plt.show()

There are many awesome color schemes to choose from for the cmap parameter. The one above uses "jet".

FOXP2 analysis and submitting your work

The last part of the lab is to run your algorithm on the FOXP2 genes for Species A, Species B, and Species C, which represent Human, Chimp, and Gorilla in some order. FOXP2 is an important gene associated with speech and language, and one amino acid substitution in humans is not shared by any other species. Since right now we have only done pairwise alignment, run your method on all three pairs of these species (with g=-7). Then record your results in the README file and answer a few questions about how to interpret these results biologically.

Make sure to push your code often, only the final version will be submitted. If you create plots, include them as well, but you do not need to push other output files (unless you want to).

Credit: modified from lab by Ameet Soni