CSC 334: Topics in Computational Biology
Homework 1: String Alignment
Due: Tuesday, Sept. 22, 11:59pm on Moodle
The goal of this assignment is to implement the string alignment algorithm we've been discussing class, also known as
Needleman-Wunsch (variation: Smith-Waterman algorithm).
- Download hw1.py. First complete the method
def match(b1, b2):
which defines the score for pairing base b1 with base b2.
- Now begin the method
def align(s1, s2):
which will print the optimal alignment between string s1 and s2
(i.e. s1 = 'GAGTAC' and s2 = 'GTAGCA'), and return the optimal
score. First set up the dynamic programming (DP) table. What should
the dimensions be for this table? You are welcome to use numpy
arrays, or any other data structure you like.
- Initialize your DP table (first row and column).
- Implement the recursive step of the string alignment algorithm.
Suggestion: print each line of your DP table to make sure it's doing
the right thing or catch bugs.
- Termination step: return the optimal alignment score. In the
testing section of hw1.py, print your score return values for the 4
examples to make sure the "forward" part of your algorithm is working.
- Implement backtracing to find an optimal alignment (for this
assignment you only need to return one alignment). First you might
want to choose a good data structure to hold the backtracing
"arrows". Then put backtracing into the for loop over all the
cells. Finally, use your backtracing data structure to find *an*
optimal alignment. Hint: use a while loop.
- Test your code on the testing examples to make sure its
working. I will test it on some other examples. The final example is
real data from a human gene and a gorilla gene (FOXP2). Save your code as
hw1_first_last.py
(first and last name), and submit it on Moodle.
OPTIONAL EXTENSION: Return *all* optimal alignments.
Note: you can receive full credit for this assignment without doing
the extension.
UPDATE: Biology direction option: instead of implementing
backtracing, run your program on the following three sequences from
three different species (run pairs at a time). From this output, what
tree do you think relates these species? You can put your answer as a
comment in your code, along with the scores for each pair.
species 1: ATTAGATCGTACCACTATAAGTTTAC
species 2: CTTAGATCGTTCCACACATATAC
species 3: ATTAGATCGTACCACATATATTA