CSC 334: Topics in Computational Biology

Homework 1: String Alignment

Due: Tuesday, Sept. 22, 11:59pm on Moodle

The goal of this assignment is to implement the string alignment algorithm we've been discussing class, also known as Needleman-Wunsch (variation: Smith-Waterman algorithm).

  1. Download hw1.py. First complete the method
    def match(b1, b2): 
  2. which defines the score for pairing base b1 with base b2.

  3. Now begin the method
    def align(s1, s2): 
    which will print the optimal alignment between string s1 and s2 (i.e. s1 = 'GAGTAC' and s2 = 'GTAGCA'), and return the optimal score. First set up the dynamic programming (DP) table. What should the dimensions be for this table? You are welcome to use numpy arrays, or any other data structure you like.
  4. Initialize your DP table (first row and column).
  5. Implement the recursive step of the string alignment algorithm. Suggestion: print each line of your DP table to make sure it's doing the right thing or catch bugs.
  6. Termination step: return the optimal alignment score. In the testing section of hw1.py, print your score return values for the 4 examples to make sure the "forward" part of your algorithm is working.
  7. Implement backtracing to find an optimal alignment (for this assignment you only need to return one alignment). First you might want to choose a good data structure to hold the backtracing "arrows". Then put backtracing into the for loop over all the cells. Finally, use your backtracing data structure to find *an* optimal alignment. Hint: use a while loop.
  8. Test your code on the testing examples to make sure its working. I will test it on some other examples. The final example is real data from a human gene and a gorilla gene (FOXP2). Save your code as
     hw1_first_last.py 
    (first and last name), and submit it on Moodle.
OPTIONAL EXTENSION: Return *all* optimal alignments.

Note: you can receive full credit for this assignment without doing the extension.

UPDATE: Biology direction option: instead of implementing backtracing, run your program on the following three sequences from three different species (run pairs at a time). From this output, what tree do you think relates these species? You can put your answer as a comment in your code, along with the scores for each pair.

species 1: ATTAGATCGTACCACTATAAGTTTAC
species 2: CTTAGATCGTTCCACACATATAC
species 3: ATTAGATCGTACCACATATATTA