Lab 10
Due: Friday, Apr. 21 at 11:59pm on Moodle
For this lab, first find your randomly assigned partner and introduce yourselves. The person whose first name comes first alphabetically should begin as the "driver", with the other partner as the "navigator". The driver will have the code open, and the navigator will have these instructions open.
At the end of the lab, email the code (finished or not) and transcript to the person who started as the navigator. If you do not finish during lab you have two options:
(1) Arrange to meet before Friday and finish the lab together.
(2) Continue the code separately and denote the part you did on your own with a comment.
Note: it is not an option for one person to complete the code on their own and then send the finished code to their partner to submit. Any code that you submit should be either written by you, or written by you and your partner while you were pair programming. Both partners should submit their code on Moodle.
The steps below will help you develop a program to compute pairwise differences.
First create a new class called Genome. This class will represent the information for the genome of a single species. Each instance of this class should know two things: its species name (a string) and its DNA sequence (also a string). Based on this information, write the constructor for this class (it will be a lot shorter than our graphics constructors, so no need to do anything complicated!). In addition to the constructor, write two getter methods, one for each instance variable:
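As a rough illustration of the shape described above, here is a minimal sketch. The attribute names, the getter names get_name and get_sequence, and the example values are all assumptions made for illustration, not names required by the lab.

```python
class Genome:
    """Holds the species name and DNA sequence for one species."""

    def __init__(self, name, sequence):
        # Both instance variables are plain strings.
        self.name = name
        self.sequence = sequence

    def get_name(self):
        # Getter for the species name (hypothetical method name).
        return self.name

    def get_sequence(self):
        # Getter for the DNA sequence (hypothetical method name).
        return self.sequence


# Made-up values, not taken from the lab's data file.
example = Genome("species_A", "ACGT")
print(example.get_name(), example.get_sequence())
```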
After you have written the constructor, start your main function. The first step in main will be to read the file of DNA information and construct a list of Genomes. This is very similar to Homework 9 with the fish, except that we will need to read a file as well. At the end of this step, you should have a list of Genomes (just one list, not two separate lists of the names and the sequences). Store the length of this list as a variable, and print the variable to make sure it is equal to 8.
Notes: when viewing the file, it looks like the name and the sequence are on separate lines, but they are actually on the same line. Make sure to use split(..) to separate the name from the sequence. The loop you use to read the file should be the same loop where you call the Genome constructor.
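One possible shape for this reading loop is sketched below. It assumes the name and the sequence on each line are separated by whitespace (the real separator may differ, so check the file), and it uses io.StringIO with made-up data in place of the actual file so that it runs on its own. The function name read_genomes is hypothetical.

```python
import io


class Genome:
    # Minimal stand-in for the class written in the previous step.
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence


def read_genomes(handle):
    """Build one list of Genome objects from an open file-like object."""
    genomes = []
    for line in handle:
        parts = line.split()  # assumption: whitespace between name and sequence
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        # The constructor is called inside the same loop that reads the file.
        genomes.append(Genome(parts[0], parts[1]))
    return genomes


# Made-up two-line "file"; with the real data you would pass the result of open(...).
fake_file = io.StringIO("species_A ACGT\nspecies_B ACGA\n")
genomes = read_genomes(fake_file)
num_genomes = len(genomes)
print(num_genomes)  # 2 for this made-up data
```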
After this step you should be able to do the following test:
Next we will write a key method in the Genome class: differences(..). This method will take in one non-self parameter, another instance of the Genome class. The goal of this method is to determine the number of differences between the sequence of the given genome (represented by "self") and the sequence of another genome (often called "other").
This code will be very similar to Lab 3, Part E, except without dealing with missing data. Make sure to return the number of differences at the end of the method. At the end of this step, you should be able to perform the following tests in main, with results 2 and 10:
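The counting idea can be sketched as follows. This is only one way to write it, and the sequences are made up, so the printed count is not one of the lab's expected results.

```python
class Genome:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

    def differences(self, other):
        """Count positions where self and other have different letters."""
        count = 0
        for i in range(len(self.sequence)):
            if self.sequence[i] != other.sequence[i]:
                count += 1
        return count


# Made-up sequences that differ at positions 0, 3, and 5.
a = Genome("species_A", "ACGTAC")
b = Genome("species_B", "TCGAAT")
print(a.differences(b))  # 3
```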
To provide a check on the input, first check that the sequences are of the same length. One way to do this is to use assert. assert is a keyword, and the expression after assert should evaluate to a boolean. If the boolean is True, the assertion holds and nothing happens. If the boolean is False, the assertion fails and Python raises an AssertionError. This is a useful debugging tool because it checks the input before trying out a more complex computation. Try out the example in the shell below:
Incorporate an assertion into your method to check that the sequences are the same length.
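To see both outcomes of an assertion, you could try something like the sketch below. The helper name check_same_length and the message text are assumptions for illustration; in the lab the assertion would go inside your differences method instead.

```python
def check_same_length(seq1, seq2):
    # The optional message after the comma is shown if the assertion fails.
    assert len(seq1) == len(seq2), "sequences must be the same length"
    return True


# This assertion holds, so nothing happens.
assert 2 + 2 == 4

# Same length: the assertion holds and the function returns normally.
check_same_length("ACGT", "TGCA")

# Different lengths: the assertion fails with an AssertionError.
try:
    check_same_length("ACGT", "ACG")
    failed = False
except AssertionError as error:
    failed = True
    print("assertion failed:", error)
```

The try/except is only there so the sketch keeps running after the failure; in the shell you can simply type the failing assert and read the error.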
Now we are ready to use this method on all pairs of DNA sequences. We will store the results of these comparisons in a 2D list, or "list of lists", or matrix. We can think about this data structure as an Excel spreadsheet or another type of grid of information. The entry in the ith row and the jth column will be the number of differences between the ith species and the jth species.
In the shell, experiment with the following code, which uses a loop to build a matrix out of 4 lists of 4 zeros each:
Now change one of the entries of the matrix:
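A minimal sketch of this kind of experiment is below; the variable names and the entry being changed are arbitrary choices. The second half is an extra contrast, not part of the lab instructions: it shows a well-known Python pitfall where repeating a row with * makes every row the same list.

```python
# Build a 4x4 matrix of zeros, appending one new row per loop iteration.
matrix = []
for i in range(4):
    matrix.append([0] * 4)
print(matrix)

# Change the entry in row 1, column 2. Only that one entry changes.
matrix[1][2] = 7
print(matrix)

# Contrast: here all four rows refer to the SAME inner list,
# so changing one entry appears to change every row.
aliased = [[0] * 4] * 4
aliased[1][2] = 7
print(aliased)
```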
Does the result make sense? To practice classes and encapsulating data, we will make a Matrix class. This class should contain the following methods:
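One possible outline for such a class is sketched below. Only set_entry(...) is named elsewhere in these instructions; the constructor arguments, the attribute name rows, and get_entry are assumptions made for illustration.

```python
class Matrix:
    """A grid of numbers stored as a list of lists."""

    def __init__(self, num_rows, num_cols):
        # Start with every entry equal to zero; each row is its own list.
        self.rows = []
        for i in range(num_rows):
            self.rows.append([0] * num_cols)

    def get_entry(self, i, j):
        # Hypothetical getter: the value in row i, column j.
        return self.rows[i][j]

    def set_entry(self, i, j, value):
        # Store value in row i, column j.
        self.rows[i][j] = value


grid = Matrix(2, 3)
grid.set_entry(0, 1, 5)
print(grid.get_entry(0, 1))  # 5
print(grid.rows)             # [[0, 5, 0], [0, 0, 0]]
```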
Now we are ready to fill the matrix with non-zero numbers. In main, use a nested for-loop to iterate twice over the list of genomes. Use the two for-loop indices to obtain a pair of genomes. Then call the differences method, using one genome as the instance and the other as the parameter. After obtaining the number of differences from this method, modify the matrix using the set_entry(...) method to store the result.
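The nested loop described above might look roughly like the sketch below. It repeats stripped-down versions of the two classes so that it runs on its own; the square Matrix constructor, get_entry, and the three made-up genomes are assumptions, and your own classes and the real data will differ.

```python
class Genome:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

    def differences(self, other):
        assert len(self.sequence) == len(other.sequence)
        count = 0
        for i in range(len(self.sequence)):
            if self.sequence[i] != other.sequence[i]:
                count += 1
        return count


class Matrix:
    def __init__(self, size):
        # Square matrix of zeros (assumed constructor signature).
        self.rows = [[0] * size for _ in range(size)]

    def get_entry(self, i, j):
        return self.rows[i][j]

    def set_entry(self, i, j, value):
        self.rows[i][j] = value


# Made-up data standing in for the list read from the file.
genomes = [
    Genome("species_A", "ACGT"),
    Genome("species_B", "ACGA"),
    Genome("species_C", "TCGA"),
]

n = len(genomes)
matrix = Matrix(n)
for i in range(n):
    for j in range(n):
        # genomes[i] is the instance, genomes[j] is the parameter.
        diff = genomes[i].differences(genomes[j])
        matrix.set_entry(i, j, diff)

print(matrix.rows)
```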
When you are done with this step you should see a completed matrix. What are the numbers down the diagonal of this matrix? Does that make sense? Here is what you should get for the first row of this matrix:
To make our result a bit more interpretable, we will print the matrix. There are two options for this part:
Edit: it is now optional to include the species names in the printing. So you can print only the numerical data (one row at a time, even if it has commas and no tabs).
First print each species name, using the name getter from your Genome class. Between each name, use a tab character (special character "\t"). This will keep all the numbers aligned.
Then for each row, print the species name first, followed by the elements of the matrix, also including tabs between them so everything is aligned.
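The tab-separated layout described above can be sketched as follows. The function name format_matrix, the short species names, and the numbers are made up for illustration, and the matrix is shown as a plain list of lists rather than a Matrix object. The header line starts with a tab so that the names line up over the number columns.

```python
def format_matrix(names, matrix):
    """Return the matrix as tab-separated text with row and column labels."""
    lines = []
    # Header: a leading tab leaves the top-left corner empty.
    lines.append("\t" + "\t".join(names))
    for name, row in zip(names, matrix):
        entries = [str(value) for value in row]
        lines.append(name + "\t" + "\t".join(entries))
    return "\n".join(lines)


# Made-up names and values, not the lab's results.
names = ["sp_A", "sp_B", "sp_C"]
matrix = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]

table = format_matrix(names, matrix)
print(table)
```

Names longer than one tab stop can still push columns out of line, so short names (or a check of the printed output) help here.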
Instead of a transcript, copy over just your formatted matrix into a plain text file (it should preserve the formatting). Then answer the three questions below in this txt file:
Make sure that both you and your partner have a copy of all the code written during the lab period. Both partners should submit the files: