Sara Sheehan

CSC 334: Topics in Computational Biology

Homework 8: Hidden Markov Models (PSMC)

Due: Thursday, Dec. 3, 11:59pm on Moodle

The goal of this assignment is to work with a popular Hidden Markov Model (HMM) that is used for determining a history of effective population sizes over time. This is the Pairwise Sequentially Markovian Coalescent (PSMC). This will also give you more practice downloading, installing, and using existing software.

Here are the datasets we'll be working with. The size histories of each population are the same as in Homework 7, but the sequences are longer so that we have more data to work with.

Datasets: (right-click to download)

dataset1_big.fasta

dataset2_big.fasta

Download and install PSMC
Download link: PSMC. Click on "Download ZIP", and save it to a place that makes sense for you. Unzip this file and navigate to the folder "psmc-master" in the terminal. After that, follow the instructions in the README file to install PSMC:
```
 make 
```
```
 cd utils 
```
```
 make 
```
Download and install gnuplot
If the last step didn't work (making the util binaries), then you probably don't have gnuplot. You can download it here: gnuplot. The latest version is probably best. Unzip this file and navigate to the gnuplot folder. Then install gnuplot (instructions in INSTALL) by typing:
```
 ./configure 
```
```
 make 
```
```
 make install 
```
Let me know if you run into any problems with this step or installing PSMC.

Converting the data
One common step in using existing software is converting your data to the right format. PSMC is no exception. The input format to PSMC consists of "T"s (kind of like 0's, no difference between sequences), and "K"s (kind of like 1's, a difference between sequences). Each consecutive pair of sequences in the fasta file should be paired together and converted to this T/K format. For example:
```
 >hap_0 
 ACGT 
 >hap_1 
 ATGT
```
should be converted to
```
 >diploid_0 
 TKTT 
```
So it is kind of like fasta format, but each pair (i.e. individual) is now one sequence. (Note: it doesn't matter what you call each sequence.) Write a python script to perform this conversion, and run it on each dataset above. The extension for these files should be .psmcfa (fa for fasta).

Running PSMC on the data
Now we're ready to run PSMC. For each dataset, run the following commandline:
```
 ./psmc -p 4+2*5+4 -t 10 -N 20 -r 2 -o dataset.psmc dataset.psmcfa 
```
This will probably take a few minutes. The output file has extension .psmc.

Graphing the PSMC output
To visualize the results, now run the commandline:
```
 utils/psmc_plot.pl dataset dataset.psmc
```
For each dataset. The output file will have extension .eps, which you should be able to open and display as a graph.

Analysis
Answer the following questions in a separate file.
- In the PSMC HMM, what is the hidden state, and what are the emissions/observations?
- Why does this hidden state give us information about the population sizes? (This part is fairly complicated within the PSMC code, but discuss the main intuition.)
- What is the "-r" input parameter in the PSMC commandline? How might this ratio affect the accuracy of the results? Think about what would happen if this ratio were either really big or really small.
- Describe in words the population size histories of the two datasets. Does this agree with your preliminary analysis from Homework 7?

Submit your code (python file), your 2 population size plots, and a file with your answers to the questions in step 6, all on Moodle.

Conversion Solution:

hw8_sol.py