Homework 8: Hidden Markov Models (PSMC)
Due: Thursday, Dec. 3, 11:59pm on Moodle
The goal of this assignment is to work with a popular Hidden Markov Model (HMM) that is used for determining a history of effective population sizes over time. This is the Pairwise Sequentially Markovian Coalescent (PSMC). This will also give you more practice downloading, installing, and using existing software.
Here are the datasets we'll be working with. The size histories of each population are the same as in Homework 7, but the sequences are longer so that we have more data to work with.
Datasets: (right-click to download)
Download link: PSMC. Click on "Download ZIP", and save it to a place that makes sense for you. Unzip this file and navigate to the folder "psmc-master" in the terminal. After that, follow the instructions in the README file to install PSMC:
make
cd utils
make
If the last step didn't work (making the util binaries), then you probably don't have gnuplot. You can download it here: gnuplot. The latest version is probably best. Unzip this file and navigate to the gnuplot folder. Then install gnuplot (instructions in INSTALL) by typing:
./configure
make
make installLet me know if you run into any problems with this step or installing PSMC.
One common step in using existing software is converting your data to the right format. PSMC is no exception. The input format to PSMC consists of "T"s (kind of like 0's, no difference between sequences), and "K"s (kind of like 1's, a difference between sequences). Each consecutive pair of sequences in the fasta file should be paired together and converted to this T/K format. For example:
>hap_0should be converted to
ACGT
>hap_1
ATGT
>diploid_0So it is kind of like fasta format, but each pair (i.e. individual) is now one sequence. (Note: it doesn't matter what you call each sequence.) Write a python script to perform this conversion, and run it on each dataset above. The extension for these files should be .psmcfa (fa for fasta).
TKTT
Now we're ready to run PSMC. For each dataset, run the following commandline:
./psmc -p 4+2*5+4 -t 10 -N 20 -r 2 -o dataset.psmc dataset.psmcfaThis will probably take a few minutes. The output file has extension .psmc.
To visualize the results, now run the commandline:
utils/psmc_plot.pl dataset dataset.psmcFor each dataset. The output file will have extension .eps, which you should be able to open and display as a graph.
Answer the following questions in a separate file.
Conversion Solution: