CS364 Lab 7: Machine Learning for Population Genetics


Overview and goals

The goals of this lab are:

Clone your Lab 7 git repo as usual. You should see the following starter files:

You are welcome to add additional Python files, as long as the top-level files stay the same.


Part 1: Pairwise Heterozygosity (pi)

First make sure you can access the data, which is in the folder:

smathieson@joshi:~/GIT/CS364/cs364-lab7$ ls /homes/smathieson/Public/cs364/1000g/
CEU  CHB  YRI
smathieson@joshi:~/GIT/CS364/cs364-lab7$ ls /homes/smathieson/Public/cs364/1000g/CHB
matrices_CHB_neutral_3000.npy  matrices_CHB_sel01_600.npy  matrices_CHB_sel025_600.npy  matrices_CHB_sel05_600.npy  matrices_CHB_sel1_600.npy

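The three sub-folders contain simulated data based on human populations from the 1000 Genomes Project:

CEU: Utah residents with Northern and Western European ancestry
CHB: Han Chinese in Beijing, China
YRI: Yoruba in Ibadan, Nigeria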

Some of the regions are under selection and some are neutral. For example, if you run the starter file data_iterator.py you should see:

$ python3 data_iterator.py CHB
test shapes X,y (1080, 198, 36) (1080, 1)
train shapes X,y (50, 198, 36) (50, 1)

This means there are 1080 test examples, each with n=198 haplotypes (99 diploid individuals) and S=36 SNPs. There is one label (0 for neutral and 1 for selection) for each example. The train shape is for a single batch (of size 50). Look through data_iterator.py and make sure the code makes sense.
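To make the shapes concrete, here is a minimal sketch that uses an all-zero stand-in array with the same shape as the test set. It does not load the real data, and the names X_test, y_test, and region are only illustrative.

```python
import numpy as np

# Synthetic stand-ins with the shapes printed above (all zeros, NOT the real data).
X_test = np.zeros((1080, 198, 36), dtype=np.int8)
y_test = np.zeros((1080, 1), dtype=np.int8)

num_examples, n, S = X_test.shape   # examples, haplotypes, SNPs
region = X_test[0]                  # one region: haplotypes as rows, SNPs as columns
num_individuals = n // 2            # two haplotypes per diploid individual

print(num_examples, n, S, num_individuals)  # 1080 198 36 99
print(region.shape, y_test[0, 0])
```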

Next, in train_pi.py, implement a helper function to compute pi for one region, using the fast algorithm discussed in class on Tuesday (this involves computing the folded site frequency spectrum first). For example, the first train and test regions have pi:

pi 4.337794185509924

pi 4.49761575142286
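One way to compute pi from the folded site frequency spectrum is sketched below. It may differ in detail from the algorithm covered in class. It assumes each region is an (n, S) array with alleles coded as 0/1 (an assumption about the encoding; check data_iterator.py), and the function names are hypothetical. A site with minor allele count i contributes i*(n-i) differing pairs out of n*(n-1)/2 total pairs.

```python
import numpy as np

def folded_sfs(region):
    """Folded SFS: entry i counts the sites whose minor allele count is i.

    Assumes region has shape (n, S) with alleles coded 0/1.
    """
    n = region.shape[0]
    counts = region.sum(axis=0).astype(int)   # copies of allele 1 at each site
    minor = np.minimum(counts, n - counts)    # fold: use the rarer allele
    return np.bincount(minor, minlength=n // 2 + 1)

def pi_from_region(region):
    """Average number of pairwise differences, computed from the folded SFS."""
    n = region.shape[0]
    sfs = folded_sfs(region)
    i = np.arange(len(sfs))
    num_pairs = n * (n - 1) / 2
    # Each site with minor count i separates i * (n - i) pairs of haplotypes.
    return float((sfs * i * (n - i)).sum() / num_pairs)

# Toy check: 4 haplotypes, 3 sites, with allele counts 1, 2, and 0.
toy = np.array([[1, 1, 0],
                [0, 1, 0],
                [0, 0, 0],
                [0, 0, 0]])
print(pi_from_region(toy))
```

On the toy matrix, the six pairs of haplotypes differ at 7 sites in total, so the result should equal 7/6. The same check against a brute-force pairwise count is a useful test on a real region.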

The last step for this part is to devise an algorithm to find a threshold for pi such that if pi > threshold, we classify the region as neutral, and if pi < threshold, we classify the region as selected. Hint: think about how to use the training data to obtain the threshold, then the testing data to evaluate the threshold.
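One simple approach (a sketch, not the required solution) is to try every midpoint between consecutive sorted training pi values and keep the one with the highest training accuracy, then score that fixed threshold on the test set. The function names and the toy numbers below are hypothetical; labels follow the lab's convention of 0 for neutral and 1 for selection.

```python
import numpy as np

def accuracy(pi_values, labels, threshold):
    """Fraction correct when pi < threshold is predicted as selected (label 1)."""
    preds = (np.asarray(pi_values, dtype=float) < threshold).astype(int)
    return float(np.mean(preds == np.asarray(labels).ravel()))

def best_threshold(train_pi, train_labels):
    """Pick the candidate threshold with the highest training accuracy."""
    values = np.unique(np.asarray(train_pi, dtype=float))  # sorted, distinct
    # Candidates are midpoints between neighbouring pi values.
    candidates = (values[:-1] + values[1:]) / 2 if len(values) > 1 else values
    accs = [accuracy(train_pi, train_labels, t) for t in candidates]
    best = int(np.argmax(accs))
    return float(candidates[best]), accs[best]

# Toy example with made-up numbers: low pi is labeled as selected (1).
train_pi = [1.0, 2.0, 5.0, 6.0]
train_labels = [1, 1, 0, 0]
threshold, train_acc = best_threshold(train_pi, train_labels)
print(threshold, train_acc)                      # 3.5 1.0
print(accuracy([2.5, 4.0], [1, 0], threshold))   # evaluate on held-out values
```

Choosing the threshold only from training data keeps the test accuracy an honest estimate of how well the rule generalizes.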


Part 2: Convolutional Neural Network (CNN)

The next part of the lab involves training a CNN (provided in cnn.py) to achieve the same task (i.e. binary classification of regions into neutral vs. selected).

To use tensorflow, you’ll need to put these lines at the end of your .bashrc file:

export PATH=/packages/cs/python3.7.7/bin:/usr/local/cuda-10.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:/usr/local/cuda/extras/CUPTI/lib64:/packages/cs/python3.7.7/lib

Then follow the tensorflow tutorial here (implementing the steps in train_cnn.py) to train the CNN on batches of training data to predict the correct output. At the end, report your training and testing accuracy. Notes:
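A training step in TensorFlow 2 typically looks like the sketch below. The model here is a small hypothetical stand-in, not the network in cnn.py, and the sketch assumes the model outputs a single logit per region; adjust the loss and metric if the provided CNN ends in a sigmoid or a two-unit softmax. The random batch only mimics the (50, 198, 36) batch shape shown earlier.

```python
import numpy as np
import tensorflow as tf

def make_stand_in_model():
    """Hypothetical placeholder; the lab's real model is defined in cnn.py."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv1D(8, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1),  # one logit: neutral (0) vs. selected (1)
    ])

model = make_stand_in_model()
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_acc = tf.keras.metrics.BinaryAccuracy()

def train_step(x, y):
    """Run one gradient update on a single batch and track accuracy."""
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = loss_fn(y, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # The metric expects probabilities, so convert the logits first.
    train_acc.update_state(y, tf.sigmoid(logits))
    return loss

# Random stand-in batch with the same shape as one training batch.
x_batch = np.random.rand(50, 198, 36).astype("float32")
y_batch = np.random.randint(0, 2, size=(50, 1)).astype("float32")
loss = train_step(x_batch, y_batch)
print(float(loss), float(train_acc.result()))
```

In train_cnn.py, the same step would be called once per batch from the data iterator, with a matching test step (no gradient update) used to report testing accuracy at the end.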


Analysis and Submitting your work

Make sure to push your code often; only the final version will be submitted. The README.md file asks a few questions about your results.