CS364 Lab 7: Machine Learning for Population Genetics

Overview and goals

The goals of this lab are:

Understand measures of sequence diversity
See how summary statistics can be used for classification
Compare summary statistics to a deep learning method
Apply both methods to a realistic natural selection dataset

Clone your Lab 7 git repo as usual. You should see the following starter files:

README.md - questionnaire to be completed for your final submission
cnn.py - starter code for the CNN model
data_iterator.py - starter code to read in the data
train_pi.py - where you should implement a method for using pi for classification
train_cnn.py - where you should implement deep learning training and evaluation

You are also welcome to add additional Python files as long as the top-level files stay the same.

Part 1: Pairwise Heterozygosity (pi)

First make sure you can access the data, which is in the folder:

smathieson@joshi:~/GIT/CS364/cs364-lab7$ ls /homes/smathieson/Public/cs364/1000g/
CEU  CHB  YRI
smathieson@joshi:~/GIT/CS364/cs364-lab7$ ls /homes/smathieson/Public/cs364/1000g/CHB
matrices_CHB_neutral_3000.npy  matrices_CHB_sel01_600.npy  matrices_CHB_sel025_600.npy  matrices_CHB_sel05_600.npy  matrices_CHB_sel1_600.npy

The three sub-folders contain simulated data based on human populations:

CEU: North European
CHB: East Asian
YRI: West African

Some of the regions are under selection and some are neutral. For example, if you run the starter file data_iterator.py, you should see:

$ python3 data_iterator.py CHB
test shapes X,y (1080, 198, 36) (1080, 1)
train shapes X,y (50, 198, 36) (50, 1)

This means there are 1080 test examples, each with n=198 haplotypes (99 diploid individuals) and S=36 SNPs. There is one label (0 for neutral and 1 for selection) for each example. The train shape is just for one batch (of size 50). Look through data_iterator.py and make sure the code makes sense.

Next, in train_pi.py, implement a helper function to compute pi for one region, using the fast algorithm discussed in class on Tuesday (this involves computing the folded site frequency spectrum first). For example, the first test region has pi:

pi 4.49761575142286
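
As a minimal sketch of the idea (it may differ in details from the version presented in class): the folded SFS records, for each i, how many SNPs have a minor allele carried by exactly i haplotypes, and each such SNP contributes i(n-i) differing pairs out of the n(n-1)/2 possible pairs of haplotypes. The sketch assumes each region is an (n, S) array of 0/1 alleles; the actual encoding in the .npy files may differ, so check data_iterator.py first. The function names folded_sfs and pi_from_sfs are illustrative and not part of the starter code.

```python
import numpy as np


def folded_sfs(region):
    """Folded site frequency spectrum of one region.

    region: (n, S) array of 0/1 alleles (assumed encoding).
    Returns eta, where eta[i] is the number of SNPs whose minor allele
    appears on exactly i haplotypes, for i = 0..n//2.
    """
    n = region.shape[0]
    counts = np.rint(region.sum(axis=0)).astype(int)  # copies of allele 1 per SNP
    minor = np.minimum(counts, n - counts)            # fold onto the minor allele
    return np.bincount(minor, minlength=n // 2 + 1)


def pi_from_sfs(eta, n):
    """Average number of pairwise differences among n haplotypes."""
    i = np.arange(len(eta))
    num_pairs = n * (n - 1) / 2
    return float(np.sum(i * (n - i) * eta) / num_pairs)


# Tiny hand-checkable example: n = 4 haplotypes, S = 3 SNPs.
toy = np.array([[0, 1, 1],
                [0, 1, 0],
                [1, 1, 0],
                [0, 0, 0]])
toy_pi = pi_from_sfs(folded_sfs(toy), toy.shape[0])
print("pi", toy_pi)
```

In the toy example every SNP has minor allele count 1, so each contributes 1*3 = 3 differing pairs; 9 differences over 6 pairs gives pi = 1.5. Columns where every haplotype carries the same allele fall into eta[0] and contribute nothing.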

The last step for this part is to devise an algorithm to find a threshold of pi such that if pi > threshold, we classify the region as neutral, and if pi < threshold, we classify the region as selected. Hint: think about how to use the training data to obtain the threshold, then the testing data to evaluate the threshold.
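
One possible approach, sketched under assumptions: compute pi for every training region, try each midpoint between consecutive distinct pi values as a candidate threshold, keep the one with the highest training accuracy, and then report accuracy on the test set. The helper names (predict_from_pi, accuracy, fit_threshold) and the toy numbers are hypothetical; since the training data arrives in batches, you would first need to collect the pi values and labels across batches.

```python
import numpy as np


def predict_from_pi(pis, threshold):
    """Label 1 (selected) when pi < threshold, else 0 (neutral)."""
    return (np.asarray(pis, dtype=float) < threshold).astype(int)


def accuracy(pis, labels, threshold):
    """Fraction of regions whose predicted label matches the true label."""
    labels = np.asarray(labels).ravel()
    return float(np.mean(predict_from_pi(pis, threshold) == labels))


def fit_threshold(train_pis, train_labels):
    """Pick the candidate threshold with the highest training accuracy."""
    values = np.unique(np.asarray(train_pis, dtype=float))  # sorted, distinct
    midpoints = (values[:-1] + values[1:]) / 2
    # Also allow "everything neutral" and "everything selected".
    candidates = np.concatenate(([values[0]], midpoints, [values[-1] + 1.0]))
    scores = [accuracy(train_pis, train_labels, t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])


# Made-up pi values and labels, only to show the call pattern.
train_pis = [1.0, 2.0, 5.0, 6.0]
train_labels = [[1], [1], [0], [0]]
threshold = fit_threshold(train_pis, train_labels)

test_pis = [0.5, 4.0, 3.0]
test_labels = [[1], [0], [0]]
print("threshold", threshold)
print("test accuracy", accuracy(test_pis, test_labels, threshold))
```

Choosing the threshold on training data only and scoring it on held-out test data keeps the reported accuracy honest. Other criteria (for example, balancing errors on the two classes) are equally reasonable.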

Part 2: Convolutional Neural Network (CNN)

The next part of the lab involves training a CNN (provided in cnn.py) to achieve the same task (i.e. binary classification of regions into neutral vs. selected).

To use TensorFlow, you’ll need to put these lines at the end of your .bashrc file:

export PATH=/packages/cs/python3.7.7/bin:/usr/local/cuda-10.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:/usr/local/cuda/extras/CUPTI/lib64:/packages/cs/python3.7.7/lib

Then follow the TensorFlow tutorial here (implementing the steps in train_cnn.py) to train the CNN on batches of training data to predict the correct output. At the end, report your training and testing accuracy. Notes:

For the loss function, make sure to use binary cross entropy with from_logits=True, since the CNN in cnn.py does not apply a final activation such as softmax.
About 10 epochs should be enough to reduce the loss.
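
To see what from_logits=True means, here is a small NumPy sketch (not the TensorFlow implementation itself) of binary cross entropy computed directly from raw network outputs. With from_logits=True, a loss such as tf.keras.losses.BinaryCrossentropy applies the sigmoid internally, so the model should output unbounded scores rather than probabilities. The function names and example numbers below are illustrative assumptions.

```python
import numpy as np


def bce_from_logits(logits, labels):
    """Mean binary cross entropy computed directly from raw logits.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)),
    which equals -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z).
    """
    z = np.asarray(logits, dtype=float).ravel()
    y = np.asarray(labels, dtype=float).ravel()
    per_example = np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return float(per_example.mean())


def accuracy_from_logits(logits, labels):
    """A logit above 0 corresponds to a predicted probability above 0.5."""
    preds = (np.asarray(logits).ravel() > 0).astype(int)
    return float(np.mean(preds == np.asarray(labels).ravel()))


# Made-up outputs for three regions (label 1 = selection, 0 = neutral).
logits = np.array([[2.0], [-1.0], [0.5]])
labels = np.array([[1], [0], [0]])
print("loss", bce_from_logits(logits, labels))
print("accuracy", accuracy_from_logits(logits, labels))
```

The same reasoning applies when you compute accuracy in train_cnn.py: if the model outputs logits, compare them against 0 (or pass them through a sigmoid and compare against 0.5) before matching them to the labels.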

Analysis and Submitting your work

Make sure to push your code often; only the final version will be submitted. The README.md file asks a few questions about your results.