The goals of this lab are:
Clone your Lab 7 git repo as usual. You should see the following starter files:
README.md - questionnaire to be completed for your final submission
cnn.py - starter code for the CNN model
data_iterator.py - starter code to read in the data
train_pi.py - where you should implement a method for using pi for classification
train_cnn.py - where you should implement deep learning training and evaluation
You are also welcome to add additional python files as long as the top-level files are the same.
First make sure you can access the data, which is in the folder:
smathieson@joshi:~/GIT/CS364/cs364-lab7$ ls /homes/smathieson/Public/cs364/1000g/
CEU CHB YRI
smathieson@joshi:~/GIT/CS364/cs364-lab7$ ls /homes/smathieson/Public/cs364/1000g/CHB
matrices_CHB_neutral_3000.npy matrices_CHB_sel01_600.npy matrices_CHB_sel025_600.npy matrices_CHB_sel05_600.npy matrices_CHB_sel1_600.npy
The three sub-folders contain simulated data based on human populations:
Some of the regions are under selection and some are neutral. For example, if you run the starter file data_iterator.py you should see:
$ python3 data_iterator.py CHB
test shapes X,y (1080, 198, 36) (1080, 1)
train shapes X,y (50, 198, 36) (50, 1)
This means there are 1080 test examples, each with n=198 haplotypes (99 individuals) and S=36 SNPs. There is one label (0 for neutral and 1 for selection) for each example. The train shape is just for one batch (of size 50). Look through data_iterator.py and make sure the code makes sense.
Next, in train_pi.py, implement a helper function to compute pi for one region, using the fast algorithm discussed in class on Tuesday (this involves computing the folded site frequency spectrum first). For example, the first train and test regions have pi:
pi 4.337794185509924
pi 4.49761575142286
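As a rough illustration of the folded-SFS route to pi (not necessarily the exact algorithm from class), here is a minimal sketch. It assumes each region is an (n, S) array of 0/1 alleles with haplotypes as rows; the real matrices may be coded differently, so check data_iterator.py first. The function names folded_sfs and pi_from_folded_sfs and the toy region are made up for this example.

```python
import numpy as np

def folded_sfs(region):
    """Folded SFS: entry i counts the sites whose minor allele appears i times.

    ASSUMPTION: `region` is an (n, S) array of 0/1 alleles, rows = haplotypes.
    """
    n = region.shape[0]
    ones = (region == 1).sum(axis=0)        # copies of allele 1 at each site
    minor = np.minimum(ones, n - ones)      # minor allele count at each site
    return np.bincount(minor, minlength=n // 2 + 1)

def pi_from_folded_sfs(sfs, n):
    """Average number of pairwise differences, computed from a folded SFS."""
    counts = np.arange(len(sfs))
    # A site with minor count i differs in i * (n - i) of the n-choose-2 pairs.
    return float((counts * (n - counts) * sfs).sum() / (n * (n - 1) / 2))

# Tiny hand-checkable region (hypothetical): 4 haplotypes, 3 sites.
toy = np.array([[0, 1, 1],
                [0, 1, 0],
                [1, 0, 0],
                [0, 0, 0]])
sfs = folded_sfs(toy)
pi = pi_from_folded_sfs(sfs, toy.shape[0])
print("sfs", sfs)
print("pi", pi)
```

This needs one pass over the matrix plus a sum over n/2 + 1 frequency classes, instead of comparing all pairs of haplotypes.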
The last step for this part is to devise an algorithm to find a threshold of pi such that if pi > threshold, we classify the region as neutral, and if pi < threshold, we classify the region as selected. Hint: think about how to use the training data to obtain the threshold, then the testing data to evaluate the threshold.
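One possible way to follow the hint, sketched under assumptions: try every midpoint between neighbouring training pi values, keep the one with the best training accuracy, then score it once on the test set. The helper names (accuracy, best_threshold) and all numbers below are made up for illustration and are not from the lab data.

```python
import numpy as np

def accuracy(pi_values, labels, threshold):
    """Fraction correct when pi < threshold means selected (1), else neutral (0)."""
    predictions = (pi_values < threshold).astype(int)
    return float((predictions == labels).mean())

def best_threshold(pi_values, labels):
    """Try midpoints between sorted training pi values; keep the most accurate."""
    values = np.unique(pi_values)                  # sorted, duplicates removed
    candidates = (values[:-1] + values[1:]) / 2    # one cut between each neighbouring pair
    scores = [accuracy(pi_values, labels, t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])

# Made-up pi values and labels (hypothetical, NOT from the lab data).
train_pi = np.array([1.0, 1.5, 2.0, 4.0, 4.5, 5.0])
train_y = np.array([1, 1, 1, 0, 0, 0])
test_pi = np.array([1.2, 3.5, 4.8, 2.5])
test_y = np.array([1, 0, 0, 0])

threshold = best_threshold(train_pi, train_y)      # chosen on training data only
train_acc = accuracy(train_pi, train_y, threshold)
test_acc = accuracy(test_pi, test_y, threshold)    # evaluated on held-out data
print("threshold", threshold)
print("train accuracy", train_acc)
print("test accuracy", test_acc)
```

The labels here are 1-D. The lab's y arrays have shape (N, 1), so they would need flattening (for example with y.ravel()) before a comparison like this.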
The next part of the lab involves training a CNN (provided in cnn.py) to achieve the same task (i.e. binary classification of regions into neutral vs. selected).
To use tensorflow, you’ll need to put these lines at the end of your .bashrc file:
export PATH=/packages/cs/python3.7.7/bin:/usr/local/cuda-10.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:/usr/local/cuda/extras/CUPTI/lib64:/packages/cs/python3.7.7/lib
Then follow the tensorflow tutorial (implementing the steps in train_cnn.py) to train the CNN on batches of training data to predict the correct output. At the end, report your training and testing accuracy. Notes:
Set from_logits=True in your loss function, since we didn’t apply softmax in the cnn.py CNN.
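To see why this flag matters, here is a small NumPy sketch of the idea (not TensorFlow's implementation): a loss computed directly from raw scores gives the same value as applying softmax first and then taking cross-entropy on the probabilities. It assumes the CNN emits two raw scores per region, one per class, which may not match cnn.py; the function names and numbers are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy_from_probs(probs, labels):
    """Mean negative log-probability of the true class (from_logits=False case)."""
    rows = np.arange(len(labels))
    return float(-np.log(probs[rows, labels]).mean())

def cross_entropy_from_logits(logits, labels):
    """Same loss computed directly from raw scores (from_logits=True case)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    rows = np.arange(len(labels))
    return float(-log_probs[rows, labels].mean())

# Hypothetical raw outputs for 3 regions, 2 classes (0 = neutral, 1 = selected).
logits = np.array([[2.0, -1.0],
                   [0.5, 0.5],
                   [-3.0, 1.0]])
labels = np.array([0, 1, 1])

loss_logits = cross_entropy_from_logits(logits, labels)
loss_probs = cross_entropy_from_probs(softmax(logits), labels)
print("from logits:", loss_logits)
print("softmax then cross-entropy:", loss_probs)
```

If the model outputs raw scores but the loss is told they are already probabilities, the loss is computed on the wrong quantity, so the flag has to match what the model's final layer actually outputs.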
Make sure to push your code often; only the final version will be submitted. The README.md file asks a few questions about your results.