CSC 334: Topics in Computational Biology

Homework 7: Measures of sequence diversity

Due: Tuesday, Nov. 24, 11:59pm on Moodle

The goal of this assignment is to implement the measures of sequence diversity that we've been discussing in class. Computing these summary statistics is a common first step when you encounter a new dataset. This assignment is shorter than a normal homework so that you also have time to work on your final projects.

Here is a template Python file to start with: hw7.py. You can modify the arguments of the functions as necessary. You don't need to submit a separate file for the questions below; the answers can go in the code file, as long as it is clear what is being computed and which question each answer belongs to.

Datasets: (right-click to download)

dataset1.fasta

dataset2.fasta

  1. Number of segregating sites: S

    Compute and report the number of segregating sites (SNPs) for each of these two datasets.

  2. Average pairwise heterozygosity: π

    Compute and report π for each dataset. You can report π either per sequence or per site (divided by the sequence length); these two datasets have the same sequence length, so π is comparable either way.

  3. Site frequency spectrum: SFS

    Compute and report the folded SFS for each dataset.

  4. Analysis

    One of these datasets has a constant population size and one has undergone recent population growth. Which is which and why is that the case?
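
As a sanity check for parts 1-3, here is a minimal sketch of the three statistics on a tiny hand-made alignment. It is not the hw7.py template: the function names, the `toy` alignment, and the input format (a list of equal-length strings) are all assumptions made for illustration. It also assumes every variable site is biallelic and that there are no gaps or ambiguous characters, which may not match how you choose to handle the real datasets.

```python
from collections import Counter
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns that contain more than one distinct allele."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def pairwise_pi(seqs):
    """Average number of differences over all unordered pairs of sequences.

    This is the per-sequence value; divide by len(seqs[0]) for a per-site value.
    """
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


def folded_sfs(seqs):
    """Folded SFS as a list: entry i-1 counts sites with minor allele count i.

    Assumption: only biallelic sites are counted; other columns are skipped.
    """
    n = len(seqs)
    sfs = [0] * (n // 2)
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) == 2:
            minor = min(counts.values())
            sfs[minor - 1] += 1
    return sfs


# Hypothetical toy alignment: 4 sequences, 6 sites, 3 of them variable.
toy = [
    "ACGTAC",
    "ACGTAT",
    "ATGTAC",
    "ATGAAC",
]

print("S   =", segregating_sites(toy))
print("pi  =", pairwise_pi(toy))
print("SFS =", folded_sfs(toy))
```

Two checks are worth carrying over to your own code: the entries of the folded SFS should sum to S when all variable sites are biallelic, and π computed pair by pair should equal the sum over sites of k(n-k) divided by the number of pairs, where k is the count of one allele at that site and n is the number of sequences.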

Make sure that when I run your code, the printed output is easy to understand and shows everything being computed. Submit your code on Moodle.