CS260 Lab 8: Statistics

Due: Thursday, November 11 at 11:59pm


Overview and goals

The goals of this lab are to practice hypothesis testing (forming a null hypothesis, computing a test statistic, and interpreting a p-value), to compare p-values obtained via the central limit theorem with those obtained from randomized trials and permutation testing, and to gain experience with the scipy, pandas, and matplotlib libraries.


Part 1: Coin Toss example

First we will implement the scenario from Handout 17, where we toss a coin 80 times and observe 54 Heads. The question we’re trying to answer is: Is this coin fair?

Implement this part of the lab in coin_toss.py (no command line arguments needed, but make sure to include a main function and other helper functions as appropriate). You will need the scipy library, which can be installed with pip3:

pip3 install scipy

If this doesn’t work, it may be that you have multiple versions of Python 3. In that case run:

python3 -m pip install scipy

A) Central limit theorem (CLT) method

First we will try to answer our question using the CLT. Following what we did in class, compute the test statistic (Z-score) for this example, under the null hypothesis of a fair coin. The CLT tells us that this Z-score is approximately drawn from a standard normal distribution: N(0,1).

It is okay to hardcode n = 80, number of Heads = 54, etc. Please encode Tails as 0 and Heads as 1 for the purposes of the lab.

To compute the associated p-value, we will integrate the pdf (probability density function) of the standard normal distribution from the absolute value of the Z-score to positive infinity. To perform a two-sided test (i.e. that the coin is unfair in either direction, not just weighted toward Heads), we then multiply the result by two, which by symmetry is equivalent to also integrating the pdf from negative infinity to the negative of that value. Note that Z-scores can be positive or negative, which is why we take the absolute value before integrating.

To integrate (approximately) in Python, you can use the quad function:

from scipy.integrate import quad

result, error = quad(func, a, b)

In this example we would integrate the function func from a to b, obtaining the area as result and an estimate of the numerical error as error. See the quad documentation for more information.

Here we want to integrate the pdf of the standard normal distribution. This function is implemented for us as norm.pdf, after doing the following import:

from scipy.stats import norm

See the norm documentation for more information.

After obtaining the p-value, make sure to print it out like this:

CLT p-value: <p-value>
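
Putting the pieces above together, here is a minimal sketch of one way Part 1A could look (the function name clt_pvalue and the overall structure are suggestions, not requirements):

import math
from scipy.integrate import quad
from scipy.stats import norm

def clt_pvalue(n=80, heads=54, p_null=0.5):
    # Z-score under the null hypothesis of a fair coin
    p_hat = heads / n
    z = (p_hat - p_null) / math.sqrt(p_null * (1 - p_null) / n)
    # integrate the standard normal pdf from |Z| to infinity, then double for a two-sided test
    area, error = quad(norm.pdf, abs(z), math.inf)
    return 2 * area

# inside main:
print("CLT p-value:", clt_pvalue())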

B) Randomized trials method

The CLT method gives us an approximation of the p-value, based on the assumptions of the CLT (finite variance and “large enough” sample size). We can relax these assumptions by creating our own null distribution through randomized trials. If we flip a fair coin 80 times and record the number of Heads, then repeat this process many times, we will obtain an estimate of the null distribution. The p-value is then the fraction of trials that are at least as extreme as our observed result of 54 Heads (in either direction, for a two-sided test).

Make sure to print out your p-value like this:

Random trials p-value: <p-value>
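
Here is a rough sketch of one way the randomized trials could be implemented (10,000 trials and the helper name random_trials_pvalue are arbitrary choices; returning the per-trial fractions is handy for the plotting step below):

import random

def random_trials_pvalue(n=80, heads=54, num_trials=10000):
    # estimate a two-sided p-value by simulating a fair coin
    observed_diff = abs(heads / n - 0.5)
    fractions = []          # fraction of Heads in each trial (useful for plotting later)
    num_extreme = 0
    for _ in range(num_trials):
        num_heads = sum(random.randint(0, 1) for _ in range(n))   # Tails = 0, Heads = 1
        fractions.append(num_heads / n)
        if abs(num_heads / n - 0.5) >= observed_diff:
            num_extreme += 1
    return num_extreme / num_trials, fractions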

Plotting the distribution

The last step of Part 1 is to plot the null distribution and our observed data. Use plt.hist (see the matplotlib documentation) to plot a histogram of the fraction of Heads observed across all of your trials. I would recommend normalizing the histogram (so it “integrates” to 1) using density=True.

Then create a way of showing where our observed data (i.e. 54 Heads out of 80 tosses) lies on this distribution. Make sure to include axis labels and a title. Save your plot as:

figures/coin_toss.pdf
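
One possible way to create this plot, assuming fractions is the list of per-trial fractions of Heads from Part 1B (bins=20 and the vertical line marking the observed data are just suggestions):

import matplotlib.pyplot as plt

def plot_null_distribution(fractions, observed=54/80):
    plt.hist(fractions, bins=20, density=True, label="null distribution")
    plt.axvline(observed, color="red", label="observed (54/80 Heads)")   # mark the observed data
    plt.xlabel("Fraction of Heads")
    plt.ylabel("Density")
    plt.title("Null distribution for 80 tosses of a fair coin")
    plt.legend()
    plt.savefig("figures/coin_toss.pdf")   # assumes the figures/ directory already exists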


Part 2: Genome Size example

In the next part of the lab we will investigate a database of genomes from the bacterium Streptococcus suis, which causes a zoonotic disease that can be transmitted from pigs to humans. The file SSuis_Stats.xlsx contains statistics about individual S. suis genomes from several populations. The question we will answer in Part 2 is: Are the genome sizes of Population 1 and Population 2 significantly different?

Implement the next part of the lab in genome_size.py (no command line arguments needed). To read in this Excel file we will need a few external libraries, which can be installed with pip3:

pip3 install pandas openpyxl xlrd

To read in the data, I recommend creating a helper function that takes in a string filename and returns two lists: one of the genome sizes of Population 1, and one of the genome sizes of Population 2. First read the data into a pandas data frame:

import pandas as pd

# inside helper function:
data_frame = pd.read_excel(filename, header=0) # header on line 0
print(data_frame)

See the pandas excel documentation for more information about reading Excel files into data frames.

We’re not going to discuss pandas in too much detail, but one of the nice things about this library is that we can obtain subsets of our data fairly easily.

Use the pandas subsetting documentation to devise a way to filter the rows so you only have those where “Population” is equal to 1. Similarly, filter the rows based on “Population” 2. After that, filter the columns so you only have “Total Genome Size” for each population. (Alternatively, you could select this column first, then filter based on population.) Finally, convert each set of numbers to a list and return both lists.
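
For example, the filtering might look roughly like this (the helper name read_genome_sizes is a suggestion; the column names should match the spreadsheet header exactly):

import pandas as pd

def read_genome_sizes(filename):
    data_frame = pd.read_excel(filename, header=0)           # header on line 0
    pop1_rows = data_frame[data_frame["Population"] == 1]     # rows for Population 1
    pop2_rows = data_frame[data_frame["Population"] == 2]     # rows for Population 2
    pop1 = pop1_rows["Total Genome Size"].tolist()            # keep only the genome size column
    pop2 = pop2_rows["Total Genome Size"].tolist()
    return pop1, pop2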

To double check your result, the first few genome sizes for each population are:

pop1 = [2100834, 2098424, ... ]
pop2 = [2168777, 2324897, ... ]

A) Central limit theorem (CLT) method

Similarly to Part 1, we will now use two different methods to obtain p-values. First we will investigate the CLT method. In this genome size example, we do not know the true mean and variance of the null distribution. It’s difficult to say exactly what the null distribution is! But our null hypothesis is that the genome sizes of samples from Population 1 and 2 are from the same (null) distribution.

Since we don’t know the mean and variance, we need to use a t-test. See the t-test documentation for more information, but the basic idea is:

from scipy.stats import ttest_ind

result = ttest_ind(a, b)

In this case we will assume equal variances and perform a two-sided test. result contains both a test statistic and a p-value. Make sure to print out your p-value like this:

CLT p-value: <p-value>
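
For instance, assuming pop1 and pop2 are the two lists returned by your helper function (equal variances and a two-sided alternative are the scipy defaults), the call might look like this:

from scipy.stats import ttest_ind

pop1, pop2 = read_genome_sizes("SSuis_Stats.xlsx")   # helper function from above
result = ttest_ind(pop1, pop2, equal_var=True)       # equal variances, two-sided test
print("CLT p-value:", result.pvalue)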

B) Permutation testing method

In this example our randomized trial method will also be slightly different, since we don’t have access to the true null distribution. Instead, we will use permutation testing. This is a widely applicable concept, but in our case it will mean permuting the labels (i.e. Population 1 vs. 2) of the data, which will mimic the idea that the genome sizes were all drawn from the same distribution.

After permuting the labels many times and recomputing your test statistic for each permutation, estimate the p-value as the fraction of permutations at least as extreme as your observed result. Make sure to print out your p-value like this:

Permutation testing p-value: <p-value>
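
Here is one possible sketch, using the absolute difference in means as the test statistic (the statistic, helper name, and 10,000 permutations are all choices you are free to change):

import random

def permutation_pvalue(pop1, pop2, num_trials=10000):
    # observed absolute difference in mean genome size
    observed = abs(sum(pop1) / len(pop1) - sum(pop2) / len(pop2))
    pooled = pop1 + pop2                  # all genome sizes, labels ignored
    num_extreme = 0
    for _ in range(num_trials):
        random.shuffle(pooled)            # permute the Population labels
        new1 = pooled[:len(pop1)]
        new2 = pooled[len(pop1):]
        diff = abs(sum(new1) / len(new1) - sum(new2) / len(new2))
        if diff >= observed:
            num_extreme += 1
    return num_extreme / num_trials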

Note that your two p-values may not be as similar to each other as they were in Part 1, but they should still be quite close (same order of magnitude).


Part 3: COVID cases analysis (OPTIONAL!)

Note that this part is optional, but worth a small amount of extra credit for very thorough analyses.

In this last part you can analyze the database of COVID cases from January 1, 2021, which is in the file 01-01-2021.csv. This data is from the Johns Hopkins COVID site. This dataset shows various COVID statistics (number of cases, etc) broken down by region for one particular day. The idea is to ask a question about this dataset that can be answered with a procedure similar to Part 2.

One possible analysis would be to choose two states in the US, then gather a list of cases per county in each state. This will give you two lists, similar to Part 2. You could then frame a question like this: “On January 1, 2021, was COVID significantly worse in Pennsylvania than in New York?” You could perform a similar analysis for other parts of the world, etc.
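
As a hypothetical starting point, the state comparison could look something like the sketch below. The column names Province_State and Confirmed are assumptions based on the usual layout of the Johns Hopkins daily reports; check the header of 01-01-2021.csv before relying on them. A permutation test analogous to Part 2B would complete the analysis.

import pandas as pd
from scipy.stats import ttest_ind

def cases_by_county(filename, state):
    data_frame = pd.read_csv(filename, header=0)
    rows = data_frame[data_frame["Province_State"] == state]   # assumed column name
    return rows["Confirmed"].tolist()                          # assumed column name

pa = cases_by_county("01-01-2021.csv", "Pennsylvania")
ny = cases_by_county("01-01-2021.csv", "New York")
print("CLT p-value:", ttest_ind(pa, ny).pvalue)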

Include your implementation in covid_cases.py and write up your procedure/results in your README.md, following the questions below.


Part 4: Analysis

Include the answers to the following questions in your README.md:

  1. For Part 1, include both your p-values below. What is the probabilistic interpretation of the p-value in this coin toss situation? Given this interpretation, do you reject the null hypothesis (fair coin) or fail to reject?

  2. For Part 2, include both your p-values below. Do you reject the null hypothesis (genomes from Population 1 and Population 2 are roughly the same size) or fail to reject?

  3. For Part 2, you may have noticed that your p-values were not as close as in Part 1. What assumption of the CLT might be violated with this dataset?

  4. (Optional) If you did Part 3, write up your procedure and results here (what were you trying to test, how you went about it, what you found, etc).


Acknowledgements

Coin toss example based on MIT OpenCourseWare 18.650. Streptococcus suis dataset and genome size example from Eric Miller.