CS260 Lab 8: Statistics and Visualization

Due: Wednesday, November 15 at 11:59pm


Overview and goals

The goals of this lab are:

  1. use PCA to reduce and visualize high-dimensional data in 2D
  2. compute and interpret p-values using both CLT-based and randomized/permutation-based methods
  3. gain practice with the scikit-learn, scipy, matplotlib, and pandas libraries


Part 1: PCA on the Iris Flower Dataset

First we will apply PCA to a classic dataset of Iris flowers. This dataset is actually built into the package we will use: sklearn (scikit-learn). If you do not already have this library, install it using pip3:

pip3 install scikit-learn

If this doesn’t work, it may be that you have multiple versions of Python 3. In that case run:

python3 -m pip install scikit-learn

Write your code in the file pca.py. Begin by importing the necessary packages:

import numpy as np
import matplotlib.pyplot as plt

from sklearn import decomposition
from sklearn import datasets

The documentation for the PCA module can be found here.

The datasets module contains some example datasets, so we don’t have to download them. To load the Iris dataset, use:

iris = datasets.load_iris()
X = iris.data
y = iris.target

X is our n×p matrix, where n = 150 samples and p = 4 features. Print X. The features in this case are not DNA data, but physical features of the flowers:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm

Our goal here is to use PCA to visualize this data (since we cannot view the points in 4D!). Through the process of visualization, we’ll see whether our data can be clustered based on the features above.

Now print y. These are the labels of each sample (subspecies of flower in this case). They are called: 0 = setosa, 1 = versicolor, and 2 = virginica.

Note that y is not used in PCA itself; it is only used afterward to visualize any clustering apparent in the results.

Run PCA

Using the PCA documentation, create an instance of PCA with 2 components, then “fit” it using only X. Then use the transform method to transform the features (X) into 2D. (3 lines of code, but make sure you know what each line is doing!)
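As a sketch, the three lines might look like this (variable names like pca and X_2d are just suggestions):

```python
from sklearn import datasets, decomposition

iris = datasets.load_iris()
X = iris.data                              # 150 x 4 feature matrix

pca = decomposition.PCA(n_components=2)    # keep the first two principal components
pca.fit(X)                                 # learn the components from X only (no labels)
X_2d = pca.transform(X)                    # project the 150x4 data down to 150x2
print(X_2d.shape)
```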

Plot the results

Now we will use matplotlib to plot the results in 2D, as a scatter plot. The scatter function takes in a list of x-coordinates, a list of y-coordinates, and a list of colors for each point. Here the two axes are PC1 and PC2, and the points to plot are formed from the first and second columns of the transformed data.

plt.scatter(x_coordinates, y_coordinates, c=colors) # example

First plot without the colors to see what you get. Make sure to include axis labels and a title.

Then create a color dictionary that maps each label to a chosen color. Some common colors have 1-letter abbreviations (e.g. 'r' for red, 'b' for blue).

There is also a longer list of named colors here. After creating a color dictionary, use it to create a list of colors for all the points (thinking about what colors are easily distinguishable), then pass that in to your scatter method.
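A sketch of the colored scatter plot, with a hypothetical color dictionary (any three easily distinguishable colors work; the setup lines repeat the PCA step so the example is self-contained):

```python
import matplotlib.pyplot as plt
from sklearn import datasets, decomposition

iris = datasets.load_iris()
X, y = iris.data, iris.target
X_2d = decomposition.PCA(n_components=2).fit_transform(X)

# hypothetical color choices: map each of the 3 labels to a color
color_dict = {0: "r", 1: "g", 2: "b"}
colors = [color_dict[label] for label in y]    # one color per point

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=colors)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA of the Iris dataset")
```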

Legends are often created automatically in matplotlib, but for a scatter plot it’s a bit more involved. We’ll create extra plotting objects with no points, then use these as legend handles. Consider the code below: for each color, we create an empty plot (the 'o' means use circle markers), then map these empty objects to the names of the flowers (create a list of names, one per class).

# create legend: one empty (colored) circle handle per class
leg_objects = []
for i in range(3):
    circle, = plt.plot([], 'o', c=color_dict[i])  # no data, just a colored marker
    leg_objects.append(circle)
plt.legend(leg_objects, names)

At the end, save your plot (make sure to add/commit/push this figure) as:

figures/iris.pdf

The last step of this part is to create a PCA visualization for one other dataset that we have analyzed so far. I would recommend the phoneme dataset from Lab 7 (but remove the label when running PCA). Then color the points according to their label (similar to the iris dataset). For this part you can still work in the file pca.py. Try to avoid duplicated code, but it’s okay to have some since the plotting needs to be customized. Finally, save this plot as:

figures/other_pca.pdf



Part 2: Coin Toss example

First we will implement the scenario from Handout 17, where we toss a coin 80 times and observe 54 Heads. The question we’re trying to answer is: Is this coin fair?

Implement this part of the lab in coin_toss.py (no command line arguments needed, but make sure to include a main function and other helper functions as appropriate). You will need the scipy library, which can be installed with pip3:

pip3 install scipy

If this doesn’t work, it may be that you have multiple versions of Python 3. In that case run:

python3 -m pip install scipy

A) Central limit theorem (CLT) method

First we will try to answer our question using the CLT. Following what we did in class, compute the test statistic (Z-score) for this example, under the null hypothesis of a fair coin. The CLT tells us that this Z-score is approximately drawn from a standard normal distribution: N(0,1).

It is okay to hardcode n = 80, number of Heads = 54, etc. Please encode Tails as 0 and Heads as 1 for the purposes of the lab.

To compute the associated p-value, we will integrate the pdf (probability density function) of the standard normal distribution from |Z| (the absolute value of the Z-score, since Z-scores can be positive or negative) to positive infinity. To perform a two-sided test (i.e. that the coin is unfair in either direction, not just weighted toward Heads), we will then multiply the result by two, which is equivalent to also integrating the pdf from negative infinity to -|Z|.

To integrate (approximately) in Python, you can use the quad function:

from scipy.integrate import quad

result, error = quad(func, a, b)

In this example we integrate the function func from a to b, obtaining the area result along with an estimate of the numerical error error. See the quad documentation for more information.

Here we want to integrate the pdf of the standard normal distribution. This function is implemented for us as norm.pdf, after doing the following import:

from scipy.stats import norm

See the norm documentation for more information.

After obtaining the p-value, make sure to print it out like this:

CLT p-value: <p-value>
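Putting the pieces together, the CLT computation might look like the sketch below (variable names are suggestions; Tails = 0 and Heads = 1 as specified above):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

n = 80
heads = 54          # observed number of Heads
p0 = 0.5            # null hypothesis: fair coin

# Z-score of the observed sample mean under the null
x_bar = heads / n
z = (x_bar - p0) / np.sqrt(p0 * (1 - p0) / n)

# one tail: integrate the standard normal pdf from |Z| to infinity
tail, error = quad(norm.pdf, abs(z), np.inf)
p_value = 2 * tail                  # two-sided test
print("CLT p-value:", p_value)
```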

B) Randomized trials method

The CLT method gives us an approximation of the p-value, based on the assumptions of the CLT (finite variance and “large enough” sample size). We can relax these assumptions by creating our own null distribution through randomized trials. If we flip a fair coin 80 times and record the number of Heads, then repeat this process many times, we will obtain an estimate of the null distribution. The p-value is then the fraction of trials that are at least as extreme (in either direction, since our test is two-sided) as the observed 54 Heads.

After obtaining the p-value, make sure to print it out like this:

Random trials p-value: <p-value>
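One possible sketch of the randomized trials, using numpy’s binomial sampler as a shortcut for 80 individual flips (num_trials and the other names are placeholders):

```python
import numpy as np

n, observed_heads, num_trials = 80, 54, 100000
rng = np.random.default_rng()

# number of Heads in each simulated run of 80 fair coin flips
trial_heads = rng.binomial(n, 0.5, size=num_trials)

# two-sided: a trial is "extreme" if its Heads count is at least as far
# from the expected 40 as our observed 54 is
extreme = np.abs(trial_heads - n / 2) >= abs(observed_heads - n / 2)
p_value = np.mean(extreme)
print("Random trials p-value:", p_value)
```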

Plotting the distribution

The last step of Part 2 is to plot the null distribution and our observed data. Use plt.hist (see the documentation) to plot a histogram of the fraction of Heads observed across all of your trials. I would recommend normalizing the histogram (so it “integrates” to 1) using density=True.

Then create a way of showing where our observed data (i.e. 54 Heads out of 80 tosses) lies on this distribution. Make sure to include axis labels and a title. Save your plot as:

figures/coin_toss.pdf
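Putting the plotting step together, one possible sketch (regenerating the trials inline keeps the example self-contained; the axvline marker is just one way to show where the observed data lies):

```python
import os
import numpy as np
import matplotlib.pyplot as plt

n, observed_heads, num_trials = 80, 54, 100000
rng = np.random.default_rng()
fractions = rng.binomial(n, 0.5, size=num_trials) / n   # fraction of Heads per trial

os.makedirs("figures", exist_ok=True)          # make sure figures/ exists
plt.hist(fractions, bins=30, density=True)     # normalized null distribution
plt.axvline(observed_heads / n, color="r")     # mark the observed 54/80
plt.xlabel("Fraction of Heads")
plt.ylabel("Density")
plt.title("Null distribution: fraction of Heads in 80 fair tosses")
plt.savefig("figures/coin_toss.pdf")
```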


Part 3: Genome Size example

In the next part of the lab we will investigate a database of genomes from Streptococcus suis, a bacterium that causes a zoonotic disease transmissible from pigs to humans. The file SSuis_Stats.xlsx contains statistics about individual S. suis genomes from several populations. The question we will answer in Part 3 is: Are the genome sizes of Population 1 and Population 2 significantly different?

Implement the next part of the lab in genome_size.py (no command line arguments needed). To read in this Excel file we will need a few external libraries, which can be installed with pip3:

pip3 install pandas openpyxl xlrd

or

python3 -m pip install pandas openpyxl xlrd

To read in the data, I recommend creating a helper function that takes in a string filename and returns two lists: one with the genome sizes of Population 1, and one with those of Population 2. First read the data into a pandas data frame:

import pandas as pd

# inside helper function:
data_frame = pd.read_excel(filename, header=0) # header on line 0
print(data_frame)

See the pandas excel documentation for more information about reading Excel files into data frames.

We’re not going to discuss pandas in too much detail, but one of the nice things about this library is that we can obtain subsets of our data fairly easily.

Use the subset documentation to devise a way to filter the rows so you only have those where “Population” is equal to 1. Similarly, filter the rows based on “Population” 2. After that, filter the columns so you only have “Total Genome Size” for each population. (Alternatively you could select this column first, then filter based on population). Finally, convert each set of numbers to a list and return both lists.

To double check your result, the first few genome sizes for each population are:

pop1 = [2100834, 2098424, ... ]
pop2 = [2168777, 2324897, ... ]
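A sketch of the helper, with the filtering split out into its own function so it is easy to test; the column names (“Population”, “Total Genome Size”) are taken from the description above but should be checked against the actual spreadsheet header:

```python
import pandas as pd

def split_populations(df):
    """Return genome-size lists for Populations 1 and 2.

    Assumes columns named "Population" and "Total Genome Size" --
    verify these against the header row of SSuis_Stats.xlsx.
    """
    pop1 = df[df["Population"] == 1]["Total Genome Size"].tolist()
    pop2 = df[df["Population"] == 2]["Total Genome Size"].tolist()
    return pop1, pop2

def load_populations(filename):
    # header on line 0, as in the read_excel call above
    return split_populations(pd.read_excel(filename, header=0))
```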

A) Central limit theorem (CLT) method

As in Part 2, we will now use two different methods to obtain p-values. First we will investigate the CLT method. In this genome size example, we do not know the true mean and variance of the null distribution. It’s difficult to say exactly what the null distribution is! But our null hypothesis is that the genome sizes of samples from Populations 1 and 2 are drawn from the same (null) distribution.

Since we don’t know the mean and variance, we need to use a t-test. See the t-test documentation for more information, but the basic idea is:

from scipy.stats import ttest_ind

result = ttest_ind(a, b)

In this case we will assume equal variances and perform a two-sided test (both are the defaults for ttest_ind). result contains both a test statistic and a p-value. Make sure to print out your p-value like this:

CLT p-value: <p-value>

B) Permutation testing method

In this example our randomized trial method will also be slightly different, since we don’t have access to the true null distribution. Instead, we will use permutation testing. This is a widely applicable concept, but in our case it will mean permuting the labels (i.e. Population 1 vs. 2) of the data, which will mimic the idea that the genome sizes were all drawn from the same distribution.

After obtaining the p-value, make sure to print it out like this:

Permutation testing p-value: <p-value>
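A minimal sketch of the permutation test, assuming pop1 and pop2 are the lists returned by your helper function (the function name and num_perms are placeholders):

```python
import numpy as np

def permutation_pvalue(pop1, pop2, num_perms=10000):
    """Two-sided permutation test on the difference of means."""
    rng = np.random.default_rng()
    combined = np.array(pop1 + pop2)
    n1 = len(pop1)
    observed = abs(np.mean(pop1) - np.mean(pop2))
    count = 0
    for _ in range(num_perms):
        rng.shuffle(combined)              # permute the population labels
        diff = abs(combined[:n1].mean() - combined[n1:].mean())
        if diff >= observed:               # at least as extreme as observed
            count += 1
    return count / num_perms
```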

Note that your two p-values may not be as similar as they were in Part 2, but they should still be quite close (same order of magnitude).


Part 4: COVID cases analysis (OPTIONAL!)

In this last part you can analyze the database of COVID cases from January 1, 2021, which is in the file 01-01-2021.csv. This data is from the Johns Hopkins COVID site. This dataset shows various COVID statistics (number of cases, etc.) broken down by region for one particular day. The idea is to ask a question about this dataset that can be answered with a procedure similar to Part 3.

One possible analysis would be to choose two states in the US, then gather a list of cases per county in each state. This will give you two lists, similar to Part 3. You could then frame a question like this: “On January 1, 2021, was COVID significantly worse in Pennsylvania than in New York?” You could perform a similar analysis for other parts of the world, etc.
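For the two-state idea, a hypothetical sketch of the data-gathering step (the column names "Province_State" and "Confirmed" are assumptions about the CSV header, following the usual format of the JHU daily reports; verify them against the actual file):

```python
import pandas as pd

def cases_by_county(df, state):
    """List of per-county confirmed case counts for one US state.

    Column names are assumptions -- check the header of 01-01-2021.csv.
    """
    return df[df["Province_State"] == state]["Confirmed"].tolist()

# tiny made-up frame standing in for pd.read_csv("01-01-2021.csv"):
demo = pd.DataFrame({
    "Province_State": ["Pennsylvania", "Pennsylvania", "New York"],
    "Confirmed": [100, 250, 300],
})
print(cases_by_county(demo, "Pennsylvania"))   # [100, 250]
```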

Include your implementation in covid_cases.py and write up your procedure/results in your README.md, following the questions below.


Part 5: Analysis

Include the answers to the following questions in your README.md:

  1. For the iris flower dataset in Part 1, what genealogical relationships can we infer from the PCA plot? For your other chosen dataset, what conclusions can you draw from the PCA plot? Does the number of visual clusters match the number of labels?

  2. For Part 2, include both your p-values below. What is the probabilistic interpretation of the p-value in this coin toss situation? Given this interpretation, do you reject the null hypothesis (fair coin) or fail to reject?

  3. For Part 3, include both your p-values below. Do you reject the null hypothesis (genomes from Population 1 and Population 2 are roughly the same size) or fail to reject?

  4. For Part 3, you may have noticed that your p-values were not as close as in Part 2. What assumption of the CLT might be violated with this dataset?

  5. (Optional) If you did Part 4, write up your procedure and results here (what were you trying to test, how you went about it, what you found, etc).


Acknowledgements

Coin toss example based on MIT OpenCourseWare 18.650. Streptococcus suis dataset and genome size example from Eric Miller.