Homework 4: Dimensionality Reduction
Due: Monday, Oct 17, 11:59pm on Moodle
In this homework, we will use a combination of methods to perform exploratory data analysis. We'll be studying the NCI60 Cancer Microarray Project dataset, which consists of 6830 genes measured for each of 64 samples, called cell-lines (m << p). One approach in cancer therapy is to compare a new sample to a set of previously studied samples to see which gene profiles are most similar; this can indicate which drugs might be useful for a new variant. In this homework we'll use dimensionality reduction and clustering to compare and group these cell-lines. The samples do have associated names, but the structure within the dataset is not fully captured by these labels.
First, download the data from our textbook website: NCI (microarray). Then run PCA on this dataset, as you did for the Iris flower dataset (make sure the matrix is in the right orientation). Use the result of PCA to create a 2D visualization of the data. No need to label the results yet; we'll do that later on.
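If you're working in Python with scikit-learn, a minimal sketch might look like the following. The filename, delimiter, and orientation (genes as rows) are assumptions; adjust them to match whatever you downloaded.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Load the expression matrix; the filename and genes-as-rows layout are
# assumptions -- adjust so that rows end up being the 64 samples.
X = np.genfromtxt("nci.data.csv", delimiter=",")
X = X.T  # now 64 samples x 6830 genes

# Project each sample onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1])
plt.show()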
Next, run k-means in two different ways:
Choose any k you think is reasonable based on your PCA plot. To visualize the results of k-means, color the points on your PCA plot by cluster membership, once using the clusters found on the raw data and once using the clusters found on the transformed data. What differences, if any, do you notice between the two clusterings? Why might this be the case? Include your responses in your reflection under a heading of "Part 2".
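One possible approach, assuming scikit-learn's KMeans and the X and X_2d arrays from the PCA sketch above (k = 4 is just a placeholder):

from sklearn.cluster import KMeans

k = 4  # placeholder; choose based on your PCA plot
labels_raw = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
labels_2d = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_2d)

# Plot the same 2D PCA coordinates twice, colored by each set of labels.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_2d[:, 0], X_2d[:, 1], c=labels_raw)
axes[0].set_title("k-means on raw data")
axes[1].scatter(X_2d[:, 0], X_2d[:, 1], c=labels_2d)
axes[1].set_title("k-means on transformed data")
plt.show()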
Now vary k from 1 to 12 on the transformed data. For each value of k, compute the within-cluster sum of squares. Also compute the total sum of squares (i.e., the within-cluster sum of squares when k = 1). From these, compute the fraction of variance explained for each k, and create a matplotlib plot showing the fraction of variance explained as a function of k. Do you see an "elbow"? Based on your results, select your choice for a "best" k.
It may be helpful to use the "plot" method, which takes in a list of x-values and a corresponding list of y-values:
import matplotlib.pyplot as plt

plt.plot(x_lst, y_lst, 'ro-')  # red circles with lines connecting them
Label your axes and create a title for your figure. Save your figure as "elbow_plot.png" (PDF okay too).
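Putting these pieces together, here is one hedged sketch, again assuming scikit-learn (its inertia_ attribute is the within-cluster sum of squares) and the X_2d array from above:

# Within-cluster sum of squares for k = 1..12 on the transformed data.
ks = list(range(1, 13))
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_2d).inertia_
       for k in ks]

tss = wss[0]  # with one cluster, the within-cluster SS is the total SS
frac_var = [(tss - w) / tss for w in wss]

plt.plot(ks, frac_var, 'ro-')  # red circles with connecting lines
plt.xlabel("k (number of clusters)")
plt.ylabel("fraction of variance explained")
plt.title("Elbow plot for k-means on transformed data")
plt.savefig("elbow_plot.png")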
Based on your selection of k from Part 4, color your PCA plot according to cluster membership. Then use the names of the samples to include more information in your visualization. Exactly how you do this is up to you: you could label each point individually, devise a way of marking samples that share a name, etc. Include a legend, label your axes, and title your plot. Save your figure as "clustering.png" (PDF okay too).
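One way to fold in the names, assuming the sample names have been read into a list labs (exactly how they're stored depends on your download, so treat that as an assumption) and continuing with X_2d from above:

best_k = 4  # placeholder; use your choice from the elbow plot
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_2d)

# One scatter call per cluster so each cluster gets a legend entry.
for c in range(best_k):
    mask = labels == c
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label="cluster %d" % c)

# Annotate each point with its sample name.
for i, name in enumerate(labs):
    plt.annotate(name, (X_2d[i, 0], X_2d[i, 1]), fontsize=6)

plt.legend()
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("k-means clusters of NCI60 cell-lines (PCA projection)")
plt.savefig("clustering.png")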
Comment on your results from Part 4 in your reflection. Do you think your visualization tells the whole story, or is there more information that cannot be captured in 2D? How did your choice of k (from the elbow plot) compare with your initial impressions from PCA?
Comment on how this homework went relative to the previous weeks. In terms of implementation, what concepts are getting more familiar? What skills/techniques need more practice?
For a small amount of extra credit, run your UPGMA method on this data and submit a transcript of the results showing the clusters being formed, along with the names of the samples. (First you'll need to create a distance matrix.) Comment on the hierarchical clustering results in your reflection. One step further: create a tree visualization for the results.
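If you'd like to sanity-check your own UPGMA implementation against a library, note that scipy's average-linkage hierarchical clustering is UPGMA. A sketch, reusing X and labs from above:

from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

dists = pdist(X)  # condensed matrix of pairwise Euclidean distances
Z = linkage(dists, method="average")  # average linkage == UPGMA

dendrogram(Z, labels=labs, leaf_font_size=6)
plt.title("UPGMA clustering of NCI60 cell-lines")
plt.show()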