CS66 Lab 8: Unsupervised Learning

This lab is optional


Overview and goals

The goals of this lab are:

  1. apply K-means clustering to the zip code digit data and assess how well the clusters match the true labels
  2. fit Gaussian Mixture Models with different covariance structures and visualize the fitted covariances
  3. use PCA to reduce the Iris flower data to two dimensions and visualize the classes


Part 1: K-means on the Zip Code Dataset

In this lab we will explore K-means, a very popular clustering algorithm. We will use the same data as for Lab 1, which is labeled (zip code data):

/home/smathieson/public/cs66/zip/zip.test
/home/smathieson/public/cs66/zip/zip.train

To train the model, however, we will not use the labels (which is what makes this method unsupervised learning). But since we do have labels, we have an additional way of assessing accuracy.

K-means is implemented in sklearn; first investigate the documentation.

Look at the arguments that can be passed to the constructor for K-means. The most important for this lab will be n_clusters. Also look at the available methods. We will use fit and predict, but pass in only the features (i.e. X).

Create a new file called kmeans.py. We will be using the same zip code data, so copy over the code that reads in this data and filters it for just the data points with label 2 or 3. Also make sure that you’ve removed the first column (y) and stored it separately from the features (X). Only the features should be passed into K-means. If you have an accuracy function that takes in the true labels and predicted labels, copy that over as well.
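Here is one possible sketch of this setup (the helper names load_zip and accuracy are just placeholders; adapt your own Lab 1 code if it differs):

import numpy as np

def load_zip(filename, keep_labels=(2, 3)):
    """Load zip code data; first column is the label, the rest are pixel features."""
    data = np.loadtxt(filename)
    y = data[:, 0]                      # labels (first column)
    X = data[:, 1:]                     # features only
    mask = np.isin(y, keep_labels)      # keep only the requested digits
    return X[mask], y[mask]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return np.mean(y_true == y_pred)

X_train, y_train = load_zip("/home/smathieson/public/cs66/zip/zip.train")
X_test, y_test = load_zip("/home/smathieson/public/cs66/zip/zip.test")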

Create an instance of the KMeans class, passing in the desired number of clusters (try out K matching the number of classes you retained, and K not matching this number). Then use the fit method with the training data, which will determine the cluster centers.

After this, use the predict method on the test data to predict the cluster membership of each test example. Since we do have labels in this case, we can evaluate the accuracy of this prediction. Pass the output of predict and the true test labels into your accuracy method (or write an accuracy method) and print the result. Why might your accuracy be very low?
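As a rough sketch (assuming the X_train, X_test, y_test variables and accuracy function from the setup above):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2)           # try K equal to (and not equal to) the number of classes kept
kmeans.fit(X_train)                     # learn cluster centers from the features only (no labels)
cluster_ids = kmeans.predict(X_test)    # cluster assignment (0 .. K-1) for each test example

print(accuracy(y_test, cluster_ids))    # likely very low: cluster ids are not digit labels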

So far, we have just considered two classes, but now we are in a position to consider all of them. Perform the same analysis with the unfiltered dataset. What changes need to be made? Again assess the accuracy. It is probably very low. Why is that? How can you post-process the K-means labels to properly consider the accuracy?
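One common post-processing approach (a sketch, not the only option) is to map each cluster id to the most frequent true label among the training points assigned to that cluster, then re-score the accuracy. This builds on the variables from the sketches above:

# map each cluster id to the most common true label among its training members
train_clusters = kmeans.predict(X_train)
cluster_to_label = {}
for k in range(kmeans.n_clusters):
    members = y_train[train_clusters == k].astype(int)
    cluster_to_label[k] = np.bincount(members).argmax() if len(members) else -1

# relabel the test-set cluster assignments and re-compute accuracy
mapped_preds = np.array([cluster_to_label[c] for c in cluster_ids])
print(accuracy(y_test, mapped_preds))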


Part 2: Gaussian Mixture Models on the Iris Flower Dataset

Next we will be visualizing a Gaussian Mixture Model (GMM) with different covariance matrices.

Credit: based on this Iris flower clustering exercise.

First, download gmm.py and look at the code. Load the Iris data to obtain X and y, as shown below. Run the code and make sure the scatter plot shows up.

iris = datasets.load_iris()
X = iris.data
y = iris.target

The next step is to create the Gaussian mixture model and fit it to our data. Here is the GMM documentation.

Create an instance and then use fit with our features X. Use 3 components and the argument:

covariance_type='spherical' # use spherical covariances to start

After fitting the model, print the means and covariances. Make sure the dimensions make sense.
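A possible sketch (the variable name gmm is just a placeholder, and gmm.py may already contain the necessary import):

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type='spherical')
gmm.fit(X)                      # fit to the Iris features only (no labels)

print(gmm.means_)               # shape (3, 4): one mean vector per component
print(gmm.covariances_)         # shape (3,) for spherical: one variance per component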

Using the make_ellipses function provided, we can visualize the covariances. To call it, pass in your GMM object, as well as the "axes" for your graph.

make_ellipses(gaussian_mixture_model, ax) # pass in your model object

Make sure you can visualize the results. The colors might not match up correctly (you could run it a few times until they do) since this is an unsupervised learning approach. Does this agree with our notion of a spherical covariance matrix?
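For example, assuming the gmm object fit above and the make_ellipses function provided in gmm.py (matplotlib may already be imported there):

import matplotlib.pyplot as plt

ax = plt.gca()                  # current axes of the existing scatter plot
make_ellipses(gmm, ax)          # overlay one ellipse per Gaussian component
plt.show()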

Now change the covariance matrix type to diag (diagonal covariance matrix) and run the results again. Does this look like a better or worse fit? Do the same thing for full and tied and comment on the results.


Part 3: PCA on the Iris Flower Dataset

Finally we will apply PCA to a classic dataset of Iris flowers. This dataset is built into the package we will use: sklearn. To use this module, we will need to work in the virtual machine for our class.

Create a new python file pca_iris.py and import the necessary packages:

import numpy as np
import matplotlib.pyplot as plt

from sklearn import decomposition
from sklearn import datasets

The documentation for the PCA module can be found here.

The datasets module contains some example datasets, so we don’t have to download them. To load the Iris dataset, use:

iris = datasets.load_iris()
X = iris.data
y = iris.target

X is our n x p matrix, where n = 150 samples and p = 4 features. Print X. The features in this case are physical features of the flowers:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm

Our goal here is to use PCA to visualize this data (since we cannot view the points in 4D!). Through the process of visualization, we’ll see if our data can be clustered based on the features above.

Now print y. These are the labels of each sample (subspecies of flower in this case). The three subspecies are Setosa, Versicolour, and Virginica.

y is not used in PCA, it is only used afterward to visualize any clustering apparent in the results.

Run PCA

Using the PCA documentation, create an instance of PCA with 2 components, then “fit” it using only X. Then use the transform method to transform the features (X) into 2D. (3 lines of code!)
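A possible sketch (the variable names pca and X_2d are just placeholders):

pca = decomposition.PCA(n_components=2)   # keep the first two principal components
pca.fit(X)                                # learn the components from the features only
X_2d = pca.transform(X)                   # project the 4D data down to 2D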

Plot the results

Now we will use matplotlib to plot the results in 2D, as a scatter plot. The scatter function takes in a list of x-coordinates, a list of y-coordinates, and a list of colors for each point. Here the two axes are PC1 and PC2, and the points to plot are formed from the first and second columns of the transformed data.

plt.scatter(x_coordinates, y_coordinates, c=colors) # example

First plot without the colors to see what you get. Then create a color dictionary that maps each label to a chosen color. Some colors have 1-letter abbreviations.

After creating a color dictionary, use it to create a list of colors for all the points, then pass that in to your scatter method.
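For instance, a sketch along these lines (the specific colors in color_dict are arbitrary choices, and X_2d is the transformed data from above):

color_dict = {0: 'b', 1: 'r', 2: 'g'}          # one 1-letter matplotlib color per class label
colors = [color_dict[label] for label in y]    # one color per sample

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=colors)  # PC1 on the x-axis, PC2 on the y-axis
plt.xlabel("PC1")
plt.ylabel("PC2")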

Legends are often created automatically in matplotlib, but for a scatter plot it’s a bit more difficult. We’ll actually create extra plot objects with no points, and then use these for our legend. Consider the code below. For each color, we’re creating an empty plot (the ‘o’ means use circles for plotting). Then we’re mapping these empty objects to the names of the flowers (create a list of names, one per class).

# create legend
leg_objects = []
for i in range(3):
    # empty plot in this class's color; 'o' means plot with circles
    circle, = plt.plot([], color_dict[i] + 'o')
    leg_objects.append(circle)
plt.legend(leg_objects, names)

At the end, you should get a picture like this. We can see that Setosa is further away from the other two subspecies with respect to PC1, suggesting that a tree relating these groups would have Versicolour and Virginica joining together first (i.e. most recently), and then finding a common ancestor with Setosa further back in time. This is the “genealogical interpretation of PCA”.

Analysis & Extension questions

  1. Create an “elbow plot” for K-means (see the sketch after this list). Does your “best” K match the true number of clusters? (could be 2 if you filtered, or 10 if you didn’t)

  2. Apply both K-means and GMMs to the same dataset (i.e. either choose the zip code or the Iris flower dataset). Comment on the method comparison.

  3. For PCA, what fraction of the variance is explained by the first PC? By the second? (Hint: see PCA documentation)
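For question 1, an elbow plot might look something like the sketch below (assuming the X_train features from Part 1; inertia_ is the within-cluster sum of squared distances stored by sklearn’s KMeans after fitting):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 15)
inertias = [KMeans(n_clusters=k).fit(X_train).inertia_ for k in ks]

plt.plot(ks, inertias, 'o-')
plt.xlabel("K (number of clusters)")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.show()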