CSC 390: Topics in Artificial Intelligence

Lab 2: K-means

in-class

In this lab we will explore k-means, a very popular clustering algorithm. We will use the same data as for Homework 1, which is labeled. However, we will not use the labels to train the model, which is what makes this an unsupervised learning method. Since we do have the labels, they give us an additional way of assessing accuracy.

For this lab, please work with your randomly assigned partner, with code on one computer.

Step 1: Documentation

Often in machine learning, you will not be writing a method from scratch, and will have to learn how to use an existing library or class. We saw this for linear regression in Homework 1. Similarly, k-means is implemented in sklearn, and the documentation is below:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Look at the arguments that can be passed to the constructor for k-means. The most important for this lab will be n_clusters. Also look at the available methods. We will again use fit and predict, but this time we pass in only the features (i.e. X).
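As a quick orientation before reading the documentation in detail, here is a minimal sketch of the constructor, fit, and predict on a toy array. The array X_toy and the choices n_init=10 and random_state=0 are assumptions made for illustration and reproducibility; they are not part of the lab data or requirements.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy features (an assumption for illustration, not the zip code data):
# two well-separated groups of 2-D points.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# n_clusters sets k. n_init and random_state are fixed only so that
# repeated runs give the same result.
model = KMeans(n_clusters=2, n_init=10, random_state=0)

model.fit(X_toy)                 # only features are passed, no labels
clusters = model.predict(X_toy)  # one cluster index per row of X_toy

print(clusters)
```

Note that the values returned by predict are cluster indices (0 to n_clusters - 1), not class labels.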

Step 2: Prepare the data

Create a new Python file called lab2.py. We will be using the same zip code data, so copy over the code that reads in this data and filters it for just the data points with label 2 or 3. Also make sure that you've removed the first column (Y) and stored it separately from the features (X). Only the features should be passed into k-means. If you have an accuracy function that takes in the true labels and predicted labels, copy that over as well.
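If you need a reminder of what this preparation looks like, the following is a rough sketch using NumPy. It assumes each row of the file is whitespace-separated with the label in the first column; the in-memory text below is a tiny made-up stand-in for the real file, and your Homework 1 code may be organized differently.

```python
import io
import numpy as np

# Made-up stand-in for the data file (an assumption): each row is
# "label feature1 feature2 feature3". The real data has many more columns.
raw = io.StringIO(
    "2 0.1 0.2 0.3\n"
    "7 0.9 0.8 0.7\n"
    "3 0.4 0.5 0.6\n"
    "2 0.2 0.1 0.0\n"
)
data = np.loadtxt(raw)

# Keep only the rows whose label (first column) is 2 or 3.
keep = np.isin(data[:, 0], [2, 3])
filtered = data[keep]

# Separate the labels (Y) from the features (X).
Y = filtered[:, 0].astype(int)
X = filtered[:, 1:]

print(Y)        # labels of the rows that were kept
print(X.shape)  # (number of kept rows, number of features)
```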

Step 3: Fit the model

Create an instance of k-means, passing in the desired number of clusters. Then use the fit method with the training data, which will determine the cluster centers. After this, use the predict method (again on the training data) to predict the cluster membership of each training example.
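After fitting, it can be useful to check how many training examples landed in each cluster and what shape the cluster centers have. The sketch below does this on synthetic data; X_train here is randomly generated as an assumption, so substitute your own training features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the training features (an assumption):
# 20 points near (0, 0) and 30 points near (10, 10).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                     rng.normal(10.0, 0.5, size=(30, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X_train)                       # determines the cluster centers
train_clusters = kmeans.predict(X_train)  # cluster index for each example

# Number of training examples assigned to each cluster.
cluster_sizes = np.bincount(train_clusters, minlength=2)
print(cluster_sizes)

# One center per cluster, each with as many entries as there are features.
print(kmeans.cluster_centers_.shape)
```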

Step 4: Assess the accuracy

Since we do have labels in this case, we can evaluate the accuracy of this prediction. Pass the output of predict and the true training labels to your accuracy function (or write one now) and print the result. If the accuracy is very low, what might be going on? We can also assess the accuracy on the test data, which we did not use to create the model. How does this accuracy compare to the accuracy on the training data?
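One thing worth checking when the accuracy looks surprisingly low: k-means returns cluster indices such as 0 and 1, while the true labels are 2 and 3. The sketch below uses small hand-written arrays (an assumption, not real model output) and a hypothetical accuracy helper to show the effect, along with one possible way to handle it in the two-cluster case.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the two label arrays agree."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hand-written example values (an assumption for illustration).
y_true = np.array([2, 2, 3, 3])
clusters = np.array([0, 0, 1, 1])  # cluster indices, as predict would return

# Direct comparison: indices 0/1 can never equal labels 2/3.
raw_acc = accuracy(y_true, clusters)
print(raw_acc)  # 0.0

# Try both ways of naming the two clusters and keep the better one.
option_a = np.where(clusters == 0, 2, 3)
option_b = np.where(clusters == 0, 3, 2)
best_acc = max(accuracy(y_true, option_a), accuracy(y_true, option_b))
print(best_acc)  # 1.0
```

Trying every naming works for two clusters, but the number of possible namings grows quickly as the number of clusters increases.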

Step 5: Unfiltered data

So far, we have considered just two classes, but now we are in a position to consider all of them. Perform the same analysis with the unfiltered dataset. What changes need to be made? Again assess the accuracy. It is probably very low. Why is that? How can you post-process the k-means labels so that the accuracy is meaningful?
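One possible post-processing approach, offered as a sketch rather than the only answer, is to relabel each cluster with the most common true label among its members (a majority vote). The function name map_clusters_to_labels and the example arrays are hypothetical.

```python
import numpy as np

def map_clusters_to_labels(clusters, y_true):
    """Replace each cluster index with the most common true label
    among the examples assigned to that cluster (majority vote)."""
    clusters = np.asarray(clusters)
    y_true = np.asarray(y_true)
    mapped = np.empty_like(y_true)
    for c in np.unique(clusters):
        members = clusters == c
        values, counts = np.unique(y_true[members], return_counts=True)
        mapped[members] = values[np.argmax(counts)]
    return mapped

# Hand-written example values (an assumption for illustration).
clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_true = np.array([5, 5, 3, 3, 3, 9, 9, 9])

mapped = map_clusters_to_labels(clusters, y_true)
mapped_acc = float(np.mean(mapped == y_true))
print(mapped)
print(mapped_acc)
```

With a majority vote, two clusters can end up mapped to the same label, in which case some labels are never predicted. For the test data, the cluster-to-label mapping should come from the training labels rather than the test labels.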