Lab 2: K-means
in-class
For this lab, please work with your randomly assigned partner, with code on one computer.
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Look at the arguments that can be passed to the KMeans constructor. The most important one for this lab is n_clusters. Also look at the available methods. We will again use fit and predict, but this time passing in only the features (i.e. X), not the labels.
Create a new Python file called lab2.py. We will be using the same zip code data, so copy over the code that reads in this data and filters it for just the data points with label 2 or 3. Also make sure that you've removed the first column (Y) and stored it separately from the features (X). Only the features should be passed into k-means. If you have an accuracy function that takes in the true labels and predicted labels, copy that over as well.
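The filtering and splitting step described above might look roughly like the sketch below. It assumes the data has already been read into a NumPy array whose first column is the label. The helper name split_labels_and_features and the tiny toy array are made up for illustration; your own loading code from the previous lab takes the place of the toy array.

```python
import numpy as np

def split_labels_and_features(data, keep_labels=(2, 3)):
    """Keep only rows whose first column is in keep_labels, then split Y from X.

    Assumes `data` is a 2-D array with the label in column 0
    (hypothetical helper, not part of any library).
    """
    mask = np.isin(data[:, 0], keep_labels)   # True for rows labeled 2 or 3
    filtered = data[mask]
    y = filtered[:, 0].astype(int)            # labels, stored separately
    X = filtered[:, 1:]                       # features only, for k-means
    return X, y

# Toy stand-in for the zip code data: label first, then two features.
toy = np.array([
    [2.0, 0.1, 0.2],
    [7.0, 0.9, 0.8],
    [3.0, 0.5, 0.4],
    [2.0, 0.0, 0.3],
])
X, y = split_labels_and_features(toy)
print(X.shape)  # (3, 2)
print(y)        # [2 3 2]
```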
Create an instance of k-means, passing in the desired number of clusters. Then use the fit method with the training data, which will determine the cluster centers. After this, use the predict method (again on the training data) to predict the cluster membership of each training example.
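As a rough sketch of these steps, the example below runs KMeans on two synthetic, well-separated blobs rather than the zip code data. The variable names, the synthetic data, and the choices n_init=10 and random_state=0 are assumptions made so the example is self-contained and repeatable; in lab2.py you would pass in your own training features instead.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the training features: two blobs of 50 points each.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

# Create the model with the desired number of clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit determines the cluster centers from the features only (no labels).
kmeans.fit(X_train)

# predict assigns each example to the nearest cluster center.
train_clusters = kmeans.predict(X_train)

print(kmeans.cluster_centers_.shape)  # one center per cluster: (2, 2)
print(np.unique(train_clusters))      # cluster IDs are 0 and 1
```

Note that predict returns cluster IDs in the range 0 to n_clusters - 1, not the original class labels.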
Since we do have labels in this case, we can evaluate the accuracy of this prediction. Pass the output of predict and the true training labels into your accuracy function (or write one if you don't have it yet) and print the result. If the accuracy is very low, what might be going on? We can also assess the accuracy on the test data, which we did not use to create the model. How does the test accuracy compare to the training accuracy?
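One thing worth checking if the accuracy looks surprisingly low: k-means numbers its clusters arbitrarily (0, 1, ...), so the raw cluster IDs need not match the labels 2 and 3. The sketch below uses small hand-made arrays, not the real data, and the helpers accuracy and map_clusters_to_labels are hypothetical names. It compares the accuracy of the raw cluster IDs with the accuracy under each possible cluster-to-label mapping.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def map_clusters_to_labels(clusters, mapping):
    """Translate cluster IDs into class labels using a dict like {0: 3, 1: 2}."""
    return np.array([mapping[c] for c in clusters])

# Hand-made example: true labels and the cluster IDs k-means might return.
y_true = np.array([2, 2, 3, 3, 3])
clusters = np.array([1, 1, 0, 0, 1])

# Raw cluster IDs (0/1) never equal the labels (2/3).
raw_acc = accuracy(y_true, clusters)

# Try both ways of pairing clusters with labels and keep the results.
results = {}
for mapping in ({0: 2, 1: 3}, {0: 3, 1: 2}):
    mapped = map_clusters_to_labels(clusters, mapping)
    results[tuple(sorted(mapping.items()))] = accuracy(y_true, mapped)

print(raw_acc)
print(results)
```

Which mapping is appropriate for your data is something to decide from the training set; the same mapping should then be reused when scoring the test set.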