The goal of this lab is to implement a supervised machine learning algorithm and apply it to a real dataset. Along the way you should familiarize yourself with some of the terminology we have been discussing in class. You will also get a chance to practice working with large datasets (and practice Python if it has been a while).
This lab will be done in pairs, which have been randomly assigned for this lab. After that, you will have the freedom to choose a partner. Your starting point files (and submissions) will be handled using the GitHub Enterprise interface (see below). This is similar to what most of you have seen in the CS31, CS35, and other upper-level courses. Your partner for this week will be listed in the name of your Lab 1 repo (see below). You may discuss concepts with a fellow classmate, especially if you are having difficulty with the details of transcription or translation. You may not share code, however, with students that are not your lab partner.
To get started, first create a cs66
directory in your home directory, and add a labs subdirectory to it:
mkdir cs66
cd cs66
mkdir labs
cd labs
pwd
We will be using git repos hosted on the college’s GitHub server for labs in this class. If you have not used git or the college’s GitHub server before, here are some instructions: Using Git (follow the instructions for repos on Swarthmore’s GitHub Enterprise server).
Next find your git repo for this lab assignment off the GitHub server for our class: cs66-s19
Clone your git repo with the lab 1 starter files into your labs directory:
cd ~/cs66/labs
git clone [the ssh url to your your repo]
Then cd into your lab01-id1_id2
subdirectory (id1 being replaced by the user id2 of you and your partner).
Make sure to always use python3
!
The data for Lab 1 is a set of handwritten digits from zip codes written on hand-addressed letters. The datasets is separated into “train” and “test”, and is available here:
/home/smathieson/public/cs66/zip/zip.test
/home/smathieson/public/cs66/zip/zip.train
Read about this dataset by going to the Elements of Statistical Learning website, ESL, then clicking on the “Data” tab, then clicking on the “Info” for the zip code dataset.
Use the command less
in the terminal to view the beginning of each file. Both datasets have the same format: the first column is the “label” (here an integer between 0 and 9, inclusive, that corresponds to the identity of a hand-written zip code digit), and the rest of each row is made up of gray-scale values corresponding to the image of this hand-written digit.
During this course we will be using numpy
to manage many of our datasets. numpy
arrays and slicing will often be useful. Type the following commands in the python interpreter and make sure the output makes sense:
python3
>>> import numpy as np
>>> m = np.zeros([4,5])
>>> m
array([[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.]])
>>> m[2][1] = 7
>>> m
array([[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 7., 0., 0., 0.],
[ 0., 0., 0., 0., 0.]])
>>> m[:,1] # slice the one-th column of m
array([ 0., 0., 7., 0.])
>>> m[2] # get the two-th row of m
array([ 0., 7., 0., 0., 0.])
>>> m[:3,:2] # 3x2 slice from upper-left
array([[0., 0.],
[0., 0.],
[0., 7.]])
One useful technique is to load a dataset from a file into a numpy
array. Here is an example:
import numpy as np
train_data = np.loadtxt("path/to/train/file")
In the file knn.py
(create this file if you’re on the waitlist), test this approach by printing train_data
. Make sure the data format makes sense. You can also import the test data in this way. After you load the data, print the shape. The shape of a numpy
array represents the dimensions. What is the shape of the training data? Of the testing data?
print(train_data.shape)
print(test_data.shape)
To start, we’ll just consider two classes, but here we have 10. We’ll get to such problems later, but for now, devise a way to retain only the rows which have label 2 or 3. Do this for both the train and test data.
One important note: it may be convenient to relabel the 2’s to 0’s and the 3’s to 1’s, since this will work better with our methods later on.
The main goal of the lab is to implement the K-nearest neighbors algorithm to predict the class of each example from the test dataset. Exactly how you implement this part is up to you and your partner, but your code should be decomposed into functions, well commented, and easy to understand. Here are some suggestions:
Create a function that takes as input (at least) a test example and an integer k, and outputs a prediction based on a nearest-neighbor classifier. This function will loop through all the training examples, find the distance between each one and the input test example, and then find the k nearest neighbors. For this subroutine, we’ll need a distance function.
An important part of many machine learning methods is the concept of “distance” between examples. We often phrase this as a “metric” on our inputs. Create a function that takes as input two examples (any two examples, although in this case we’ll use it with one test and one train), and outputs the distance (we’ll use Euclidean for now) between them. Although there are many built-in functions the perform this task, please implement your distance function from scratch.
Loop through all the filtered test examples, using your classification function to predict the label for each one. Also create a way of determining if the prediction was correct or not, based on the labels of the test data. Compute the fraction or percentage of correctly predicted examples. How does this change as k varies? Try several values of k and record the accuracy.
When I run your program on the command line as shown below:
python3 knn.py
You should output something like this:
Nearest Neighbors:
k=value1, xx%
k=value2, xx%
...
k=value5, xx%
Make sure to fill out the README.md with answers to all the analysis questions! Be sure to commit your work often to prevent lost data. Only your final pushed solution will be graded.
Credit: based on Exercise 2.8 from “The Elements of Statistical Learning”
If you do any of the extensions, document this in your README (especially if you included command line arguments).
Extend your algorithm to a multi-class setting (i.e. distinguish between 3 or more digits). How does this change your best value of k?
If you are familiar with confusion matrices, create one for this test dataset and your “best” value of k.
Create a plot of accuracy vs. k.
Visualize some of the examples that were classified incorrectly. The examples are 16x16 gray-scale images, so you can plot them on a grid.
Create command line arguments for which digits to include (here we chose 2 and 3, but you could let the user decide). Or you could let the user decide how much training data to use, what k value to use, or how many classes to use.
Name 1:
Name 2:
userId1:
userId2:
Number of Late Days Using for this lab:
What values of k did you try?
Which value of k produced the highest accuracy? What general trends did you observe as k increased?
When using the entire training dataset, what are your observations about the runtime of K-nearest neighbors? List 1-2 ideas for making this algorithm faster.
(None of your answers below will affect your grade; this is to help refine lab assignments in the future)
Approximately, how many hours did you take to complete this lab? (provide your answer as a single integer on the line below)
How difficult did you find this lab? (1-5, with 5 being very difficult and 1 being very easy)
Describe the biggest challenge you faced on this lab: