CS360 Lab 1: K-Nearest Neighbors

Due: Tuesday, September 15 at 11:59pm EDT

Grace period: will be collected Wednesday at noon midnight


Before you start

Read the entire syllabus on the course webpage


Math Pre-exam

Complete the following math pre-exam. This will be graded on effort, not correctness - the purpose of this part of Lab 1 is so that I know how much math review to do later on. You have a few options for working on this part:

  1. Print it out, write down your answers on the pages.
  2. Write your answers on a separate sheet of paper, but clearly number the questions.
  3. LaTeX your answers.

For the first two approaches, “scan” the pages using a phone (I recommend the app “CamScanner”) to create a single PDF. For all approaches, add the final PDF to your Lab 1 repository on GitHub classroom (and make sure to commit and push!)


Overview

The goal of this lab is to implement a supervised machine learning algorithm (in this case K-nearest neighbors) and apply it to a real dataset. Along the way you should familiarize yourself with some of the terminology we have been discussing in class. You will also get a chance to practice working with large datasets (and practice Python if it has been a while).

This first part of this lab (short numpy introduction) will be done in randomly assigned pairs, but the remainder of the lab should be done individually. You may discuss concepts with fellow classmates, but you may not share or view code.

Your starting files (and submissions) will be handled using GitHub classroom. You can first accept the assignment using the link from Piazza. After your repository has been set up online, open the command line locally and clone your repository.

$ git clone git@github.com:haverford-cs/cs360-lab1-[username].git
$ cd cs360-lab1-[username]
$ ls
README.md   knn.py

After making changes, follow this procedure to commit and push your work:

$ git status    # see which files have changed or been added
$ git add <filename>
$ git status    # make sure desired file(s) show up in green
$ git commit -m "informative message"
$ git push

Make sure to push your work frequently to avoid lost progress. A few recommendations for workflow:


Numpy introduction (pair programming)

In lab (on Thursday), find your randomly assigned partner in your breakout room. The person whose first name comes first alphabetically should share their screen first and open up a python3 interpreter on the command line. Throughout this short introduction, make sure you both understand everything! Note: if only one of your has python3 and/or numpy, feel free just to use that screen.

Some steps below are borrowed from numpy quickstart tutorial, which I encourage you to reference throughout the course.

During this course we will often use numpy to manage our datasets. Type the following commands in the python3 interpreter (filling in missing pieces) and make sure the output makes sense:

Array creation and modification

python3
>>> import numpy as np
>>>
>>> # create a 4x5 array of zeros
>>> m = np.zeros([4,5])
>>> m
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
>>>
>>> # create a 3x5 array of zeros using reshape
>>> a = np.arange(15).reshape(3, 5)
>>> a
array([[ 0,  1,  2,  3,  4],
      [ 5,  6,  7,  8,  9],
      [10, 11, 12, 13, 14]])
>>>
>>> # create an array from a list
>>> b = np.array([6, 7, 8])
>>> b
array([6, 7, 8])

TODO and discuss:

Switch who is sharing their screen and typing

Array slicing and concatenation

Try out the slicing operations below, predicting what will be printed each time. Make sure you have modified the m array enough that output is clear (you may need to recreate m if you switched who was sharing the screen).

>>> m[2]
>>> m[:,1]
>>> m[:3,:2]

TODO and discuss:

optional: save your interpreter commands to a file and email to each partner


Dataset introduction (individual work from here)


Step 1: Look at the Data

Make sure to always use python3!

Open knn.py in your preferred editor. This file is mostly blank.

The data for Lab 1 is a set of handwritten digits from zip codes written on hand-addressed letters. The dataset is separated into “train” and “test”, and is available here. Download this data and put it in the directory right above your github repo (not inside the repo - please don’t commit or push this data, it is very large!)

Read about this dataset by going to the Elements of Statistical Learning website, ESL, then clicking on the “Data” tab, then clicking on the “Info” for the zip code dataset. It is similar to the MNIST dataset (examples shown below):

Use the command less in the terminal to view the beginning of each file. Both datasets have the same format: the first column is the “label” (here an integer between 0 and 9, inclusive, that corresponds to the identity of a hand-written zip code digit), and the rest of each row is made up of gray-scale values corresponding to the image of this hand-written digit.

One useful technique is to load a dataset from a file into a numpy array. Here is an example:

import numpy as np
train_data = np.loadtxt("path/to/train/file")

In the file knn.py, test this approach by printing train_data. Make sure the data format makes sense. You can also import the test data in this way. After you load the data, print the shape. The shape of a numpy array represents the dimensions. What is the shape of the training data? Of the testing data?

print(train_data.shape)
print(test_data.shape)

Step 2: Filter the Data

To start, we’ll just consider two classes, but here we have 10. We’ll get to such problems later, but for now, devise a way to retain only the rows which have label 2 or 3. Do this for both the train and test data.

One important note: it may be convenient to relabel the 2’s to -1’s and the 3’s to +1’s, since this will work better with our methods later on (but you don’t have to do this).


Implement K-nearest neighbors

The main goal of the lab is to implement the K-nearest neighbors algorithm to predict the class of each example from the test dataset. Exactly how you implement this part is up to you, but your code should be decomposed into functions, well commented, and easy to understand. Here are some suggestions:

Classification

Create a function that takes as input (at least) a test example and an integer k, and outputs a prediction based on a nearest-neighbor classifier. This function will loop through all the training examples, find the distance between each one and the input test example, and then find the K nearest neighbors (you are welcome to use numpy sorting methods, but look up how they work first). For this subroutine, we’ll need a distance function.

Distance function

An important part of many machine learning methods is the concept of “distance” between examples. We often phrase this as a “metric” on our inputs. Create a function that takes as input two examples (any two examples, although in this case we’ll use it with one test and one train), and outputs the distance (we’ll use Euclidean for now) between them. Although there are many built-in functions the perform this task, please implement your distance function from scratch. However, you are welcome to use numpy functions as part of it (for example, you may use np.sum and similar functions, but look up how they work first).

Quantify the accuracy

Loop through all the filtered test examples, using your classification function to predict the label for each one. Also create a way of determining if the prediction was correct or not, based on the labels of the test data. Compute the fraction or percentage of correctly predicted examples. How does this change as K varies? Try K 1-10 (at least) and record the accuracy.

Output

When I run your program on the command line as shown below:

python3 knn.py

You should output something like this (round accuracy to 3 decimal places):

Nearest Neighbors:
K=1, xx.xxx%
K=2, xx.xxx%
...
K=10, xx.xxx%

Analysis Questions and submitting your work

Make sure to fill out the README.md with answers to all the analysis questions! Also make sure to add your math diagnostic to the github repo. Be sure to commit your work often to prevent lost data. Only your final pushed solution will be graded.

Credit: based on Exercise 2.8 from “The Elements of Statistical Learning”


If you do any of the extensions, document this in your README.md (especially if you included command line arguments). I occasionally demo excellent extensions in class.

  1. Extend your algorithm to a multi-class setting (i.e. distinguish between 3 or more digits). How does this change your best value of K?

  2. If you are familiar with confusion matrices, create one for this test dataset and your “best” value of K.

  3. Create a plot of accuracy vs. K.

  4. Visualize some of the examples that were classified incorrectly. The examples are 16x16 gray-scale images, so you can plot them on a grid.

  5. Create command line arguments for which digits to include (here we chose 2 and 3, but you could let the user decide). Or you could let the user decide how much training data to use, what K value to use, or how many classes to use.