CSC 390: Topics in Artificial Intelligence

Lab 1: Nearest neighbors

in-class, to be turned in as part of Homework 1

The goal of this lab is to introduce a foundational method in supervised learning, nearest neighbors. Most of this course is dedicated to unsupervised learning, but to understand the distinction, we first consider a problem where we know the labels of the data. We'll also practice a bit of backwards software design.

Before starting this lab, make sure to finish Lab 0.

Step 1: Classify function

Create a Python function that takes as input a TEST example and an integer k and outputs a prediction based on a nearest-neighbor classifier. (It may also be helpful to pass in the entire TRAINING dataset, including the labels, but you can also leave these as global variables.) This function will loop through all the TRAINING examples, find the distance between each one and the input test example, and then find the k nearest neighbors; the prediction is the most common label among those neighbors. For this subroutine, we'll need a distance function.
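As a rough sketch of the shape this function might take (not the required solution), the version below passes the training data in as arguments. The names classify, train_examples, and train_labels are illustrative assumptions, and the small inline helper only stands in for the distance function from Step 2.

```python
import math
from collections import Counter


def _distance(a, b):
    # Stand-in for the Step 2 distance function (Euclidean).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(test_example, k, train_examples, train_labels):
    """Predict a label for test_example by majority vote among its k nearest training examples."""
    # Pair each training label with its distance to the test example.
    distances = []
    for example, label in zip(train_examples, train_labels):
        distances.append((_distance(test_example, example), label))

    # Sort by distance only, then keep the labels of the k closest examples.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]

    # Return the most common label among the k neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]
```

Sorting every distance is the simplest approach and is fine for a dataset of modest size. With an even k the vote can tie, so it is worth deciding how your own version should break ties.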

Step 2: Distance function

An important part of many machine learning methods is the concept of "distance" between examples. We often phrase this as a "metric" on our inputs. Create a Python function that takes as input two examples (any two, although in this case we'll use it with one test example and one training example) and outputs the distance between them (we'll use Euclidean distance for now).
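For reference, the Euclidean distance between two examples a and b with d features each is the square root of the sum of (a_i - b_i)^2 over i = 1, ..., d. A minimal sketch follows, assuming each example is a plain sequence of numbers; the name euclidean_distance is an illustrative choice, not a required one.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two examples given as equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("examples must have the same number of features")
    # Sum the squared feature-by-feature differences, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, euclidean_distance([0, 0], [3, 4]) returns 5.0. If your data is stored in NumPy arrays, the same computation can be written with array operations instead of a Python loop.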

Step 3: Quantify the accuracy

Loop through all the TEST examples (not the training examples), using your classification function to predict the label for each one. During this loop, also create a way of determining whether each prediction was correct, based on the labels of the TEST data. Compute the fraction or percentage of correctly predicted examples. How does this change as k varies? Try several values of k and record the accuracy. In the next part of Homework 1, these results can be directly compared with linear regression.
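The accuracy loop could be sketched as below. This assumes a classifier that can be called as classify(example, k), with the training data either global or already bound in. The names accuracy and accuracy_by_k are hypothetical helpers introduced only for illustration.

```python
def accuracy(classify, k, test_examples, test_labels):
    """Fraction of test examples whose predicted label matches the true label."""
    if len(test_examples) == 0:
        raise ValueError("need at least one test example")
    correct = 0
    for example, true_label in zip(test_examples, test_labels):
        if classify(example, k) == true_label:
            correct += 1
    return correct / len(test_examples)


def accuracy_by_k(classify, ks, test_examples, test_labels):
    """Map each value of k in ks to the test accuracy obtained with that k."""
    return {k: accuracy(classify, k, test_examples, test_labels) for k in ks}
```

If your classifier also takes the training data as arguments, it can be wrapped first, for example with lambda x, k: classify(x, k, train_examples, train_labels). Multiply the result by 100 to report a percentage.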

Step 4: Save your work

Make sure to save your work, since this lab will be turned in as part of Homework 1.

Credit: based on Exercise 2.8 from "The Elements of Statistical Learning"