CS360 Lab 2: KD Trees

Due: Thursday, February 8 at 11:59pm

Overview

The goals of this lab are:

Practice OOP (especially recursive data structures) in Python
Implement a naive K-nearest neighbors algorithm as well as the KD trees algorithm
Compare and visualize the runtimes of these two algorithms
Use nearest neighbor ideas for prediction tasks

It is required to complete this lab through pair programming with a partner of your choice (email me if you would like a random partner assignment). Follow good pair programming techniques such as:

One person has these instructions open (“navigator”) and one person is coding (“driver”)
Switch the driver and navigator every 20-30 minutes
Both partners contribute to the discussion and understand all code that has been written
Partners should not “divide and conquer” the lab – all lab work should be done together unless absolutely necessary due to scheduling constraints

One other note: style and documentation will be graded for this and future labs. Make sure to follow the Python style notes section of this lab from CS260.

Part 1: dataset introduction

This dataset is based on the Communities and Crime data which:

“combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR.”

You can find the data here:

/home/smathieson/Public/cs360/communities/communities.names_filter
/home/smathieson/Public/cs360/communities/communities.train
/home/smathieson/Public/cs360/communities/communities.test
/home/smathieson/Public/cs360/communities/communities.rand

If you have a haverford-cs folder that contains all your repo folders, along with a data folder, you can copy the data like this (assuming you’re in your cs360-lab2-name1-name2 directory):

scp smathieson@cook.cs.haverford.edu:/home/smathieson/Public/cs360/communities/communities.* ../data/

The training and test data have 102 columns: 101 features and one value (at the end) to be predicted (ViolentCrimesPerPop). See the communities.names_filter file for a list of all the attributes, and see communities.names for more information about each one.

Analyze the training data only for now. Using np.loadtxt, read in communities.train using a similar command line argument method as Lab 1:

python3 runtime_compare.py -r ../data/communities.train

After loading the data, you should get the following shape:

training data: (1595, 102)

Save the first 101 columns as the features and the last column as the value to be predicted.

Part 2: naive k-nearest neighbors

One way to predict values for the test dataset is:

For each test example x
Find the distance from x to each point in the training data
Take the training point that corresponds to the minimum distance (i.e. the “nearest neighbor”)
Return the label of the nearest neighbor

Within runtime_compare.py, implement this algorithm as a helper function(s). For right now, you don’t need to worry about prediction, just create a function that returns the nearest neighbor. Think about a top-down-design (TDD) that will allow your main function to be clean and concise.

Test your function with the random test dataset:

python3 runtime_compare.py -r ../data/communities.train -e ../data/communities.rand

Edit: initially just implement this for k=1.

Part 3: KD tree implementation

Next, in KDtree.py, create a kd-tree data structure from a list of points (i.e. the features of the training data). Here is a suggested list of classes/steps, although they could be implemented in a different order.

Create a class Point that can calculate the (Euclidean) distance to another Point.
Create a class Node that stores a Point at its root and has pointers to left and right children.
Create a median finding algorithm (it’s ok to sort and then take the middle) that takes a list of Points and a dimension and returns the median Point from that dimension as well as a list of the Points before the median and a list of Points after the median (neither of which include the median).
Create a KDTree class that can create a KD tree when given a list of Points. For debugging purposes, you may want to add the functionality to print out the tree as well.

Nearest neighbor implementation

Once you have a working kd-tree, a very important helper method inside the KDTree class is the nearest neighbor finder. Following the pseudocode in class, this recursive method should take in:

the root of the current subtree
the query
the current depth
the current nearest neighbor
the distance to this neighbor (i.e. the current minimum distance)

Think about the types for each of these parameters, as well as how you might call your recursive helper for each new test point. The return value should be the nearest neighbor, but it may also be helpful to return the min distance. To test your implementation, one way is to make sure you’re getting the same results with the naive version vs. the kd-tree version.

Part 4: Runtime comparison

In this section you will set up experiments to test the runtime of both algorithms (just finding one nearest neighbor, not prediction yet). The idea is to create a plot of runtime on the y-axis vs. n (the number of training points) on the x-axis.

I would recommend using the random test set (all 100 points each time) against larger and larger subsets of the training data. For each value of n, calculate the runtime of both the naive algorithm and kd-trees with:

import time

start = time.time()
<run algorithm on test data>
end = time.time()

It is okay to ignore the runtime of creating the kd-tree. At the end of this step you should have a single plot with n (number of training points) on the x-axis, runtime on the y-axis (any units), two curves (one for naive and one for kd-trees), and a legend. Save your figure as:

figs/runtime.pdf

Part 5: Prediction on real test data

Extend to k > 1

Extend your naive algorithm to k > 1, using a heap to keep track of distances you have seen before. At the end of this step you should be able to return the k nearest neighbors instead of just the nearest one.

Prediction

Now use your ~~kd-tree~~ naive implementation to predict last column of test data (the real test data, not the random data). This should be accomplished with a similar command line:

python3 prediction.py -r ../data/communities.train -e ../data/communities.test -k 1

Where k is the number of neighbors to use. Start with k=1 and compute the accuracy of your results using mean squared error (MSE) as discussed in class. Then increase k until you see a clear pattern, averaging the values of the k-nearest neighbors each time.

Finally, save your results as a plot with k in the x-axis and MSE on the y-axis. Make sure to include axis labels. Save your plot as:

figs/prediction.pdf

Edit: it is optional to extend kd-trees to more than one neighbor.

Analysis Questions and submitting your work

Make sure to fill out the README.md with answers to all the analysis questions! (listed below as well) Be sure to commit your work often to prevent lost data. Only your final pushed solution will be graded.

Interpret your runtime figure. What is the theoretical runtime of the naive algorithm vs. the kd-trees algorithm? Do your plots reflect this? If not, what other factors might be influencing the runtime?
Interpret your plot of prediction accuracy vs. k. What was the optimal value of k? What might be happening as k becomes too large? Too small?
Read and analyze the file communities.names, which contains information about all the features of the data. Some of the features have been removed due to missing data. Explain why some of the data might be missing and how you might attempt to fill in missing values using a k-nearest neighbors approach.
Do you think that your trained model, given updated feature values, would accurately predict crime in these communities in 1996? How about for this year? Why or why not?

Extension (optional, but recommended!)

If you do the extension below, document the process and results in your README.md.

Instead of averaging the labels or values associated with the k-nearest neighbors of a query, implement a weighted average. Think about how you might weight each neighbor, and document any improvement in your results over a basic average.