The goals of this lab are:

- Practice OOP (especially recursive data structures) in Python
- Implement a naive K-nearest neighbors algorithm as well as the KD trees algorithm
- Compare and visualize the runtimes of these two algorithms
- Use nearest neighbor ideas for prediction tasks

It is required to complete this lab through pair programming with a partner of your choice (email me if you would like a random partner assignment). Follow good pair programming techniques such as:

- One person has these instructions open (“navigator”) and one person is coding (“driver”)
- Switch the driver and navigator every 20-30 minutes
- Both partners contribute to the discussion and understand all code that has been written
- Partners should not “divide and conquer” the lab – all lab work should be done together unless absolutely necessary due to scheduling constraints

One other note: style and documentation will be graded for this and future labs. Make sure to follow the *Python style notes* section of this lab from CS260.

This dataset is based on the Communities and Crime data which:

“combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR.”

You can find the data here:

```
/home/smathieson/Public/cs360/communities/communities.names_filter
/home/smathieson/Public/cs360/communities/communities.train
/home/smathieson/Public/cs360/communities/communities.test
/home/smathieson/Public/cs360/communities/communities.rand
```

If you have a `haverford-cs` folder that contains all your repo folders, along with a `data` folder, you can copy the data like this (assuming you're in your `cs360-lab2-name1-name2` directory):

`scp smathieson@cook.cs.haverford.edu:/home/smathieson/Public/cs360/communities/communities.* ../data/`

The training and test data have 102 columns: 101 features and one value (at the end) to be predicted (ViolentCrimesPerPop). See the `communities.names_filter` file for a list of all the attributes, and see `communities.names` for more information about each one.

Analyze the training data only for now. Using `np.loadtxt`, read in `communities.train` using a similar command line argument method as Lab 1:

`python3 runtime_compare.py -r ../data/communities.train`

After loading the data, you should get the following `shape`:

`training data: (1595, 102)`

Save the first 101 columns as the features and the last column as the value to be predicted.
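One way to sketch this loading-and-splitting step (the function name `load_data` is just an illustrative choice, not required by the lab):

```python
import numpy as np

def load_data(filename):
    """Load a communities file and split it into features and values.

    Assumes the last column is the value to predict (ViolentCrimesPerPop)
    and all earlier columns are features.
    """
    data = np.loadtxt(filename)
    X = data[:, :-1]  # first columns: features
    y = data[:, -1]   # last column: value to be predicted
    return X, y
```

With `communities.train`, `X.shape` should be `(1595, 101)` and `y.shape` should be `(1595,)`.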

One way to predict values for the test dataset is:

- For each *test* example `x`:
    - Find the distance from `x` to each point in the training data
    - Take the training point that corresponds to the minimum distance (i.e. the "nearest neighbor")
    - Return the label of the nearest neighbor

Within `runtime_compare.py`, implement this algorithm as a helper function(s). For right now, you don't need to worry about prediction, just create a function that returns the nearest neighbor. Think about a top-down design (TDD) that will allow your `main` function to be clean and concise.
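One possible shape for such a helper, using NumPy to compute all distances at once (the function name and return convention are just one design choice):

```python
import numpy as np

def nearest_neighbor(train_X, query):
    """Return the index of the training point closest to the query.

    train_X: 2D array of training features (one row per point)
    query:   1D array with the same number of columns
    """
    # Euclidean distance from the query to every training row at once
    dists = np.sqrt(np.sum((train_X - query) ** 2, axis=1))
    return int(np.argmin(dists))
```

Returning an index (rather than the row itself) makes it easy to look up the corresponding label later for prediction.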

Test your function with the *random* test dataset:

`python3 runtime_compare.py -r ../data/communities.train -e ../data/communities.rand`

Edit: initially just implement this for k=1.

Next, in `KDtree.py`, create a kd-tree data structure from a list of points (i.e. the features of the training data). Here is a suggested list of classes/steps, although they could be implemented in a different order.

- Create a class `Point` that can calculate the (Euclidean) distance to another `Point`.
- Create a class `Node` that stores a `Point` at its root and has pointers to left and right children.
- Create a median-finding algorithm (it's ok to sort and then take the middle) that takes a list of `Point`s and a dimension and returns the median `Point` from that dimension, as well as a list of the `Point`s before the median and a list of the `Point`s after the median (neither of which include the median).
- Create a `KDTree` class that can create a kd-tree when given a list of `Point`s. For debugging purposes, you may want to add the functionality to print out the tree as well.
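The steps above could be sketched roughly as follows (class and method names beyond `Point`, `Node`, and `KDTree` are illustrative, and your design may differ):

```python
class Point:
    """A point in d-dimensional space."""
    def __init__(self, coords):
        self.coords = coords

    def distance(self, other):
        # Euclidean distance to another Point
        return sum((a - b) ** 2 for a, b in zip(self.coords, other.coords)) ** 0.5

class Node:
    """A kd-tree node: a Point plus left/right children."""
    def __init__(self, point, left=None, right=None):
        self.point = point
        self.left = left
        self.right = right

def split_on_median(points, dim):
    """Sort by one dimension; return (median, points before, points after)."""
    ordered = sorted(points, key=lambda p: p.coords[dim])
    mid = len(ordered) // 2
    return ordered[mid], ordered[:mid], ordered[mid + 1:]

class KDTree:
    def __init__(self, points):
        self.num_dims = len(points[0].coords)
        self.root = self._build(points, depth=0)

    def _build(self, points, depth):
        if not points:
            return None
        dim = depth % self.num_dims  # cycle through dimensions by depth
        median, before, after = split_on_median(points, dim)
        return Node(median,
                    left=self._build(before, depth + 1),
                    right=self._build(after, depth + 1))
```

Cycling the splitting dimension with `depth % num_dims` is the standard construction; with 101 features, each level of the tree splits on the next feature in turn.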

Once you have a working kd-tree, a very important helper method inside the `KDTree` class is the nearest neighbor finder. Following the pseudocode in class, this recursive method should take in:

- the root of the current subtree
- the query
- the current depth
- the current nearest neighbor
- the distance to this neighbor (i.e. the current minimum distance)

Think about the types for each of these parameters, as well as how you might call your recursive helper for each new test point. The return value should be the nearest neighbor, but it may also be helpful to return the min distance. To test your implementation, one way is to make sure you’re getting the same results with the naive version vs. the kd-tree version.
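One possible shape for this recursive search, sketched here as a standalone function over a minimal stand-in node class (inside your `KDTree` class it would be a method, but the parameters and pruning logic are the same):

```python
import math

class Node:
    # minimal stand-in for your Node class, holding a tuple of coordinates
    def __init__(self, point, left=None, right=None):
        self.point, self.left, self.right = point, left, right

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nn_search(node, query, depth, best, best_dist):
    """Return (nearest point, min distance) in the subtree rooted at node."""
    if node is None:
        return best, best_dist
    d = dist(node.point, query)
    if d < best_dist:
        best, best_dist = node.point, d
    dim = depth % len(query)
    diff = query[dim] - node.point[dim]
    # recurse into the side of the splitting plane containing the query first
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best, best_dist = nn_search(near, query, depth + 1, best, best_dist)
    # only search the far side if the splitting plane could hide a closer point
    if abs(diff) < best_dist:
        best, best_dist = nn_search(far, query, depth + 1, best, best_dist)
    return best, best_dist
```

The initial call for each test point would pass the tree's root, depth 0, no neighbor yet, and an infinite minimum distance, e.g. `nn_search(root, query, 0, None, float("inf"))`.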

In this section you will set up experiments to test the runtime of both algorithms (just finding one nearest neighbor, not prediction yet). The idea is to create a plot of runtime on the y-axis vs. `n` (the number of training points) on the x-axis.

I would recommend using the random test set (all 100 points each time) against larger and larger subsets of the training data. For each value of `n`, calculate the runtime of both the naive algorithm and kd-trees with:

```
import time
start = time.time()
<run algorithm on test data>
end = time.time()
```

It is okay to ignore the runtime of creating the kd-tree. At the end of this step you should have a single plot with `n` (number of training points) on the x-axis, runtime on the y-axis (any units), two curves (one for naive and one for kd-trees), and a legend. Save your figure as:

`figs/runtime.pdf`
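The timing pattern above can be wrapped in a small helper so both algorithms are measured the same way. In the sketch below, `naive_all(n)` and `kdtree_all(n)` are hypothetical names for whatever functions run all 100 random queries against the first `n` training points:

```python
import time

def time_algorithm(run_queries):
    """Time a zero-argument callable (e.g. one that runs all test queries)."""
    start = time.time()
    run_queries()
    return time.time() - start

# Hypothetical experiment loop (naive_all and kdtree_all are placeholders):
#
#   import matplotlib.pyplot as plt
#   ns = [100, 200, 400, 800, 1595]
#   plt.plot(ns, [time_algorithm(lambda: naive_all(n)) for n in ns], label="naive")
#   plt.plot(ns, [time_algorithm(lambda: kdtree_all(n)) for n in ns], label="kd-tree")
#   plt.xlabel("n (number of training points)")
#   plt.ylabel("runtime (seconds)")
#   plt.legend()
#   plt.savefig("figs/runtime.pdf")
```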

Extend your *naive* algorithm to `k > 1`, using a heap to keep track of distances you have seen before. At the end of this step you should be able to return the `k` nearest neighbors instead of just the nearest one.
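One common heap-based approach, sketched with Python's `heapq` (a min-heap), is to store *negated* distances so the heap behaves like a max-heap of size `k`: the worst neighbor kept so far sits at the top and is evicted whenever a closer point appears. The function name is illustrative:

```python
import heapq
import numpy as np

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to the query,
    ordered from nearest to farthest."""
    heap = []  # entries are (-distance, index); top holds the farthest kept
    for i, row in enumerate(train_X):
        d = np.sqrt(np.sum((row - query) ** 2))
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif -d > heap[0][0]:  # d is smaller than the current k-th distance
            heapq.heapreplace(heap, (-d, i))
    # sort kept neighbors from nearest to farthest
    return [i for _, i in sorted(heap, reverse=True)]
```

This keeps the heap at size `k` throughout, so each of the `n` training points costs only `O(log k)` heap work.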

Now use your ~~kd-tree~~ *naive* implementation to *predict* the last column of the test data (the real test data, not the random data). This should be accomplished with a similar command line:

`python3 prediction.py -r ../data/communities.train -e ../data/communities.test -k 1`

where `k` is the number of neighbors to use. Start with `k=1` and compute the accuracy of your results using mean squared error (MSE) as discussed in class. Then increase `k` until you see a clear pattern, *averaging* the values of the k-nearest neighbors each time.
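The predict-then-score step could look roughly like this (function names are illustrative; your version would reuse the heap-based helper rather than the `argsort` shortcut shown here):

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k):
    """Predict each test value as the average of its k nearest neighbors."""
    preds = []
    for query in test_X:
        dists = np.sqrt(np.sum((train_X - query) ** 2, axis=1))
        nearest = np.argsort(dists)[:k]  # indices of the k closest points
        preds.append(np.mean(train_y[nearest]))
    return np.array(preds)

def mse(actual, predicted):
    """Mean squared error between true and predicted values."""
    return np.mean((actual - predicted) ** 2)
```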

Finally, save your results as a plot with `k` on the x-axis and MSE on the y-axis. Make sure to include axis labels. Save your plot as:

`figs/prediction.pdf`

Edit: it is *optional* to extend kd-trees to more than one neighbor.

Make sure to fill out the `README.md` with answers to all the **analysis questions** (listed below as well)! Be sure to commit your work often to prevent lost data. Only your final pushed solution will be graded.

- Interpret your runtime figure. What is the theoretical runtime of the naive algorithm vs. the kd-trees algorithm? Do your plots reflect this? If not, what other factors might be influencing the runtime?
- Interpret your plot of prediction accuracy vs. `k`. What was the optimal value of `k`? What might be happening as `k` becomes too large? Too small?
- Read and analyze the file `communities.names`, which contains information about all the features of the data. Some of the features have been removed due to missing data. Explain why some of the data might be missing and how you might attempt to fill in missing values using a k-nearest neighbors approach.
- Do you think that your trained model, given updated feature values, would accurately predict crime in these communities in 1996? How about for this year? Why or why not?

If you do the extension below, document the process and results in your `README.md`.

Instead of averaging the labels or values associated with the k-nearest neighbors of a query, implement a *weighted* average. Think about how you might weight each neighbor, and document any improvement in your results over a basic average.
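One common weighting scheme (though not the only one) is inverse-distance weighting, sketched below; the `eps` parameter is an illustrative guard against division by zero when a training point coincides with the query:

```python
import numpy as np

def weighted_knn_predict(train_X, train_y, query, k, eps=1e-8):
    """Inverse-distance-weighted average of the k nearest neighbors.

    Closer neighbors get proportionally more weight than farther ones;
    this is one possible weighting, not the only reasonable choice.
    """
    dists = np.sqrt(np.sum((train_X - query) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)  # closer -> larger weight
    return np.sum(weights * train_y[nearest]) / np.sum(weights)
```

Compared to a plain average, this pulls each prediction toward the values of the closest neighbors, which can help when `k` is large enough to include relatively distant points.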