CS66 Lab 6: SVMs and Cross-Validation

Due: Friday, April 5 at 11:59pm

Overview

The goals of this week's lab:

Become comfortable reading sklearn documentation
Practice using sklearn algorithm implementations
Use cross-validation to perform hyper-parameter tuning
Analyze and compare ML algorithms using statistical tests
Become more familiar with SVM notation and geometric intuition
Fill in gaps in the derivation of our SVM optimization problem

Note that there is a check in for this lab. By lab on April 3 you should have either completed the coding portion or the problem set (and should be started on the remaining part).

You must work with one lab partner for this lab. You may discuss high level ideas with any other group, but examining or sharing code is a violation of the department academic integrity policy.

Introduction

To get started, find your git repo for this lab assignment in the cs66-s19 github organization.

Clone your git repo with the starting point files into your labs directory:

$ cd cs66/labs
$ git clone [the ssh url to your your repo]

Then cd into your lab06-id1_id2 subdirectory. You will have or create the following files:

run_pipeline.py - your main program executable for tuning and testing your algorithms and outputting error rates
generate_curves.py - put code here to generate your learning curves
README.md - required response for data collection and grading
analysis.pdf - a report on the results of your experiments. This is the most important deliverable for this lab.
lab06_pset.pdf - answers to problem set questions.

Datasets and Command line arguments

Note: the lab computers currently have sklearn version 19, so that is the documentation you should be using for this lab. Throughout the lab, consult the documentation frequently to find the appropriate methods for this lab. Reading and using documentation is a very important skill that we will practice in Lab 6, Lab 7, and the final project. Make sure you really understand each line of code you're writing and each method you're using - there are fewer lines of code for this lab, but each line is doing a lot!

When importing sklearn modules, avoid importing with the * - this imports everything in the library (and these functions can be confused with user-defined functions). Instead, import functions and classes directly. For example:

from sklearn.datasets import fetch_mldata, load_breast_cancer

Imports should be sorted and grouped by type (for example, usually I group imports from python libraries vs. my own libraries).

Rather than parse and process data sets, you will use sklearn's pre-defined data sets. Details can be found here. At a minimum, your experiments will require using the MNIST and 20 Newsgroup datasets. Both are multi-class tasks (10 and 20 classes, respectively). Note that both of these are large and take time to run, so I recommend developing using the Wisconsin Breast Cancer dataset (example below).

For this lab, your run_pipeline.py file should take in one command line argument (using the optparse library), the dataset name. This does not refer to a file, but allows the program to import the correct dataset (options: cancer, mnist, and news). Here is an example:

if opts.dataset == "cancer":
    data = load_breast_cancer()

X = data['data']
y = data['target']
print(X.shape)
print(y.shape)

which outputs 569 examples with 30 features each:

(569, 30)
(569,)

The MNIST dataset is very large and takes a lot of time to run, so you can randomly select 1000 examples; you should also normalize the pixel values between 0 and 1 (instead of 0 and 255):

data = fetch_mldata('MNIST Original', data_home="/home/smathieson/public/cs66/sklearn-data/")
X = data['data']
y = data['target']
X,y = utils.shuffle(X,y) # shuffle the rows (utils is from sklearn)
X = X[:1000] # only keep 1000 examples
y = y[:1000]
X = X/255 # normalize the feature values

The newsgroup dataset in vector form (i.e., bag of words) is obtained using:

data = fetch_20newsgroups_vectorized(subset='all', data_home="/home/smathieson/public/cs66/sklearn-data/")

No normalization is required; I also suggest randomly sampling 1000 examples for this dataset as well. The data object also contains headers and target information which you should examine for understanding. For your analysis, it may be helpful to know the number of features, their types, and what classes are being predicted.

Coding Requirements

The coding portion is flexible - the goal is to be able to execute the experiments below. However, you should keep these requirements in mind:

Make your code reusable and modular. You shouldn't hardcode the datasets/algorithms you want. Make the user define the dataset on the command line, and abstract away the algorithm of choice.
Use command line arguments to help identify the dataset you want to load. So you should be able to run:

$ python3 run_pipeline.py -d cancer
$ python3 run_pipeline.py -d mnist
$ python3 run_pipeline.py -d news

You may have cases for each one (since they need to be treated differently). If the user does not enter a dataset, rely on optparse to print a helpful message.

Functions should also be modular. I recommend having a runTuneTest(learner, params, X, y) method that takes in the base learner (i.e., RandomForestClassifier or SVC), the hyper-parameters to tune as a dictionary and all of the data. This method will then handle creating train/tune/test sets and running the pipeline. For example, I could create a K-Nearest Neighbor classifier:

knn_clf = KNeighborsClassifier()
parameters = {"weights": ["uniform", "distance"], "n_neighbors": [1, 5, 11]}
test_results = runTuneTest(clf, parameters, X, y)

Note that the hyper-parameters match the API for KNeighborsClassifier. In the dictionary, the key is the name of the hyper-parameter and the value is a list of values to try.

generate_curves.py should follow similar constraints, though the names and styles of methods will be different.
Keep your code simple and easy to read. Let sklearn do most of the heavy lifting. Most of your time will be spent reading the API and finding the appropriate methods to call.

Experiment 1: Random Forest vs SVM Generalization Error

Using run_pipeline.py, you will run both Random Forests and SVMs and compare which does better in terms of estimated generalization error.

Coding Details

Your program should read in the dataset using the command line, as discussed above. You should specify your parameters and classifier and call runTuneTest (see the above example), which follows this sequence of steps:

Divide the data into training/test splits using StratifiedKFold. Stratified folds balance the number of examples from each class in each fold. Follow the example in the documentation to create a for-loop for each fold. Set the parameters to shuffle the data and use 5 folds. Set the random_state to a fixed integer value (e.g., 42) so the folds are consistent for both algorithms.
For each fold, tune the hyper-parameters using GridSearchCV, which is a wrapper for your base learning algorithms; it automates the search over multiple hyper-parameter combinations. Use the default value of 3-fold tuning (so we are essentially performing a cross-validation within a cross-validation).
After creating a GridSearchCV classifier, fit it using your training data. Report the training score using the score method.
Get the test-set accuracy by running the score method on the fold's test data.
Return a list of test accuracy scores for each fold.

In main(), you should print the test accuracies for all 5 folds for both classifiers (pair up the accuracies for each fold for ease of comparison). The classifiers/hyper-parameters are defined as follows:

Your Random Forest Classifier should fix the number of trees at 200, but tune the number of features between 1%, 10%, 50%, 100%, and square-root of the number of features in the dataset. (Optional: you may also investigate the depth of the tree and/or the best feature criteria, i.e. entropy or gini.)
Your Support Vector Machine should use the Gaussian kernel (rbf) and tune both the error term penalty (C) parameter (1, 10, 100, 1000) and gamma parameter (10^-4,10^-3,10^-2,10^-1,1). Part of your analysis will be to investigate these parameters and what their different values mean.

Code incrementally, and be sure to examine the results of your tuning (what were the best hyper-parameter settings? what were the scores across each parameter?) to ensure you have the pipeline correct. Since the analysis below is dependent on your results, I cannot provide sample output for this task. However, this is what is generated if I change my classifier to K-Nearest Neighbors using the parameters listed in the previous section (you can try to replicate this using a random_state of 42):

$ python3 run_pipeline.py -d cancer

-------------
KNN
-------------

Fold 1:
Best parameters: {'n_neighbors': 5, 'weights': 'distance'}
Training Score: 1.000

Fold 2:
Best parameters: {'n_neighbors': 5, 'weights': 'distance'}
Training Score: 1.000

Fold 3:
Best parameters: {'n_neighbors': 11, 'weights': 'distance'}
Training Score: 1.000

Fold 4:
Best parameters: {'n_neighbors': 5, 'weights': 'distance'}
Training Score: 1.000

Fold 5:
Best parameters: {'n_neighbors': 11, 'weights': 'uniform'}
Training Score: 0.934

Fold, Test Accuracy
0, 0.913
1, 0.939
2, 0.938
3, 0.920
4, 0.956

Analysis

In Part 1 of your writeup (must be a PDF), you will analyze your results. At a minimum, your submission should include the following type of analysis:

Provide quantitative results. Present the results visually both in summary and detail (i.e., a table). What is the average test accuracy of each method? Report the p-value of a paired t-test, which will test whether we can reject the null hypothesis of the methods having similar generalization error. (TIP: you can use the scipy library to do this). Can we reject the null hypothesis at p < 0.05?
Qualitatively assess the results. What can we conclude/infer about both methods and how did the methods compare to each other?
Did one set of hyper-parameters dominate or did they vary across the folds? Can you explain this using properties for each algorithm as discussed in class? What, if any, hyper-parameters were commonly chosen for each dataset?

You do not need to go into exhaustive detail, but your analysis should be written as if it were the results section of a scientific report/paper.

Experiment 2: Learning Curves

Using generate_curves.py, you will generate learning curves for the above two classifiers. We will vary one of the hyper-parameters and see how the train and test error accuracy changes.

Coding Requirements

Follow the same guidelines for loading the data as in Experiment 1 (you will use both MNIST and 20 Newsgroup).
For Random Forests, you will generate a learning curve for the number of trees (i.e., n_estimators). The parameter will take on all values from 1 to 201 spaced by 10 (i.e., 1, 11, 21, ..., 201). Keep all other parameters at their default values.
For Support Vector Machines, again use an RBF kernel with a fixed error term penalty parameter of 1.0 (the default). You will range over gamma values 10^-5,10^-4,10^-3,10^-2,10^-1,1,10 (HINT: use np.logspace() to easily generate this range).
To generate the data for the curve, you only need the validation_curve function, which returns the training and test set accuracies for each parameter and each fold. You will need to average the folds together; use 3-fold CV (the default).
Print (and, optionally, save to a csv file), the following for each parameter value: the parameter value, the average train accuracy across the 3 folds, and the average test accuracy across the three folds. Here is the result if I run KNeighborsClassifier with all odd k values from 1 to 21:

$ python3 generate_curves.py -d cancer
Neighbors,Train Accuracy,Test Accuracy
 1  1.0000  0.9051
 3  0.9561  0.9209
 5  0.9473  0.9227
 7  0.9438  0.9191
 9  0.9429  0.9315
11  0.9385  0.9315
13  0.9385  0.9280
15  0.9376  0.9245
17  0.9367  0.9174
19  0.9323  0.9174
21  0.9306  0.9157

Analysis

Analyze your results for experiment 2. At a minimum, you should have:

A learning curve for each experiment and each method. Each learning curve (4 in total) should have both the training and test accuracy, clearly labeled axes, and a legend. For the SVM curves, plot your x-axis in log space to evenly space the points.
Analyze each of your 4 learning curves. Your discussion should describe the results and related it to relevant course topics such as bias/variance as well as overfitting/underfitting.

SVM Problem Set

Complete the following SVM problems. You can either submit these on paper to the dropbox outside my office (Science Center 249) or put them on git as either a scanned image or use LaTeX. The latex template is in your repos if you want to use that for your solutions. There is also a comment about how to make your solutions in a different color.

Optional Extensions

For the SVM method, do some investigation into the support vectors (the SVC class in sklearn has some attributes that allow you to see the support vectors). How many support vectors are typical for these datasets? How can you determine the size of the margin? Does there seem to be a relationship between the size of the margin and the quality of the test results?
There are many other parameters we did not tune in these methods, and many values we did not consider. Expand your analysis. Some suggestions: entropy vs. Gini for Random Forests, max tree depth for Random Forests, and other kernels for SVMs.

Submitting your work

For the programming portion, be sure to commit your work often to prevent lost data. Only your final pushed solution will be graded. Only files in the main directory will be graded. Please double check all requirements; common errors include:

Did you fill out the README?
Did you remove all debugging/developer comments including TODOs and print statements?
Are all functions and non-trivial portions of code commented and easy to follow?
Recheck your code on a lab machine to make sure it runs. Students often have whitespace errors are doing a last past of adding/removing comments.

Acknowledgements: modified from materials by Ameet Soni