Read the entire syllabus on the course webpage

This lab should serve as a review of some CS260 topics, including:

- Working with real-world data
- Binary classification
- Logistic regression
- Naive Bayes
- Evaluation metrics including: accuracy, confusion matrices, and ROC curves
- Visualizing results
- `sklearn` built-in classes and methods

This first part of this lab (a short review about object-oriented programming and custom data structures) will be done in randomly assigned **pairs**, but the remainder of the lab should be done individually. You may discuss concepts with fellow classmates, but you may not share or view code.

Your starting files (and submissions) will be handled using GitHub Classroom.

The recommended editor for this course is VS Code, which is installed on the lab machines. You are welcome to use a different editor, as long as we can run your code from the terminal using the commands specified below.

In lab (on Tuesday), find your randomly assigned partner. If you are on the waitlist, find a random partner. For this part you should be on *one* machine. The person whose first name comes first alphabetically should be the “driver” (at the keyboard) and the other person should be the “navigator”. Throughout this short introduction, make sure you both understand everything!

Open the file `data/colleges.txt`, which lists colleges along with recent enrollment numbers. The goal is to read this file and extract the information for just the Tri-Co colleges (Bryn Mawr, Haverford, and Swarthmore). `main` has been provided.

- Create a class `Consortium` whose constructor takes two parameters (in addition to `self`): a filename and a list of colleges to extract. The exact details of the constructor are up to you.
- Create a `__str__` method that will produce a printout of the colleges (and their enrollments), similar to the one below.
- Create an `enrollment` method that will compute and print the total enrollment across the consortium.

Overall, when I run `python3 pair_exercise.py`, I should obtain the following printout:

```
Consortium:
BrynMawr: 1709
Haverford: 1421
Swarthmore: 1620
total enrollment: 4750
```

Note that nothing about your class should be specific to the Tri-Co.
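One possible shape for the class is sketched below. It assumes each line of the file holds a college name followed by an enrollment count, separated by whitespace; the actual format of `data/colleges.txt` may differ, so adapt the parsing accordingly.

```
class Consortium:
    """A set of colleges (read from a file) and their enrollments."""

    def __init__(self, filename, colleges):
        # Assumed file format: one college per line, "Name enrollment";
        # keep only the requested colleges
        self.enrollments = {}
        with open(filename) as infile:
            for line in infile:
                tokens = line.split()
                if len(tokens) >= 2 and tokens[0] in colleges:
                    self.enrollments[tokens[0]] = int(tokens[1])

    def __str__(self):
        lines = ["Consortium:"]
        for name, count in self.enrollments.items():
            lines.append(name + ": " + str(count))
        return "\n".join(lines)

    def enrollment(self):
        """Compute and print the total enrollment across the consortium."""
        total = sum(self.enrollments.values())
        print("total enrollment:", total)
        return total
```

Note that the constructor takes the college list as a parameter, so nothing here is Tri-Co-specific.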

Add both your names to the documentation and email this file to the partner who doesn’t have it, so they can add/commit/push the code to their repo as well. (We will have formal partner work in this class where you both have access to the git repo, but not for this first lab.)

**Make sure to always use python3!**

The data for Lab 1 is a set of handwritten digits from zip codes written on hand-addressed letters. The dataset is separated into “train” and “test” sets, and is available here:

```
/home/smathieson/Public/cs360/zip/zip.train
/home/smathieson/Public/cs360/zip/zip.test
```

If you would like to work on your own machine, you can copy the data to a folder of your choice using `scp` (secure copy):

`scp smathieson@cook.cs.haverford.edu:/home/smathieson/Public/cs360/zip/zip.train ~/Desktop/zip.train`

Just make sure not to add/commit/push the data to your github repo! Also note that we do not have the bandwidth to debug issues on your own machine. Please use the lab machines if your machine is not working well for the assignments.

You can read about this dataset on the Elements of Statistical Learning (ESL) website: click the “Data” tab, then the “Info” link for the zip code dataset. It is similar to the MNIST dataset of handwritten digits.

Use the command `less` in the terminal to view the beginning of each file:

`less /home/smathieson/Public/cs360/zip/zip.train`

(Use `q` to exit `less`.) Both datasets have the same format: the first column is the “label” (here, an integer between 0 and 9, inclusive, that corresponds to the identity of a hand-written zip code digit), and the rest of each row is made up of gray-scale values corresponding to the image of this hand-written digit.

One useful technique is to load a dataset from a file into a `numpy` array. Here is an example:

```
import numpy as np
train_data = np.loadtxt("path/to/train/file")
```

In the file `lab01.py`, test this approach by printing `train_data`. Make sure the data format makes sense. You can also import the test data in this way. After you load the data, print the *shape*. The shape of a `numpy` array represents its dimensions. What is the shape of the training data? Of the testing data?

```
print(train_data.shape)
print(test_data.shape)
```

To make this more flexible, we will use command line arguments (similar to many labs in CS260). For this lab, `-r` will represent the training data, `-e` will represent the testing data, and `-d` will represent the digit we wish to classify (against all the other digits). For example:

`python3 lab01.py -r /home/smathieson/Public/cs360/zip/zip.train -e /home/smathieson/Public/cs360/zip/zip.test -d 5`

Add a helper function for command line arguments, following this template from CS260 (all arguments should be mandatory):

```
import optparse
import sys

def parse_args():
    """Parse command line arguments"""
    parser = optparse.OptionParser(description='run linear regression method')
    parser.add_option('-d', '--data_filename', type='string', help='path to' +\
                      ' CSV file of data')
    (opts, args) = parser.parse_args()
    mandatories = ['data_filename']
    for m in mandatories:
        if not opts.__dict__[m]:
            print('mandatory option ' + m + ' is missing\n')
            parser.print_help()
            sys.exit()
    return opts
```
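One way to adapt the template to this lab's three mandatory options is sketched below (the destination names `train_filename`, `test_filename`, and `digit` are illustrative; any names work as long as they match your code). Note the `is None` check in place of `not opts.__dict__[m]`, so that `-d 0` is not mistaken for a missing option:

```
import optparse
import sys

def parse_args():
    """Parse command line arguments for this lab (sketch)."""
    parser = optparse.OptionParser(description='classify handwritten digits')
    parser.add_option('-r', '--train_filename', type='string',
                      help='path to training data file')
    parser.add_option('-e', '--test_filename', type='string',
                      help='path to testing data file')
    parser.add_option('-d', '--digit', type='int',
                      help='digit to classify against all others')
    (opts, args) = parser.parse_args()

    mandatories = ['train_filename', 'test_filename', 'digit']
    for m in mandatories:
        # "is None" (not falsiness) so that a digit of 0 is accepted
        if opts.__dict__[m] is None:
            print('mandatory option ' + m + ' is missing\n')
            parser.print_help()
            sys.exit()
    return opts
```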

For both train and test data, it will be useful to have `X` (the features) separated from `y` (the labels). Devise a way to accomplish this using `numpy` array slicing.
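Since the label is stored in the first column, one possible slicing sketch (the function name is illustrative):

```
import numpy as np

def split_X_y(data):
    """Separate a loaded data array into features X and labels y.
    Assumes the label is in column 0 and the pixels fill the rest."""
    y = data[:, 0]    # first column: the digit label
    X = data[:, 1:]   # remaining columns: gray-scale pixel values
    return X, y
```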

Although this dataset has 10 classes, to start we’ll just consider two. We’ll get to multi-class problems later; for now, devise a way to convert the labels to binary based on the user’s input digit. For example, if that value is 5, then:

- the label for all non-5’s should be 0
- the label for all 5’s should be 1
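A boolean comparison in `numpy` makes this a one-liner (a sketch; the function name is illustrative):

```
import numpy as np

def binarize_labels(y, digit):
    """Convert digit labels to 0/1: 1 for the chosen digit, 0 otherwise."""
    return (y == digit).astype(int)
```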

Using the Logistic Regression documentation, fit a Logistic Regression model to the *training* data. You may need to increase the number of iterations for gradient descent.
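A minimal sketch of the fitting step, shown on synthetic stand-in data rather than the real zip-code arrays (the `max_iter` value is illustrative; tune it until the solver converges):

```
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real training arrays (replace with your data)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 4))
y_train = (X_train[:, 0] > 0).astype(int)

# Raise max_iter above the default (100): on high-dimensional image data
# the solver may otherwise stop before converging
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)
```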

Similarly, using the Naive Bayes documentation, fit a Naive Bayes model to the training data. Think carefully about the features when deciding which type of Naive Bayes model to use.
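Since the pixel features are continuous gray-scale values, the Gaussian variant is one reasonable choice (a sketch on synthetic stand-in data; verify this reasoning against the documentation):

```
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real training arrays (replace with your data)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 4))
y_train = (X_train[:, 0] > 0).astype(int)

# Continuous features point toward GaussianNB rather than the
# Multinomial/Bernoulli variants, which expect counts or binary features
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
```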

It is *required* to have a helper function that takes in a classifier (along with training data, test data, and optionally other arguments) and runs the model fitting and evaluation process. This is to avoid duplicated code. When thinking about how to design this helper function (or multiple helper functions), imagine you were training and comparing 10 different classifiers!
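One possible shape for such a helper, again on synthetic stand-in data (the function name and exact signature are up to you):

```
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def evaluate(clf, name, X_train, y_train, X_test, y_test):
    """Fit one classifier and report its test accuracy (sketch)."""
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print('%s accuracy: %.2f%%' % (name, 100 * acc))
    return y_pred, acc

# Synthetic stand-in data (replace with the real zip-code arrays)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)
X_train, y_train, X_test, y_test = X[:40], y[:40], X[40:], y[40:]

for name, clf in [('Naive Bayes', GaussianNB()),
                  ('Logistic Regression', LogisticRegression(max_iter=1000))]:
    evaluate(clf, name, X_train, y_train, X_test, y_test)
```

The same loop scales to comparing many classifiers: just add one `(name, clf)` pair per model.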

For each classifier, *predict* the labels of the test data. Then compute and print the accuracy, following the rounding format below:

```
Naive Bayes accuracy: XX.XX%
Logistic Regression accuracy: XX.XX%
```

You are welcome to use the accuracy score from `sklearn`.

Following the confusion matrix documentation, create and print a confusion matrix for each algorithm. To visualize these results, create a figure for each confusion matrix and save them as follows (you may need to create the `figs` folder):

```
figs/log_conf_mat.pdf
figs/nb_conf_mat.pdf
```

See the confusion matrix visualization documentation for more information. The confusion matrix must be *normalized* to sum to 1 across the rows (true labels), with axis labels and a title. Think about color schemes that follow good visualization principles from CS260.
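A sketch of the normalization and saving steps, using synthetic labels; the filename, title, and colormap here are placeholders (your real figures belong under `figs/`):

```
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render without a display (e.g. over ssh)
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Synthetic true/predicted labels standing in for real test results
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# normalize='true' makes each row (true label) sum to 1
cm = confusion_matrix(y_true, y_pred, normalize='true')
disp = ConfusionMatrixDisplay(cm)
disp.plot(cmap='Blues')  # a perceptually ordered sequential colormap
plt.title('Logistic Regression confusion matrix')  # illustrative title
plt.savefig('conf_mat_example.pdf')
```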

Finally, create a *single plot* with *two* ROC curves comparing Logistic Regression and Naive Bayes. See the CS260 ROC curve lab for a review of this topic.

You are welcome to use the `roc_curve` function from `sklearn` here, but the plotting should be done on your own with `matplotlib.pyplot`. Regardless of how you compute the false positive and true positive rates, you will likely need to use the method `predict_proba` within each classifier, which returns the prediction probabilities for each class.

Save your plot as shown below. You should have axis labels, a legend, and a title.

`figs/roc.pdf`
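A sketch of the two-curve plot on synthetic stand-in data (the output filename here is a placeholder for `figs/roc.pdf`):

```
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (replace with the real zip-code arrays)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)
X_train, y_train, X_test, y_test = X[:70], y[:70], X[70:], y[70:]

plt.figure()
for name, clf in [('Logistic Regression', LogisticRegression(max_iter=1000)),
                  ('Naive Bayes', GaussianNB())]:
    clf.fit(X_train, y_train)
    # predict_proba returns one column per class; column 1 is P(label = 1)
    probs = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, probs)
    plt.plot(fpr, tpr, label=name)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curves: Logistic Regression vs. Naive Bayes')
plt.legend()
plt.savefig('roc_example.pdf')
```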

Make sure to fill out the `README.md` with answers to all the **analysis questions** (listed below as well)! Be sure to commit your work often to prevent lost data. Only your final pushed solution will be graded.

For all 10 digits (0-9), list the accuracies you obtained for Logistic Regression and Naive Bayes.

Based on all your evaluation metrics (accuracy, confusion matrix, ROC curve) which method was better overall? Explain your reasoning.

*Credit: based on Chapter 3 of “Hands-On Machine Learning” by Aurélien Géron*

If you do any of the extensions, document them in your `README.md`.

- Extend these algorithms and your analysis to a multi-class setting (i.e. distinguish between 3 or more digits). Can we use a ROC curve in this case? A confusion matrix? Push any additional figures and describe your results in your `README.md`.
- Experiment with changing the hyper-parameters of both Logistic Regression and Naive Bayes. How much does this change your results? After informal hyper-parameter optimization, which algorithm performs better?