Lab 0: Setting up Python
in-class
http://statweb.stanford.edu/~tibs/ElemStatLearn/
Open the training data and testing data using a text editor. Both datasets have the same format: the first column is the "label" (here an integer between 0 and 9, inclusive, that corresponds to the identity of a hand-written zip code digit), and the rest of each row is made up of gray-scale values corresponding to the image of this hand-written digit.
The next step is to make sure you have the right software installed. We'll be using Python, along with the packages numpy, scipy, matplotlib, and sklearn (the import name for scikit-learn). The easiest way to get these all at once is to use the distribution below:
To make sure this worked, after Canopy opens, type:
import numpy
import scipy
import matplotlib
at the prompt and make sure there are no errors. For the last package,
we need to do something different. In Canopy, go to the Tools tab and
select Package Manager. Then select the Available tab at the left, and
choose "scikit_learn". Make sure this worked by typing:
import sklearn
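If you would rather check all four packages in one go, the imports above can be scripted. This is only a minimal sketch and not part of the original lab: the helper name check_packages is made up for illustration, and it assumes nothing beyond the standard library's importlib.

```python
import importlib


def check_packages(names):
    """Try to import each package; return {name: version string or None}."""
    results = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            results[name] = None  # package is not installed
        else:
            # Many packages expose __version__; fall back to "unknown" if absent.
            results[name] = getattr(module, "__version__", "unknown")
    return results


if __name__ == "__main__":
    report = check_packages(["numpy", "scipy", "matplotlib", "sklearn"])
    for name, version in report.items():
        print(name, "->", version if version is not None else "NOT INSTALLED")
```

Any package reported as NOT INSTALLED needs to be added through the Package Manager before continuing.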
Now we'll load the data in Python. Open a new file in Canopy and first
import the necessary libraries:
import numpy as np
Then we'll load the data using:
train_data = np.loadtxt("path/to/train/file")
This will create a numpy array from the data (very convenient!). Test
this by printing train_data and checking that the data format makes
sense. You can import the test data in the same way.
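To see what "makes sense" might look like, here is a small sketch that loads a made-up three-row sample instead of the real file. The sample values and the variable names (fake_file, labels, pixels) are assumptions for illustration only; the real files have many more rows and columns, but the layout (label first, gray-scale values after) follows the description above.

```python
import io

import numpy as np

# Made-up stand-in for the data file: each row is a label followed by
# 4 gray-scale values (the real rows are much longer).
fake_file = io.StringIO(
    "2 -1.0 0.5 0.25 -0.5\n"
    "7 0.0 1.0 -1.0 0.75\n"
    "3 -0.25 -1.0 1.0 0.0\n"
)

data = np.loadtxt(fake_file)      # whitespace-separated numbers -> 2-D float array
labels = data[:, 0].astype(int)   # first column holds the digit labels
pixels = data[:, 1:]              # remaining columns hold the gray-scale values

print(data.shape)    # (number of rows, 1 + number of gray-scale values)
print(labels)
print(pixels.shape)
```

With the real data, replace fake_file with the path to the train or test file. Checking the shape and the first column is usually quicker than reading the full printout.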
To start, we'll consider only two classes, even though this dataset has 10. We'll get to problems with more classes later, but for now, retain only the rows that have label 2 or 3, using a for loop. Do this for both the train and test data to create new numpy arrays (you can build a list first and then convert it).
One important note: relabel the 2's to 0's and the 3's to 1's, since this will work better with the methods later on.
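One possible shape for this step is sketched below on a tiny made-up array. The function name keep_twos_and_threes and the toy values are hypothetical, and this is just one way to write the loop, not the required solution.

```python
import numpy as np


def keep_twos_and_threes(data):
    """Keep rows labeled 2 or 3, relabeling 2 -> 0 and 3 -> 1."""
    new_label = {2: 0, 3: 1}
    kept_rows = []                      # build a plain list first
    for row in data:
        label = int(row[0])             # first column is the label
        if label in new_label:
            new_row = row.copy()        # leave the original array untouched
            new_row[0] = new_label[label]
            kept_rows.append(new_row)
    return np.array(kept_rows)          # convert the list to a numpy array


# Toy data: label followed by two made-up gray-scale values per row.
toy = np.array([
    [2.0, 0.1, 0.2],
    [5.0, 0.3, 0.4],
    [3.0, 0.5, 0.6],
    [2.0, 0.7, 0.8],
])
filtered = keep_twos_and_threes(toy)
print(filtered)
```

The same function can be applied to both the train and test arrays. Copying each row before relabeling keeps the original data intact in case you want to pick a different pair of digits later.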
Credit: based on Exercise 2.8 from "The Elements of Statistical Learning"