CSC 390: Topics in Artificial Intelligence

Lab 0: Setting up Python

in-class

The goal of this lab is to set up your work environment and get used to working with large datasets.

Step 1: Get the Data

First download the "zip code" data from the "Data" tab here:

http://statweb.stanford.edu/~tibs/ElemStatLearn/

Open the training data and testing data using a text editor. Both datasets have the same format: the first column is the "label" (here an integer between 0 and 9, inclusive, that corresponds to the identity of a hand-written zip code digit), and the rest of each row is made up of gray-scale values corresponding to the image of this hand-written digit.
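To make the row format concrete, here is a minimal sketch of how one row splits into a label and its gray-scale values. The row below is made up and shortened for illustration (real rows are much longer), and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical, shortened row in the format described above:
# the label comes first, followed by the gray-scale values.
row = np.array([3.0, -1.0, -0.5, 0.25, 1.0])

label = int(row[0])  # first column: which digit this row represents
pixels = row[1:]     # remaining columns: gray-scale values of the image

print("label:", label)
print("number of gray-scale values:", pixels.shape[0])
```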

Step 2: Get the Code

The next step is to make sure you have the right software installed. We'll be using Python, including the packages numpy, scipy, matplotlib, and sklearn. The easiest way to get these all at once is to use the distribution below:

Python from Enthought Canopy

To make sure this worked, after Canopy opens, type:

import numpy
import scipy
import matplotlib
at the prompt and make sure there are no errors.

For the last package, we need to do something different. In Canopy, go to the Tools tab and select Package Manager. Then select the Available tab on the left and choose "scikit_learn". Make sure this worked by typing:
import sklearn
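
If you would rather check all four packages in one go, the following is one possible sketch. The helper name check_packages is hypothetical (it is not part of Canopy or of any of these packages); it only uses the standard-library importlib module and reports a missing package instead of stopping at the first error.

```python
import importlib


def check_packages(names):
    """Map each package name to its version string, or None if it cannot be imported."""
    results = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            # Most scientific packages expose __version__; fall back if one does not.
            results[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            results[name] = None
    return results


status = check_packages(["numpy", "scipy", "matplotlib", "sklearn"])
for name, version in status.items():
    if version is None:
        print(name + ": MISSING")
    else:
        print(name + ": OK (version " + version + ")")
```

Any package reported as MISSING still needs to be installed, for example through the Package Manager as described above.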

Step 3: Load the Data

Now we'll load the data in Python. Open a new file in Canopy and first import the necessary library:

import numpy as np
Then we'll load the data using:
train_data = np.loadtxt("path/to/train/file")
This will create a numpy array from the data (very convenient!). Test this by printing train_data and checking that the format makes sense: each row should start with a label, followed by the gray-scale values. You can import the test data in the same way.
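
As a quick sanity check on a loaded array, you can look at its shape and separate the label column from the rest. The sketch below uses a small in-memory stand-in (the made-up fake_file) instead of the real file so that it runs anywhere; with the real data you would pass the file path to np.loadtxt as above.

```python
import io

import numpy as np

# Stand-in for a data file: three made-up rows, label first,
# then four gray-scale values (the real files have many more columns).
fake_file = io.StringIO(
    "2 -1.0 -0.5 0.5 1.0\n"
    "3 0.0 0.25 -0.75 1.0\n"
    "7 1.0 1.0 -1.0 -1.0\n"
)

# np.loadtxt accepts an open file-like object as well as a path string.
data = np.loadtxt(fake_file)

print("shape (rows, columns):", data.shape)

labels = data[:, 0]     # first column of every row
features = data[:, 1:]  # everything after the first column

print("distinct labels:", np.unique(labels))
print("feature array shape:", features.shape)
```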

Step 4: Filter the Data

To start, we'll consider only two classes, although this dataset has 10. We'll get to problems with more classes later; for now, use a for loop to retain only the rows whose label is 2 or 3. Do this for both the train and test data to create new numpy arrays (you can build a list first and then convert it).

One important note: relabel the 2's to 0's and the 3's to 1's, since this will work better with the methods later on.
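
One possible way to structure this step is sketched below on a tiny made-up array. The function name filter_and_relabel and its keep parameter are hypothetical, and this is only one of several reasonable approaches; try your own version first.

```python
import numpy as np


def filter_and_relabel(data, keep=(2, 3)):
    """Keep rows whose label (column 0) is in `keep`.

    The first label in `keep` becomes 0 and the second becomes 1.
    """
    rows = []  # build a plain list first, convert to an array at the end
    for row in data:
        label = int(row[0])
        if label in keep:
            new_row = row.copy()            # leave the original array untouched
            new_row[0] = keep.index(label)  # 2 -> 0, 3 -> 1 with the default keep
            rows.append(new_row)
    return np.array(rows)


# Made-up data: label first, then two gray-scale values per row.
toy = np.array([
    [2.0, 0.1, 0.2],
    [3.0, 0.3, 0.4],
    [7.0, 0.5, 0.6],
    [2.0, 0.7, 0.8],
])

filtered = filter_and_relabel(toy)
print(filtered)
```

Calling the same function on both the train and test arrays keeps the relabeling consistent between them.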

Credit: based on Exercise 2.8 from "The Elements of Statistical Learning"