CS260 Lab 3: Gradient Descent

Due: Tuesday, September 21 at 11:59pm

Overview

Goals:

To become more familiar with vector and matrix operations in code
To investigate two different approaches to fitting regression models
To interpret the resulting linear models

Credit: Based on materials created by Allison Gong and Jessica Wu.

`numpy` suggestions

Here are a few suggestions for working through this lab:

If you are seeing many errors at runtime, inspect your matrix operations to make sure that you are adding and multiplying matrices of compatible dimensions. Printing the dimensions of variables with the arr.shape command will help you debug.
When working with numpy arrays, note that numpy interprets the * operator as element-wise multiplication. This is a common source of size incompatibility errors. If you want matrix multiplication, you need to use the np.matmul function. For example, A*B does element-wise multiplication while np.matmul(A,B) does a matrix multiply.
Be careful when handling numpy vectors (rank-1 arrays): the vector shapes (1,n), (n,1), and (n,) are all different things. For simplicity in this lab (unless otherwise indicated in the code), both column and row vectors are rank-1 arrays of shape (n,), not rank-2 arrays of shape (1,n) or (n,1).

Useful numpy functions:

np.array([...]): takes in a list (or list-of-lists, etc) and turns it into a numpy array. Note that after this step, you have to use numpy operations to work with the resulting object.
np.dot(u,v): compute the dot project of two vectors u and v with the same length. Note that this function also works as matrix multiply, but I recommend the following function for clarity.
np.matmul(A,B): multiply the matrices A and B. The inner dimensions must match.
np.ones((p,q)): creates a 2D array of 1’s with the given shape. Similarly, you can use np.zeros((p,q)) to create an array of 0’s. If you include only one dimension (i.e. np.zeros(p)), this will yield a vector.
np.concatenate((A, B), axis=0): concatenate two arrays along the given axis. Here it would be along the rows (axis 0), so A would be “on top”, and B below. In this case A and B do not have to have the same number of rows, but they do have to have the same number of columns. If we concatenate along axis 1 then A would be on the “left” and B would be on the right (and visa versa about the matching dimensions).
np.reshape(A, (p,q)): will reshape array A to have dimensions (p,q) (or whatever dimensions you need, provided the values of A can actually be reshaped in this way).
np.linalg.pinv(A): returns the pseudo-inverse of matrix A (we will use this just in case we have issues with a singular matrix).

Part 1: Reading in the data

Accept your Lab 3 repository on Github classroom. You should have the following folders/files:

linear_regression.py: this is where your code will go
data/: folder with three different datasets

Two of these datasets we have seen before. The USA_Housing dataset is from here and has several features that potentially have an impact on housing prices. In our code we will use the cleaned version of this dataset, but the original is included so you can see the names of the features.

For all these datasets, the features are the first columns and the output (response variable) is the last column.

In this lab we will be using command line arguments. This will allow us to specify the dataset on the command line, as opposed to hard-coding it within the program. As a result, we can change the dataset easily.

Here is example code for command line arguments in Python. Here we have one mandatory argument data_filename.

def parse_args():
    """Parse command line arguments"""
    parser = optparse.OptionParser(description='run linear regression method')

    parser.add_option('-d', '--data_filename', type='string', help='path to' +\
        ' CSV file of data')

    (opts, args) = parser.parse_args()

    mandatories = ['data_filename']
    for m in mandatories:
        if not opts.__dict__[m]:
            print('mandatory option ' + m + ' is missing\n')
            parser.print_help()
            sys.exit()

    return opts

Then in main, we can use this function like this:

opts = parse_args()
filename = opts.data_filename

filename is now a string that we can use later on (i.e. to open the file). On the command line, we can now write:

$ python3 linear_regression.py -d data/sea_ice_data.csv

Double check that this all works correctly on your system, and read the data into a numpy array. Separate X and y into separate variables using numpy array slicing (no loops!)

Part 2: Analytic solution to linear regression

In linear regression, our objective is to minimize the cost function:

Where our linear model is a dot product of the weights and the features of an example (with a “fake” 1 added to the front of each x).

In class, we learned that the closed-form solution to linear regression is

First we will start out by implementing this solution in the function:

def fit(X, y):

This should return the optimal weights w (i.e. the model parameters). You may want to add a helper function that adds the column of ones to the X matrix (again using no loops).

Test your analytic method on the sea_ice data and the regression_train data. In your README.md, comment on how close your results are (for w) to the provided models from Lab 2. Make sure your code clearly prints out the weights found by the analytic solution (but no specific format is needed).

NOTE: there should be no for loops in your code! See below for two exceptions in gradient descent.

Part 3: Gradient descent solution to linear regression

In this part of the lab we will work with the USA_Housing dataset, as the sea_ice dataset is a bit too small to work well with stochastic gradient descent. Use the steps below as a guide.

In this dataset, the magnitudes of the features are very large, and some of the features are on different scales. This makes gradient descent difficult, since it requires all the weights to converge at roughly the same time. To work around this, we will normalize the data by subtracting off the mean and dividing by the standard deviation.

First compute the mean and standard deviation for the features (across all examples on the rows, i.e. axis=0).

X_mean = X.mean(axis=0)
X_std = X.std(axis=0)

Then subtract off the mean and divide by the standard deviation to get normalized X values that we will use throughout the rest of the code. Then do the same for the y values.

Implement the functions predict (which should return a vector of predicted y values given a matrix X of features and a vector of weights w) and cost (which should return the value of the cost/loss function J discussed in class). Use your cost function to compute the cost associated with the weights from the analytic solution. This should be roughly the minimum of the cost function J, and will serve as a goal/baseline when implementing gradient descent. (Hint: this value should be around 205.)
Now we will implement stochastic gradient descent in the function:

def fit_SGD(X, y, alpha=0.01, eps=1e-10, tmax=10000):

Here alpha is the step size, eps is the tolerance for how much J should change at convergence, and tmax is the maximum number of iterations to run gradient descent. These are the default values, but throughout you’ll need to experiment with these values to ensure efficient convergence. tmax is more of a stop-gap if the method is not converging - it should ideally not be used as a way to stop the algorithm.

During each iteration of SGD, we go through each x_i in turn and update the weights according to the gradient with respect to this x-value.

The goal of this function is to return the final weights after the cost function has converged. To check for convergence, after each iteration of SGD, compute the cost function and compare it to the cost function from the previous iteration. If the difference is less than epsilon (eps), then break out of the loop and return the current weights. You should be able to achieve a cost function value similar to that of the analytic solution.

NOTE: it is okay here to have one for loop over the SGD iterations and one for loop over the examples. Also note that you do not need to randomize the examples for this lab, as would be typical in SGD.

At the end of this step, when run your code should clearly print out:

the analytic cost
the analytic weights
the alpha, eps, and tmax values you used for SGD
the SGD cost
the SGD weights
the number of iterations it took for SGD descent to converge

Make sure in your printout that each value is clearly labeled.

Finally, as a debugging and visualization tool, create a plot of the cost function J as the number of iterations increases (so the SGD iteration should be on the x-axis and the cost should be on the y-axis). If you see that the cost is increasing, you may need to change your alpha value or debug your algorithm. Save your final plot as:

figures/cost_J.pdf

Part 4: Interpretation

In your README.md, include the optimal weights from both your analytic solution and your SGD solution. Comment on any differences between the two models.
For each model, what was the most important feature? Which features were essentially not important? Do these results make sense to you? Does it matter that we normalized the features first? (include your answers in your README.md)

Final checks:

Make sure you’ve followed proper Python style (see Lab 1 for more details).
Double check your repository online to make sure all figures and code are pushed correctly.
Make sure your README.md is completely filled out. Points will be lost for not filling out the last section.

Optional extension: polynomial regression

In Lab 2 we saw not only linear models, but polynomial models with degrees up to 10. For datasets with a single feature (like sea_ice and regression_train) the machinery of multiple linear regression can actually be used solve polynomial regression.

The idea is to create a matrix X where column k represents x to the power k. Here is what this might look like for a degree d polynomial:

After creating this matrix, we can now run multiple linear regression as usual to find the model, which consists of the coefficients of the polynomial (as opposed to the coefficients of each feature). Try this out and see if you can come up with the polynomial models provided in Lab 2.