CS360 Lab 5: Advanced Logistic Regression and Fairness Regularization

Due: Thursday, February 29 at 11:59pm


Overview

The goals of this week's lab:

  1. Implement logistic regression trained with stochastic gradient descent
  2. Add a fairness regularizer to the cost and the gradient
  3. Evaluate the impact of regularization using per-sex confusion matrices

In this lab we will be analyzing the “Census Income” dataset, where the goal is to predict income from census attributes. We will think about this dataset in a slightly different context, where we are trying to ensure fairness based on sex. Race is also included in the dataset, but we will not consider this feature for now. Using our notation from class:

A = 1 (female, protected class)
A = 0 (male, unprotected class)
Y = 1 (income >= 50k)
Y = 0 (income < 50k)

This lab may optionally be done in pairs (or you can work individually).



Getting Started

Find your git repo for this lab assignment in the lab05 directory. You should have the following files:



Usage and I/O

Usage

Your programs should take in the same command-line arguments as Lab 4 (feel free to reuse the argument parsing code), plus a parameter for the learning rate alpha and a parameter for the maximum number of SGD iterations t. For example:

python3 run_LR_fairness_regularizers.py -r data/adult/adult.data.cleaned.csv -e data/adult/adult.test.cleaned.csv -a <learning rate> -t <max iter>
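
For reference, a minimal parsing sketch using argparse. Only the short flags come from the usage line above; the long option names, help strings, and description are illustrative assumptions:

import argparse

def parse_args():
    """Parse the flags shown in the usage line above."""
    parser = argparse.ArgumentParser(
        description="Logistic regression with fairness regularization")
    parser.add_argument("-r", "--train", required=True,
                        help="path to the cleaned training CSV")
    parser.add_argument("-e", "--test", required=True,
                        help="path to the cleaned test CSV")
    parser.add_argument("-a", "--alpha", type=float, required=True,
                        help="learning rate for SGD")
    parser.add_argument("-t", "--max_iter", type=int, required=True,
                        help="maximum number of SGD iterations")
    return parser.parse_args()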

Program Inputs

To simplify preprocessing, you may assume the following:

FEATURE_NAMES = ["age", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
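
A loading sketch using the FEATURE_NAMES constant above, assuming pandas and that the cleaned CSVs have a header row. The label column name ("income"), its values, and the "sex" column values are assumptions based on the standard Census Income files; adjust to match your cleaned data:

import pandas as pd

def load_data(path):
    """Read a cleaned census CSV; return features X, labels y, protected attribute a."""
    df = pd.read_csv(path)
    X = df[FEATURE_NAMES].to_numpy(dtype=float)
    y = (df["income"] == ">50K").astype(int).to_numpy()  # assumed label column/values
    a = (df["sex"] == "Female").astype(int).to_numpy()   # A = 1 for the protected class
    return X, y, a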

Program Outputs


Logistic Regression

You will implement the logistic regression task discussed in class for binary classification. This is similar to Lab 7 from CS260; all versions of CS260 did quite a bit with logistic regression, so feel free to reuse any of your code from CS260.

Step 1: Training without regularization (cost function)

To learn the weights, we will apply stochastic gradient descent as discussed in class until the cost function changes very little from one iteration to the next. As a reminder, our cost function is the negative log of the likelihood function:

J(w) = - sum_{i=1}^{n} [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ],   where p_i = sigma(w · x_i) = 1 / (1 + exp(-w · x_i))
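
A minimal NumPy sketch of this cost; the function and variable names are illustrative, with X assumed to be an (n, d) feature matrix and y a 0/1 label vector:

import numpy as np

def sigmoid(z):
    """Logistic function; clipping keeps exp() from overflowing."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def cost(X, y, w):
    """Negative log-likelihood of the logistic model with weights w."""
    p = sigmoid(X @ w)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))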

Step 2: Stochastic Gradient Descent

Our goal is to minimize the cost using SGD. Pseudocode:

initialize weights to 0's
while not converged:
    shuffle the training examples
    for each training example (xi, yi):
        calculate the derivative of the cost with respect to the weights, using (xi, yi)
        weights = weights - alpha * derivative
    compute and store the current cost

The per-example SGD update for each weight w_j is:

w_j <- w_j - alpha * (sigma(w · x_i) - y_i) * x_ij
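
Putting this together, a sketch of the training loop, reusing the sigmoid and cost helpers sketched above (the convergence tolerance tol is an illustrative choice, not a requirement):

import numpy as np

def sgd_train(X, y, alpha, max_iter, tol=1e-4):
    """Run SGD until the cost stabilizes or max_iter epochs pass."""
    n, d = X.shape
    w = np.zeros(d)                            # initialize weights to 0's
    costs = [cost(X, y, w)]
    rng = np.random.default_rng()
    for _ in range(max_iter):
        for i in rng.permutation(n):           # shuffle the training examples
            p = sigmoid(X[i] @ w)
            w = w - alpha * (p - y[i]) * X[i]  # per-example gradient step
        costs.append(cost(X, y, w))            # compute and store the current cost
        if abs(costs[-1] - costs[-2]) < tol:   # cost barely changed: converged
            break
    return w, costs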


The hyper-parameter alpha (learning rate) should be passed as a parameter to your SGD function and used in training. A few notes for the above:

Step 3: adding in regularization

Now we will add in fairness regularization, following our discussion in class. For the cost and gradient descent functions, this will require two additional parameters:

Fairness regularization parameters:

Make sure to change both the cost and the gradient in your algorithm; one possible form of the penalty is sketched below.
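
The exact regularizer from class is not reproduced in this handout, so the sketch below is only one illustration: a demographic-parity-style penalty on the gap in mean predicted probability between the two groups, weighted by an assumed strength parameter lam. The penalty form, its gradient, and the parameter names are assumptions, not the required design; it reuses the sigmoid helper from above.

import numpy as np

def fairness_penalty(X, a, w, lam):
    """Illustrative penalty: squared gap in mean predicted probability across groups."""
    p = sigmoid(X @ w)
    gap = p[a == 1].mean() - p[a == 0].mean()
    return lam * gap ** 2

def fairness_gradient(X, a, w, lam):
    """Gradient of the penalty above with respect to w."""
    p = sigmoid(X @ w)
    gap = p[a == 1].mean() - p[a == 0].mean()
    s = p * (1 - p)  # derivative of the sigmoid at each example
    d1 = (s[a == 1][:, None] * X[a == 1]).mean(axis=0)
    d0 = (s[a == 0][:, None] * X[a == 0]).mean(axis=0)
    return 2 * lam * gap * (d1 - d0)

Whatever regularizer your class notes specify, the pattern is the same: add its value to the cost and its gradient to each weight update.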

Step 4: confusion matrices

To evaluate the impact of adding fairness regularization, we will train 3 models:

Based on these three models, create 6 confusion matrices (one per model, with the results separated by sex). Save these as PDF files in a figs folder.
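
A sketch of building and saving one of the six matrices, assuming scikit-learn and matplotlib are available (the helper name and file naming are illustrative):

import os
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def save_confusion_matrix(y_true, y_pred, title, filename):
    """Plot a 2x2 confusion matrix and save it under figs/ as a PDF."""
    os.makedirs("figs", exist_ok=True)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    disp = ConfusionMatrixDisplay(cm, display_labels=["<50k", ">=50k"])
    disp.plot()
    plt.title(title)
    plt.savefig(os.path.join("figs", filename))
    plt.close()

# e.g., the unregularized model restricted to the protected group:
# save_confusion_matrix(y_test[a_test == 1], preds[a_test == 1],
#                       "No regularization, A = 1", "noreg_A1.pdf")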

In addition, make sure to print (for each model):


Analysis

(Answer these questions in your README.md)

  1. What do you notice about the 6 resulting confusion matrices, and how do they compare?

  2. Which model version would you choose if your goal was to identify high earners for an IRS tax audit? Why?

  3. What about if your goal was to determine what someone should be paid? Why?