CS66 Lab 2: Decision Trees

Due: Tuesday, February 12 Thursday, February 14 at 11:59pm


Overview

The goals of this week’s lab:

You must work with one lab partner for this lab. You may discuss high level ideas with any other group, but examining or sharing code is a violation of the department academic integrity policy.

Note that I will not be in lab January 30, but I still strongly recommend that you begin in lab with your partner. During lab on Feb 6, I will go around to each group for an intermediate “check-off”. At that point you should have at least designed your data structures and implemented a top-down design for your entire program (and have both partners be able to explain it to me!)


Getting Started

Find your git repo for this lab assignment in the cs66-s19 github organization.

Clone your git repo with the starting point files into your labs directory:

$ cd cs66/labs
$ git clone [the ssh url to your your repo]

Then cd into your lab02-id1_id2 subdirectory. You will have the following files:


Part 1: Implement a Decision Tree

You will implement an ID3-like decision tree in this part of the lab. The main features of your implementation include:

Usage

Your program should take in 1 or 2 command-line arguments in this specific order:

  1. The name of the data set
  2. A parameter setting for the maximum depth of the learned tree (optional)

For example:

python3 run_dtree.py heart 3

If the user specifies an incorrect number of arguments, or invalid values for the parameters print a usage statement and exit the program. You should error check all parameters to ensure that the files exist and that given parameters are integers greater than 0. E.g.,

$ python3 run_dtree.py
Error, incorrect number of arguments
Usage: python run_dtree.py dataset [depth]
  dataset - base name of input arff files in the input/ folder
  depth (optional) - maximum depth of tree, set as a positive integer value

Note that the dataset name (e.g., toy) will determine 4 file names:

You must ensure that the two input files actually exist. Please follow this format exactly; these are easy points for your grade and they significantly reduce grading issues.

Program Style Requirements

Program Inputs

Program Main Behavior

Your program should process inputs, parse and index training and test examples, and induce a decision tree. There are very few restrictions on how you do this, but be sure your tree fits the behavior from class:

Program output

When run, your program should output the following:

Accuracy: 0.70 (35 out of 50 correct)
[9 5]
Outlook=Sunny [2 3]
|   Humidity=High [0 3]: No
|   Humidity=Normal [2 0]: Yes
Outlook=Overcast [4 0]: Yes
Outlook=Rain [3 2]
|   Wind=Strong [0 2]: No
|   Wind=Weak [3 0]: Yes

Example runs

Heart disease data set:

Diabetes data set:


Part 2: Analysis

UPDATE: the plotting part is now optional (you should still run various depths for both datasets and include your observations in your README, but you don’t need to make a plot).

Once your lab is complete, generate two learning curve graphs (i.e., line graphs) with an accompanying analysis of the graphs. On the x-axis, show the depth of a tree. On the y-axis, show the accuracy of the tree on both the training set and test set. Run experiments at various depth levels including, at a minimum, depths of 1, 2, 4, 7, and 10. You should have one graph for the diabetes data set and one graph for the heart disease data set.

Describe what you can conclude from each graph. What happens to the training set accuracy and test set accuracy as the depth varies? Is this what you expected? Place your graphs and analysis in the README.md. Be sure that your graphs are of scientific quality - including captions, axis labels, distinguishable curves, and legible font sizes. You can include a figure in Markdown using:

![Figure title](figure.png)

Here is an example of creating a plot using matplotlib, but you are welcome to use a different program for this part.

# import plotting library
import matplotlib.pyplot as plt

# set up sequences of x and y values (make sure they are of the same length)
x = range(10)
y = [val**2 for val in x]

# plot the (x,y) values (considered as pairs), using blue (b) circles (o),
# connected by lines (-)
plt.plot(x,y,'bo-')

# title, axis labels, and legend
plt.title("My Plot")
plt.xlabel("x-axis label")
plt.ylabel("y-axis label")
plt.legend(["line 1"])

# I usually use "show" at first to see the plot, then save it later
plt.show()
#plt.savefig("my_plot.png", format='png')
Example Plot
Example Plot

Extensions (Optional)

Multi-class prediction

Both data sets I have provided have two class labels. Decision trees easily extend to multi-class prediction. Find a relevant data set (e.g., at Kaggle or the UC Irvine Machine Learning Repository) and evaluate your algorithm with multiple discrete values allowed for labels.

Min Leaf Size

Another approach to prevent overfitting is setting a minimum number of examples in a leaf. If a split causes results in children below the threshold, the recursion is stopped at the current node and a leaf is created with the plurality label. This is often more flexible than maximum depth - it allows variable depth branches while still trying to prevent overly detailed hypotheses that only fit to a few examples. Add minimum leaf size as an optional command line argument.

Learning curves for training set size

Machine learning algorithms are often sensitive to the amount of training data - some algorithms thrive when there is large amounts of data but struggle when data is sparse (e.g., deep neural networks) while others plateau in performance even if more data is available. Evaluate your decision tree on the given training data by randomly subsampling the data to create smaller training sets e.g., use 10% of training data. Choose at least 5 training set sizes and plot them on the x-axis with the y-axis describing the accuracy. You’ll probably want to run each training set size 3-5 times and average the results. Describe what you see when you compare training and test accuracy as the training set grows.


Submitting your work

For the programming portion, be sure to commit your work often to prevent lost data. Only your final pushed solution will be graded. Only files in the main directory will be graded. Please double check all requirements; common errors include:

Acknowledgements

Credit for this lab: Ameet Soni.

Both data sets were made available by the UC Irvine Machine Learning Repository. The heart disease data set is the result of a study by the Cleveland Clinic to study coronary heart disease. The patients were identified as having or not having heart disease and the resulting feature set is a subset of the original 76 factors that were studied. The diabetes data set is the result of a study of females at least 21 years old of Pima Indian heritage to understand risk factors of diabetes. Both data sets were converted to ARFF format by Mark Craven at the University of Wisconsin.