Homework 2: Clustering
Due: Wednesday, Sept 28, 11:59pm on Moodle
The goal of this assignment is to implement a hierarchical clustering algorithm called UPGMA, and use it on a real-world example. For this assignment, you are welcome to use pair-programming, as long as you work on the code only together, on one computer, switching the "driver" every 30 minutes. If you do pair-programming, make sure to do the extra part at the end of this assignment. If you prefer to work alone, this part is extra credit.
Download the starter code hw2.py. In this file is a skeleton for UPGMA. The first step is to implement
def minDist(distance):which should take in a distance matrix and return a tuple (i,j) of the non-diagonal cell with the minimum distance (i.e. we will merge the ith and jth clusters first).
The next step is to implement
def mergeClusters(clusters, distance):There is a step-by-step outline in the code, but you can do this however you like. The output should be a set of clusters that has one fewer cluster than the input, plus a square distance matrix that had one fewer row and column.
The last step of the algorithm is to iteratively call mergeClusters in a loop, until there are no more clusters to merge. At each iteration of the algorithm, print out the clusters and the distance matrix.
Finally, for the three examples in the code, use the output of your program to draw out a tree for each example, labeling all the branch lengths. You can take a picture of your work, scan it, or draw out the tree in a program like PowerPoint or ms paint.
Write a reflection (readme.txt) for this assignment. What concepts or tasks were the most challenging? How did this assignment go relative to last week? If you did pair-programming, what were the advantages/disadvantages?
If you did pair-programming, also complete the part below. If you worked on your own, you can do this for a small amount of extra credit.
Write another function to return the distance matrix that is induced (i.e. created) by UPGMA. In which cases is this matrix similar to the input matrix? In which cases is this matrix very different? Does this matrix say more about our data or our algorithm? Include your responses in your reflection.