Homework 7: Hidden Markov Models
Due: Tuesday, Nov 29, 11:59pm on Moodle
In this homework, we will practice using HMMs to infer hidden states for a text-based application. We will see how to use an HMM package in Python - this framework could be used later on for your own data analysis.
Unfortunately, the HMM package in sklearn has been deprecated and moved to its own project. However, we can still use this software. First go to the github page here:
hmmlearn. From the terminal, run the command described in the readme:
pip install -U --user hmmlearn
If this doesn't work, you can also download the code as a zip file and run.
python setup.py --help # see what commandline options are available
python setup.py install # install the package
The documentation for an HMM with multinomial (discrete) emissions can be found below. This is the style of HMM we have been discussing in class, and what we will use in this homework.
Download the dataset below, which contains 50,000 English words.
Open this file in a text editor. Each one of these words will be a separate (independent) input to our HMM. We will consider each letter to be an observation. It is not yet clear what the hidden state represents, but we will try to make sense of the results later on.
Create a Python program, and then open this file using the following style of code:
word_file = open(word_filename,'r') # the 'r' is for read (vs. write a file)
for line in word_file: # line is a string
# do something here
word_file.close()
To analyze this data with the Python HMM program, we must first create one long list of all the letters (mapped to integers 0-25) in all the words. We will just keep the letters, ignoring punctuation or whitespace. Create this list of integers, along with a list of the lengths of all the words (this tells the HMM when to start a new sequence, but all the sequences will share the same parameters). Here is an example:
# word list file
a
the
# after pre-processing we get this input to the HMM:
all_letters = [0,19,7,4] # a is 0, t is 19, etc
lengths = [1,3] # the first word has length 1, second length 3
Note that the sum of lengths should be equal to the length of all_letters.
Now we will fit an HMM to this dataset. First create an instance of a multinomial HMM:
model = hmm.MultinomialHMM(...., random_state = 12) # think about the arguments
Use k=2 components (hidden states) for this HMM. We will use the random seed 12 to initialize the transition, emission, and initial probabilities. (This is engineering the results slightly, but I wanted it to work out!) Then fit the model.
For the last part, print out the resulting transition, emission, and start probabilities. In a separate reflection document (readme.txt), include these three sets of parameters and any other printed output from your program.
Then in your code, for each letter, print out which was higher, state 0 or state 1. Group all the letters with state 0 and all the letters with state 1. What do you notice about these two groups? Provide an interpretation in your readme. Given this interpretation, what do the start probabilities mean? What about the transition probabilities?