Homework 7: Huffman Coding
Due: Wednesday, Mar. 30, 11:59pm
For this assignment, you will create a program that can read and write text expressed using a variable bitrate encoding. Such schemes can be used to compress files so that both storage and transmission consume fewer resources.
The canonical variable bitrate algorithm is called Huffman coding after its inventor, David Huffman, who as a student at MIT came up with the technique and proved its optimality in response to a homework assignment from his professor. You can read more about the algorithm in its Wikipedia article. Basically, more common characters are encoded using short bit strings, while rarer characters use much longer bit strings. The savings on the short, common characters more than makes up for the extra bits for the rarely occurring longer characters.
To avoid confusion, Huffman coding uses a prefix code scheme, meaning that no character's encoding forms the beginning sequence of any other. So for example, if the code 00 represents the letter e, then no other code may begin with 00 -- they must all use 01, 10, or 11 instead. This ensures that the encoded bit string has a unique interpretation. It also means that the encoding scheme can be conveniently represented using a binary tree, with symbols at the leaves, and the corresponding code for that symbol determined by the path to it from the root. By convention, left branches correspond to a 0 in the code, and right branches to a 1. So if the leaf node for the letter t is reached by going left-right-left from the root, then the code for t would be 010.
You will write three files and two separate programs for this assignment. One program will take a text file and encode it using the coding below. The second program will take an encoded file and decode it. (If you encode a file and then decode it again, you should get exactly the same file back.) Both programs will use a third file, HuffTree.java, that defines the coding tree and methods for using it. This third file should be made as general as possible -- imagine that you are going to use it for lots of different programs involving Huffman coding, and you want it to be useful in all cases.
For simplicity, your encoded files will record each bit as a character '0' or '1'. This makes the encoded files easy to read in a text editor. However, it does not decrease the file size since each bit is being represented by an 8-bit ASCII character! We will overlook the size issue in favor of convenience. If you really wanted to use a Huffman encoding to save space, you would have to manipulate individual bits and read and write the encoded data as a binary file, but that's not necessary for this assignment.
Shown here are the codes you should use for this assignment. Rather than building the coding tree from a particular input as described in the Wikipedia article, your program should build a tree that represents the encoding given here. (But see the optional extra credit assignment if you want to build a custom tree.) These details, since they are specific to this assignment, should not be a default hard-coded into the HuffTree constructor. Rather, you can make an extra method called sampleTree that creates a tree with these values.
Symbol | Code |
---|---|
e | 000 |
(space) | 110 |
s | 0010 |
h | 0011 |
i | 0100 |
n | 0110 |
o | 0111 |
a | 1001 |
t | 1010 |
l | 10000 |
d | 10111 |
r | 11111 |
p | 010100 |
, | 010101 |
y | 010110 |
g | 100010 |
f | 100011 |
w | 101101 |
m | 111000 |
c | 111001 |
(newline) | 111011 |
u | 111100 |
v | 1011000 |
. | 1110100 |
b | 1110101 |
" | 11110110 |
k | 11110111 |
- | 010111101 |
P | 010111111 |
A | 101100100 |
T | 101100101 |
' | 111101000 |
I | 111101010 |
j | 0101110001 |
z | 0101110010 |
q | 0101110011 |
W | 0101110111 |
S | 0101111000 |
R | 0101111001 |
? | 0101111100 |
M | 0101111101 |
B | 1011001101 |
N | 1011001110 |
x | 1011001111 |
! | 1111010010 |
H | 1111010111 |
V | 01011100000 |
; | 01011100001 |
K | 01011101000 |
Y | 01011101001 |
G | 01011101011 |
O | 10110011000 |
F | 11110100110 |
D | 11110100111 |
C | 11110101100 |
E | 11110101101 |
( | 010111010101 |
) | 010111011000 |
X | 010111011001 |
L | 010111011010 |
: | 101100110011 |
* | 0101110101000 |
J | 0101110101001 |
1 | 0101110110111 |
2 | 01011101101100 |
0 | 01011101101101 |
8 | 10110011001000 |
U | 10110011001011 |
Z | 101100110010011 |
7 | 1011001100100100 |
5 | 1011001100100101 |
6 | 1011001100101000 |
3 | 1011001100101001 |
/ | 10110011001010100 |
9 | 10110011001010101 |
Q | 10110011001010110 |
4 | 101100110010101111 |
[ | 10110011001010111011 |
# | 101100110010101110000 |
] | 101100110010101110001 |
% | 101100110010101110010 |
= | 101100110010101110011 |
@ | 101100110010101110100 |
$ | 101100110010101110101 |
This assignment involves a lot of work with strings. It is possible to do everything working just with the String class, but you may wish to check out StringBuilder, which offers a bit more efficiency.
I would suggest making HuffTree a subclass of BinaryTree from the lab, so you can use the existing methods.
HuffEncode and HuffDecode should each take in one command line argument: the path to a text file. To read in files, you can again use the Scanner class. Here is an example:
Scanner input = new Scanner(new FileReader(args[0]));Try creating a small text file with a few sentences to start.
As a reward for completing the decoding program, here are two secret files that you will only be able to view when you have completed the decoding program: one and two. To get the last one to work out, your program will have to handle newlines properly.
For extra credit, you can write and demonstrate a method that will build a Huffman coding tree automatically from the distribution of characters in some input file. Note that the exact set of characters in the file will not be known in advance. See this Letter frequency article for some starting values.