CSC 212: Programming with Data Structures

Homework 7: Huffman Coding

Due: Wednesday, Mar. 30, 11:59pm

Credit for this assignment: Nick Howe


Overview

For this assignment, you will create a program that can read and write text expressed using a variable bitrate encoding. Such schemes can be used to compress files so that both storage and transmission consume fewer resources.

The canonical variable bitrate algorithm is called Huffman coding after its inventor, David Huffman, who as a student at MIT came up with the technique and proved its optimality in response to a homework assignment from his professor. You can read more about the algorithm in its Wikipedia article. Basically, more common characters are encoded using short bit strings, while rarer characters use much longer bit strings. The savings on the short, common characters more than makes up for the extra bits for the rarely occurring longer characters.

To avoid confusion, Huffman coding uses a prefix code scheme, meaning that no character's encoding forms the beginning sequence of any other. So for example, if the code 00 represents the letter e, then no other code may begin with 00 -- they must all use 01, 10, or 11 instead. This ensures that the encoded bit string has a unique interpretation. It also means that the encoding scheme can be conveniently represented using a binary tree, with symbols at the leaves, and the corresponding code for that symbol determined by the path to it from the root. By convention, left branches correspond to a 0 in the code, and right branches to a 1. So if the leaf node for the letter t is reached by going left-right-left from the root, then the code for t would be 010.


To Do

You will write three files and two separate programs for this assignment. One program will take a text file and encode it using the coding below. The second program will take an encoded file and decode it. (If you encode a file and then decode it again, you should get exactly the same file back.) Both programs will use a third file, HuffTree.java, that defines the coding tree and methods for using it. This third file should be made as general as possible -- imagine that you are going to use it for lots of different programs involving Huffman coding, and you want it to be useful in all cases.

For simplicity, your encoded files will record each bit as a character '0' or '1'. This makes the encoded files easy to read in a text editor. However, it does not decrease the file size since each bit is being represented by an 8-bit ASCII character! We will overlook the size issue in favor of convenience. If you really wanted to use a Huffman encoding to save space, you would have to manipulate individual bits and read and write the encoded data as a binary file, but that's not necessary for this assignment.

Shown here are the codes you should use for this assignment. Rather than building the coding tree from a particular input as described in the Wikipedia article, your program should build a tree that represents the encoding given here. (But see the optional extra credit assignment if you want to build a custom tree.) These details, since they are specific to this assignment, should not be a default hard-coded into the HuffTree constructor. Rather, you can make an extra method called sampleTree that creates a tree with these values.

SymbolCode
e000
(space)110
s0010
h0011
i0100
n0110
o0111
a1001
t1010
l10000
d10111
r11111
p010100
,010101
y010110
g100010
f100011
w101101
m111000
c111001
(newline)111011
u111100
v1011000
.1110100
b1110101
"11110110
k11110111
-010111101
P010111111
A101100100
T101100101
'111101000
I111101010
j0101110001
z0101110010
q0101110011
W0101110111
S0101111000
R0101111001
?0101111100
M0101111101
B1011001101
N1011001110
x1011001111
!1111010010
H1111010111
V01011100000
;01011100001
K01011101000
Y01011101001
G01011101011
O10110011000
F11110100110
D11110100111
C11110101100
E11110101101
(010111010101
)010111011000
X010111011001
L010111011010
:101100110011
*0101110101000
J0101110101001
10101110110111
201011101101100
001011101101101
810110011001000
U10110011001011
Z101100110010011
71011001100100100
51011001100100101
61011001100101000
31011001100101001
/10110011001010100
910110011001010101
Q10110011001010110
4101100110010101111
[10110011001010111011
#101100110010101110000
]101100110010101110001
%101100110010101110010
=101100110010101110011
@101100110010101110100
$101100110010101110101

This assignment involves a lot of work with strings. It is possible to do everything working just with the String class, but you may wish to check out StringBuilder, which offers a bit more efficiency.

I would suggest making HuffTree a subclass of BinaryTree from the lab, so you can use the existing methods.

HuffEncode and HuffDecode should each take in one command line argument: the path to a text file. To read in files, you can again use the Scanner class. Here is an example:


Scanner input = new Scanner(new FileReader(args[0]));
Try creating a small text file with a few sentences to start.

Testing

As a reward for completing the decoding program, here are two secret files that you will only be able to view when you have completed the decoding program: one and two. To get the last one to work out, your program will have to handle newlines properly.

Extra Credit

For extra credit, you can write and demonstrate a method that will build a Huffman coding tree automatically from the distribution of characters in some input file. Note that the exact set of characters in the file will not be known in advance. See this Letter frequency article for some starting values.

To Submit

  • HuffTree.java
  • HuffEncode.java to encode files
  • HuffDecode.java to decode files
  • readme.txt
  • typescript, showing encoding and decoding of a short test file
  • decode2.txt, your decoding of mystery file #2 above
  • HuffBuild.java, if you've done the extra credit problem. (Also include a demonstration of this program in your typescript.)