Sara Sheehan

CSC 212: Programming with Data Structures

Homework 8: Hashing

Due: Wednesday, April 6, 11:59pm

Credit for this assignment: Nick Howe

For this assignment, you are welcome to use pair programming (with anyone in the class, even if you have worked together before). Also, for this assignment, everyone should post their readmes publically (non-anonymously) on Piazza, addressing the following questions about the class so far:

How has your view of programming changed throughout the class?
How has your view of data structures changed throughout the class?
Has your view of programming as an art changed? Are there any examples of why or why not?
Any other comments/questions.

The purpose of this assignment is to give you some experience with hashing. For this project, you will implement a spell-checking program that stores its dictionary in a hash set (implemented in the Java core by class HashSet). This is similar to a hash table in that the keys are hashed, but there is no associated value. Afterwards, you will compare the performance of your program when a TreeSet or ArrayList is substituted for the HashSet.

Specification

Your program should run in two modes. If invoked with command line arguments, it should check the spelling of each argument. If invoked without command line arguments, it should read from the standard input until end-of-file, extracting individual words using a Scanner and checking their spelling.

When checking the spelling of an individual word, the program should do nothing if the word is spelled correctly. If the word is misspelt, your program should offer suggestions of the correct spelling. It will do this by constructing all possible "near misses" and returning any that are in the dictionary, without duplicates. You should consider the following near misses:

Deletions: Delete one letter from the word. (n possibilities for a word of length n)
E.g.: catttle -> cattle
Insertions: Insert one letter into the word at any point. (26*(n+1) possibilities for a word of length n)
E.g.: catle -> cattle
Substitutions: Replace one character with another. (25*n possibilities for a word of length n)
E.g.: caxtle -> cattle
Transpositions: Swap two adjacent characters. (n-1 possibilities for a word of length n)
E.g.: cattel -> cattle
Splits: Divide the word into two legal words. (n-1 possibilities for a word of length n)
E.g.: cattell -> cat tell

While debugging, you will probably want to have your program print out all the near-miss candidates it creates for some short (two-letter) nonsense word. Inspect this list carefully to make sure all desired candidates are present and correct, and there are no duplicates.

Your program should ignore capitalization and punctuation. You can do this by converting all words to lower case before checking their spelling, and by making sure that the Scanner skips over punctuation marks. Finally, your program should print a message about misspelling a given word just once, even if the same misspelling appears several times. Do this in as efficient a way as you can.

Details

You should write two classes, SpellDictionary and SpellChecker. SpellDictionary will read in the dictionary file (described below) and handle checks for words and near misses. SpellChecker will contain the main() method and perhaps a few others to run the interface. This is a good opportunity to practice what you have learned about class design. Try to put methods in logical places based upon the roles of the classes. Think of SpellDictionary as a class that might be used by other programs -- a word processor, for example, or an email program -- so try to keep it general, while still offering useful and powerful methods. Your SpellChecker program is one specific application. If you design SpellDictionary well the your SpellChecker will be simple to write, but you don't want SpellDictionary catering too much to SpellChecker's specific implementation.

The Linux system dictionary is at /usr/share/dict/words. This is a text file containing 99171 words (at last count), one per line. Your program should read these in and store them in a HashSet structure. Building the dictionary may take some time, but once it is built the spell checking should go quite quickly.

Experiments

Once you have your program working, you will want to test it out. Here's a sonnet in Shakespeare's original spelling that would make a good candidate:

From fairest creatures we desire increase,
That thereby beauties Rose might neuer die,
But as the riper should by time decease,
His tender heire might beare his memory:
But thou contracted to thine owne bright eyes,
Feed'st thy lights flame with selfe substantiall fewell,
Making a famine where aboundance lies,
Thy selfe thy foe,to thy sweet selfe too cruell:
Thou that art now the worlds fresh ornament,
And only herauld to the gaudy spring,
Within thine owne bud buriest thy content,
And tender chorle makst wast in niggarding:
Pitty the world,or else this glutton be,
To eate the worlds due,by the graue and thee.

Your spell checker provides a good opportunity to investigate the practical differences between different lookup structures. Make two new versions of your program, the first using a TreeSet instead of a HashSet, and the second using an ArrayList. These should be relatively transparent substitutions, since the important method names are the same. Check the running time of all three variations, and note the results in your readme file. Keep in mind that it takes some time just to build the dictionary, so you really should run two tests for each program: one with an empty file so you can see how long just the dictionary-building takes, and another where you spell-check the dictionary itself so you can see what the performance is like on a large file. The time difference between the two runs is the amount spent spell-checking. Compare these numbers for the three different data structures. Do the results match your expectations?

To Submit

SpellDictionary.java
SpellChecker.java
typescript, showing the program compiling and running
Piazza post (I'll make a thread for this)
readme.txt, summarizing what you learned, including the results of your timing experiments.