CS 102: Lexicon

Lexicon

In this homework you develop a data structure to manage a lexicon, storing efficiently in memory all the words of a dictionary. All modern word processors spell checkers use such lexicon to suggest alternatives to misspelled words‡. We could reuse this data structure for another application that mimics a spell checker tool (talk to me if you are interested to do this).

The objectives of this homework are

to create a recursive tree structure,
to write efficient algorithms for accessing it,
to design and implement advanced functionality for a recursive tree structure and
to develop sophisticated and efficient traversal algorithms.

‡ This homework is based on an assignment originally created by Julie Zelenski, Justin Manus, and Mehran Sahami at Stanford University, modified by Jeannie Albrecht and Steve Freund at Williams College, and by James Teresco at Mount Holyoke College.

It is a violation to the Honor Code not to acknowledge others' work fully and openly. List all your collaborators and any person that helped you. You should cite any online resources you consulted to help you solve this project.
You may NOT directly copy any code or any text from anywhere.

Getting Started

First, carefully read this handout. You will have questions, so leave plenty of time to discuss this homework with the course staff.

You are encouraged to write a design document that describes your plan before you proceed with your implementation. Bring your design document when you come to ask questions: we will request to see it to help you better.

Copy the zip source code.
Unzip its content in a folder.
Link the bailey.jar to your program as you did before in DrJava or Eclipse.
Run the main method contained in LexiconTester.java.

The provided code contains a demo that uses bad implementations of a lexicon. Their code are included, using either a LexiconOrderedList object or a LexiconOrderedVector object. The demo purpose is to illustrate the power of class/interface design and to demonstrate the behaviour expected at the command line interface.

Your task is to implement a data structure, a LexiconTrie, that provides the same features more efficiently: running faster and using less memory. To run YOUR code in LexiconTester.java switch the commented lines as follow

//Lexicon lex = new LexiconOrderedList(); //Lexicon lex = new LexiconOrderedVector(); 
Lexicon lex = new LexiconTrie();

Inefficient Implementations

The provided program contains two functional, but inefficient, implementations of a lexicon, along with a tester program, LexiconTester.java, that for now uses the inefficient data structures.

Each bad solution implements the Lexicon interface. Your solution will also implement the Lexicon interface.

LexiconOrderedVector uses an OrderedVector.
LexiconOrderedList uses an OrderedList.

Note that most of the code for them is the same in both implementation and has been factored out into the LexiconOrderedStructure abstract class. Familiarize yourself with the provided code by looking at the definition of the interfaces and (abstract) classes. Don't study them in details. Large parts of the implementation of the LexiconTester class are fairly complex and interesting but unimportant in its detail for our purpose.

Focus your attention on the LexiconTester interaction with the Lexicon classes, i.e., run the main method of the LexiconTester class and experiment with the console interface to create, explore and manipulate a lexicon instance.

Word Lists

To initialize the lexicon structure automatically you are provided with three files that contain word lists.

Two are small collections of words chosen to test specific cases: small.txt and small2.txt.
The other, ospd2.txt, lists all 127,143 words in the second edition of the Official Scrabble Player's Dictionary.

Load the small examples to create a lexicon; modify the structure adding/removing words and displaying its content.

The Lexicon Trie

Your task is to implement the classes LexiconTrie.java and LexiconNode.java to manage a lexicon efficiently.
The LexiconTrie implements a special kind of tree called a trie. The term trie comes from re-trie-val. This term was coined by Edward Fredkin, who pronounces it /#tri#/ "tree". However, other authors pronounced it /#tra#/ "try", in an attempt to verbally distinguish it from "tree". The zip file contains a pdf describing formally a Trie: read it to understand the formal definition of a trie.

LexiconTrie.java implements the Lexicon interface provided in the starter code. Some methods are straightforward and intuitive, others are more difficult: the methods you are required to implement are the fundamental ones and are relatively simple; other methods are there for challenge. Do the latter to earn bonus points! For information about the usage and purpose of these methods, read the comments in Lexicon.java. For implementation details read below.

To make use of your LexiconTrie in the LexiconTester program, don't forget to change the call in LexiconTester to construct a LexiconTrie (instead of a LexiconOrderedList).

The Trie Structure

A trie is a letter-tree that efficiently stores strings (in our case, words). A node in a trie represents a letter. A path through the trie traces out a sequence of letters that represents a prefix or word in the lexicon.

Unlike a binary tree, each trie node has up to 26 child references (one for each letter of the alphabet). While a binary search tree eliminates half the words with a left or right turn at each step in a search, a trie eliminates 25/26 possible continuations as the search follows the child reference for the next letter, which narrows the search to just those words starting with that letter.

For example, from the root, any words that begin with n are found by following the reference to the n child node. From there, following o leads to just those words that begin with no, and so on recursively.

If two words have the same prefix, they share that initial part of their paths, which saves space because many words shared prefixes.

Each node has a boolean isWord flag which indicates that the path taken from the root to this node represents a word.
Here is a picture of a small trie:

The red outline of a node indicates its isWord flag is true. In effect, this trie contains the following words in order:

        a
        are
        as
        new
        no
        not
        zen

Strings such as ze or ar are not valid words for this trie because the path for those strings ends at a node where isWord is false. Any path not drawn is assumed not to exist, so strings such as cat or next are not valid because there is no such path in this trie.

Like other trees, a trie is a recursive data structure. All of the children of a given trie node are themselves smaller tries.

Design Discussion

For each node in the trie, you need a list of references to child nodes. In the sample trie drawn above, the root node has three children, one each for the letters a, n, and z.

One possibility for storing the references to children is a statically-sized 26-member array of references to nodes, where
array[0] is the child for a,
array[1] refers to b,
... and
array[25] refers to z.

When there is no child for a given letter, (as from b to m above) the array entry would be null. This arrangement makes it trivial to find the child for a given letter--you simply access the correct element in the array indexing by letter.

However, for most nodes within the trie, very few of the 26 references are needed, so using a largely null 26-element array may be too expensive. Better alternatives to keep child references exist. The choice of a space-efficient design is up to you, but you should explain your idea in your design document and justify your final choice in your readme.

Two things you may want to consider:

there are at most 26 children, so even an O(n) operation to find a particular child is no big deal, and
operations such as iteration require traversing the words in alphabetical order,
so keeping the list of children references sorted by letter is advantageous.

Implementation

LexiconNode

Begin implementing your trie by constructing a LexiconNode class. Note that it is efficient to keep your LexiconNodes in sorted order, so implementing the Comparable interface is important. This functionality is provided.

After implementing the LexiconNode class, you may define a constructor in LexiconTrie that creates an empty LexiconTrie to represent an empty word list.

Searching

To search for a word in the trie it is necessary to implement the containsWord method of LexiconTrie. To search for a prefix the method containsPrefix must be implemented.

Searching the trie is a matter of tracing out the path letter by letter. Let's consider three examples on the sample trie shown above.

Searching for the word new is done as follow.

Starting at the root node, the children are examined to find if one points to the first letter n.
Since the first letter exists on the first level, the process is continued recursively from that child root using the remaining part of the string, ew.
Since the second letter e exists among its children, its reference is followed, and recursion used to match w.
Once at the w node, there are no more letters remaining in the input, so this is the last node.
The isWord field of this node is true, indicating that the path to this node is a word contained in the lexicon.

Next, consider a search for ar.

While the path exists: it traces down the structure all the letters one by one,
the isWord flag on the last node is false,

indicating that path, ar, is not a word. It is, however, a prefix of other words in the trie: containsPrefix returns true given ar.

Finally, consider a search for nap.

The first letter n is a child of the root, but there is no a node as child of that node n.
The path for the string nap does not exist in the trie: nap is neither a word nor a prefix in this trie.

Adding

To add a new word into the trie it is necessary to implement the addWord method of the LexiconTrie.addWord needs to be developed in parallel with containsWord to be able to test your structure: the two methods have strong similarities in terms of the path they take. A word shouldn't be added if is already there.

Adding to a trie is a matter of

tracing out its path starting from the root, as if searching,
adding any part of the path that does not exist, i.e., missing nodes are inserted to the trie, and finally
setting the isWord flag to true for the final node.

Different cases are encountered when adding a new word.

Adding a new node for each letter.
For example, adding the word dot to our sample trie requires three new nodes, one for each letter.
Adding a few nodes only.
For example, adding the word news to our sample trie requires only an s child to the end of existing path for new.
Adding no node.
For example, adding the word do to our sample trie, after dot has been added, does not require any new nodes at all. Only the isWord flag on an existing node is set to true.

Here is the sample trie after those three words have been added:

A trie is an unusual data structure in that its performance can improve as it becomes more loaded.
Instead of slowing down as it gets full, it becomes faster to add words when they can share common prefix nodes with words already in the trie.

Removing

Once you have tested the add and contain methods for word and prefix, you are ready to implement the remove functionality. Removing a word in the trie requires

tracing its path so as to turn off the isWord flag of the word last node and
removing any part of the ex-word path that is now a dead end.

For example, if you remove the words zen and not from the following trie,

you obtain

Deleting unneeded nodes is tricky because of the recursive nature of the trie. Think about how we removed the last element of a single linked list: a reference to the second to last element was needed to update the references appropriately. A similar idea is needed for our trie.

General observations

There should never exist a leaf node whose isWord field is false. If a node has no children, i.e. is a leaf, and does not represent a valid word (i.e., isWord == false), then this node is not part of a path to a valid word in the trie and such nodes should be deleted when removing a word.
It is possible to remove a word from the trie without removing any nodes. For example, to remove the word new from the above trie, one need only to turn off isWord in the w node. All nodes along its path are still in use for other words.

Important note
When removing a word from the trie, the nodes on the path to the word only are needed. It would be extremely inefficient to check additional nodes that are not on the path.

Other Operations

You are also required to implement the following features.

Maintaining and returning the number of words, numWords stored in the LexiconTrie.
Reading words from a file using a Scanner to add to a trie automatically: addWordFromFile. You should be able to use the code in LexiconOrderedStructure to accomplish this.
Traversing the trie to list in order all the contained words in the lexicon, implementing an iterator for the trie data structure. An iterator permits a recursive exploration of all paths through the trie.
- Write a LexiconIterator class that implements the Iterator interface: the iterator returns the words (not prefixes) in alphabetical order.
- Refer to previous examples of iterator implementations. (Pre-computing all the paths and storing their results in an array is inappropriate as it is using the memory the trie purpose is to save)

Notes

Lexicon operations are case-insensitive: searching for words has the same behavior for both for upper and lowercase inputs. Be sure to take that into consideration when designing your data structure and algorithms.
Build and test incrementally. Develop your trie one method at a time and test as you go. The LexiconTester program exercises the lexicon and allows you to drive the testing interactively from the console.
Use method stubs as placeholders. The LexiconTester code makes calls to all of the public methods of the Lexicon interface. In order for the program to compile, you must have implementations of all the functions.
However, this doesn't suggest that you need to write all the code first and then attempt to debug it all at once.
Test on smaller data first.
- Use small.txt and small2.txt or create word data files of your own that test specific trie configurations.
- Use ospd2.txt a is very large word file for stress-testing once you have the basics in place. It would be overwhelming to try to debug with ospd2.txt.

Bonus

To augment the lexicon the following two methods can be added.

public Set<String> suggestCorrections(String target, int maxDistance), and
public Set<String> matchRegex(String pattern) .

Using the LexiconOrderedList implementation in the LexiconTester investigate the suggest & match features.

Suggesting Corrections

The main spelling correction feature is the suggestCorrections method. In the spelling checker the distance between two equal-length words is the number of letters in which they differ (comparing character by character). For example,

place and peace have distance 1 and
place and plank have distance 2.

Given a (possibly misspelled) target string and a maximum distance, this method gathers from the lexicon the set of words that have a distance to the target string less than or equal to the given maxDistance. The returned set contains all words in the lexicon that are the same length as the target string and are within the maximum distance.
For example, consider the original trie:

Calls to suggestCorrections with the following target string and maximum distance return the following results.

Target string	Max distance	Suggested corrections
`ben`	1	`zen`
`nat`	2	`new, not`

Here are a few examples of suggestCorrections run on a lexicon containing all the words in ospd2.txt:

Target string	Max distance	Suggested corrections
`crw`	1	`caw, cow, cry`
`zqwp`	2	`gawp, yawp`

Find suggested corrections by recursive traversal of the trie, gathering "neighbors" that are close enough to the target path.

Important note
Examining every word in the lexicon to determine if it is close enough is hopelessly inefficient. Generate suggestions by traversing the path of the target string taking "detours" only to neighbors within the maximum distance.

Matching Regular Expressions

Matching regular expressions is an additional feature you can implement. It also requires recursion. The matchRegex method takes as input a regular expression and returns the set of lexicon words that match it.

A regular expression describes a string-matching pattern: it is composed of ordinary alphabetic letters and wildcard charcaters.

Alphabetic letters in the pattern must match letters in the candidate word exactly.
Wildcard characters occur where the candidate word is allowed to vary.

The wildcard characters, we are considering in this lab, are the following two.

The * wildcard character matches any sequence of zero or more characters.
The ? wildcard character matches either zero or one character.

For example, consider the original trie.

Here are matches for some sample regular expression.

Regular expression	Matching words from lexicon
`a*`	`a, are, as`
`a?`	`a, as`
`e`	`are, new, zen`
`not`	`not`
`zabc?d`
`*o?`	`no, not`

Finding the words that match a regular expression requires careful application of your recursive programming skills.

Important note
You should not find suggestions by examining each word in the lexicon and seeing if it is a match. Instead, think about how to generate matches by traversing paths in the trie.

For non-wildcard characters, the paths grows as it does for traversing ordinary words.
On wildcard characters "fan out" the search to include all possibilities for that wildcard.
Start by implementing one of the two wildcard characters.

Submit

Due on Moodleby Thursday April 30th, 11:55 P.M. No late submission will be accepted. Submit the following four files LexiconTrie.java, LexiconNode.java, LexiconIterator and a readme.

The readme should include

your student information at the top and
the explanation for your LexiconIterator implementation, which internal data structure you used and how the cursor proceeds through the true,
If you have not fully implemented the requirements of this lab, then list the portions that are incomplete.

The program design grade will be based on the design choices you made in the implementation of the LexiconTrie class. The program style grade will be based on code formatting and appropriate use of Java naming conventions.