Lexicon

In this homework you develop a data structure to manage a lexicon, storing efficiently in memory all the words of a dictionary. All modern word processors spell checkers use such lexicon to suggest alternatives to misspelled words‡. We could reuse this data structure for another application that mimics a spell checker tool (talk to me if you are interested to do this).

The objectives of this homework are

  1. to create a recursive tree structure,
  2. to write efficient algorithms for accessing it,
  3. to design and implement advanced functionality for a recursive tree structure and
  4. to develop sophisticated and efficient traversal algorithms.
‡ This homework is based on an assignment originally created by Julie Zelenski, Justin Manus, and Mehran Sahami at Stanford University, modified by Jeannie Albrecht and Steve Freund at Williams College, and by James Teresco at Mount Holyoke College.

It is a violation to the Honor Code not to acknowledge others' work fully and openly. List all your collaborators and any person that helped you. You should cite any online resources you consulted to help you solve this project.
You may NOT directly copy any code or any text from anywhere.

Getting Started

First, carefully read this handout. You will have questions, so leave plenty of time to discuss this homework with the course staff.

You are encouraged to write a design document that describes your plan before you proceed with your implementation. Bring your design document when you come to ask questions: we will request to see it to help you better.

The provided code contains a demo that uses bad implementations of a lexicon. Their code are included, using either a LexiconOrderedList object or a LexiconOrderedVector object. The demo purpose is to illustrate the power of class/interface design and to demonstrate the behaviour expected at the command line interface.

Your task is to implement a data structure, a LexiconTrie, that provides the same features more efficiently: running faster and using less memory. To run YOUR code in LexiconTester.java switch the commented lines as follow

//Lexicon lex = new LexiconOrderedList(); //Lexicon lex = new LexiconOrderedVector(); 
Lexicon lex = new LexiconTrie();

Inefficient Implementations

The provided program contains two functional, but inefficient, implementations of a lexicon, along with a tester program, LexiconTester.java, that for now uses the inefficient data structures.

Each bad solution implements the Lexicon interface. Your solution will also implement the Lexicon interface.

  1. LexiconOrderedVector uses an OrderedVector.
  2. LexiconOrderedList uses an OrderedList.

Note that most of the code for them is the same in both implementation and has been factored out into the LexiconOrderedStructure abstract class. Familiarize yourself with the provided code by looking at the definition of the interfaces and (abstract) classes. Don't study them in details. Large parts of the implementation of the LexiconTester class are fairly complex and interesting but unimportant in its detail for our purpose.

Focus your attention on the LexiconTester interaction with the Lexicon classes, i.e., run the main method of the LexiconTester class and experiment with the console interface to create, explore and manipulate a lexicon instance.

Word Lists

To initialize the lexicon structure automatically you are provided with three files that contain word lists.

Load the small examples to create a lexicon; modify the structure adding/removing words and displaying its content.

The Lexicon Trie

Your task is to implement the classes LexiconTrie.java and LexiconNode.java to manage a lexicon efficiently.
The LexiconTrie implements a special kind of tree called a trie. The term trie comes from re-trie-val. This term was coined by Edward Fredkin, who pronounces it /#tri#/ "tree". However, other authors pronounced it /#tra#/ "try", in an attempt to verbally distinguish it from "tree". The zip file contains a pdf describing formally a Trie: read it to understand the formal definition of a trie.

LexiconTrie.java implements the Lexicon interface provided in the starter code. Some methods are straightforward and intuitive, others are more difficult: the methods you are required to implement are the fundamental ones and are relatively simple; other methods are there for challenge. Do the latter to earn bonus points! For information about the usage and purpose of these methods, read the comments in Lexicon.java. For implementation details read below.

To make use of your LexiconTrie in the LexiconTester program, don't forget to change the call in LexiconTester to construct a LexiconTrie (instead of a LexiconOrderedList).

The Trie Structure

A trie is a letter-tree that efficiently stores strings (in our case, words). A node in a trie represents a letter. A path through the trie traces out a sequence of letters that represents a prefix or word in the lexicon.

Unlike a binary tree, each trie node has up to 26 child references (one for each letter of the alphabet). While a binary search tree eliminates half the words with a left or right turn at each step in a search, a trie eliminates 25/26 possible continuations as the search follows the child reference for the next letter, which narrows the search to just those words starting with that letter.

For example, from the root, any words that begin with n are found by following the reference to the n child node. From there, following o leads to just those words that begin with no, and so on recursively.

If two words have the same prefix, they share that initial part of their paths, which saves space because many words shared prefixes.

Each node has a boolean isWord flag which indicates that the path taken from the root to this node represents a word.
Here is a picture of a small trie:

The red outline of a node indicates its isWord flag is true. In effect, this trie contains the following words in order:

        a
        are
        as
        new
        no
        not
        zen
Strings such as ze or ar are not valid words for this trie because the path for those strings ends at a node where isWord is false. Any path not drawn is assumed not to exist, so strings such as cat or next are not valid because there is no such path in this trie.

Like other trees, a trie is a recursive data structure. All of the children of a given trie node are themselves smaller tries.

Design Discussion

For each node in the trie, you need a list of references to child nodes. In the sample trie drawn above, the root node has three children, one each for the letters a, n, and z.

One possibility for storing the references to children is a statically-sized 26-member array of references to nodes, where
array[0] is the child for a,
array[1] refers to b,
... and
array[25] refers to z.

When there is no child for a given letter, (as from b to m above) the array entry would be null. This arrangement makes it trivial to find the child for a given letter--you simply access the correct element in the array indexing by letter.

However, for most nodes within the trie, very few of the 26 references are needed, so using a largely null 26-element array may be too expensive. Better alternatives to keep child references exist. The choice of a space-efficient design is up to you, but you should explain your idea in your design document and justify your final choice in your readme.

Two things you may want to consider:

Implementation

LexiconNode

Begin implementing your trie by constructing a LexiconNode class. Note that it is efficient to keep your LexiconNodes in sorted order, so implementing the Comparable interface is important. This functionality is provided.

After implementing the LexiconNode class, you may define a constructor in LexiconTrie that creates an empty LexiconTrie to represent an empty word list.

Searching

To search for a word in the trie it is necessary to implement the containsWord method of LexiconTrie. To search for a prefix the method containsPrefix must be implemented.

Searching the trie is a matter of tracing out the path letter by letter. Let's consider three examples on the sample trie shown above.

Searching for the word new is done as follow.

  1. Starting at the root node, the children are examined to find if one points to the first letter n.
  2. Since the first letter exists on the first level, the process is continued recursively from that child root using the remaining part of the string, ew.
  3. Since the second letter e exists among its children, its reference is followed, and recursion used to match w.
  4. Once at the w node, there are no more letters remaining in the input, so this is the last node.
  5. The isWord field of this node is true, indicating that the path to this node is a word contained in the lexicon.

Next, consider a search for ar.

indicating that path, ar, is not a word. It is, however, a prefix of other words in the trie: containsPrefix returns true given ar.

Finally, consider a search for nap.

Adding

To add a new word into the trie it is necessary to implement the addWord method of the LexiconTrie.addWord needs to be developed in parallel with containsWord to be able to test your structure: the two methods have strong similarities in terms of the path they take. A word shouldn't be added if is already there.

Adding to a trie is a matter of

  1. tracing out its path starting from the root, as if searching,
  2. adding any part of the path that does not exist, i.e., missing nodes are inserted to the trie, and finally
  3. setting the isWord flag to true for the final node.
Different cases are encountered when adding a new word. Here is the sample trie after those three words have been added:

A trie is an unusual data structure in that its performance can improve as it becomes more loaded.
Instead of slowing down as it gets full, it becomes faster to add words when they can share common prefix nodes with words already in the trie.

Removing

Once you have tested the add and contain methods for word and prefix, you are ready to implement the remove functionality. Removing a word in the trie requires For example, if you remove the words zen and not from the following trie,

you obtain

Deleting unneeded nodes is tricky because of the recursive nature of the trie. Think about how we removed the last element of a single linked list: a reference to the second to last element was needed to update the references appropriately. A similar idea is needed for our trie.

General observations

Important note
When removing a word from the trie, the nodes on the path to the word only are needed. It would be extremely inefficient to check additional nodes that are not on the path.

Other Operations

You are also required to implement the following features.
  1. Maintaining and returning the number of words, numWords stored in the LexiconTrie.
  2. Reading words from a file using a Scanner to add to a trie automatically: addWordFromFile. You should be able to use the code in LexiconOrderedStructure to accomplish this.
  3. Traversing the trie to list in order all the contained words in the lexicon, implementing an iterator for the trie data structure. An iterator permits a recursive exploration of all paths through the trie.
    • Write a LexiconIterator class that implements the Iterator interface: the iterator returns the words (not prefixes) in alphabetical order.
    • Refer to previous examples of iterator implementations. (Pre-computing all the paths and storing their results in an array is inappropriate as it is using the memory the trie purpose is to save)

Notes

Bonus

To augment the lexicon the following two methods can be added.
  1. public Set<String> suggestCorrections(String target, int maxDistance), and
  2. public Set<String> matchRegex(String pattern) .
Using the LexiconOrderedList implementation in the LexiconTester investigate the suggest & match features.

Suggesting Corrections

The main spelling correction feature is the suggestCorrections method. In the spelling checker the distance between two equal-length words is the number of letters in which they differ (comparing character by character). For example, Given a (possibly misspelled) target string and a maximum distance, this method gathers from the lexicon the set of words that have a distance to the target string less than or equal to the given maxDistance. The returned set contains all words in the lexicon that are the same length as the target string and are within the maximum distance.
For example, consider the original trie:

Calls to suggestCorrections with the following target string and maximum distance return the following results.

Target string Max distance Suggested corrections
ben 1 zen
nat 2 new, not

Here are a few examples of suggestCorrections run on a lexicon containing all the words in ospd2.txt:

Target string Max distance Suggested corrections
crw 1 caw, cow, cry
zqwp 2 gawp, yawp

Find suggested corrections by recursive traversal of the trie, gathering "neighbors" that are close enough to the target path.

Important note
Examining every word in the lexicon to determine if it is close enough is hopelessly inefficient. Generate suggestions by traversing the path of the target string taking "detours" only to neighbors within the maximum distance.

Matching Regular Expressions

Matching regular expressions is an additional feature you can implement. It also requires recursion. The matchRegex method takes as input a regular expression and returns the set of lexicon words that match it.

A regular expression describes a string-matching pattern: it is composed of ordinary alphabetic letters and wildcard charcaters.

The wildcard characters, we are considering in this lab, are the following two.

  1. The * wildcard character matches any sequence of zero or more characters.
  2. The ? wildcard character matches either zero or one character.

For example, consider the original trie.

Here are matches for some sample regular expression.

Regular expression Matching words from lexicon
a* a, are, as
a? a, as
*e* are, new, zen
not not
z*abc?*d
*o? no, not

Finding the words that match a regular expression requires careful application of your recursive programming skills.

Important note
You should not find suggestions by examining each word in the lexicon and seeing if it is a match. Instead, think about how to generate matches by traversing paths in the trie.

Submit

Due on Moodleby Thursday April 30th, 11:55 P.M. No late submission will be accepted. Submit the following four files LexiconTrie.java, LexiconNode.java, LexiconIterator and a readme.

The readme should include

The program design grade will be based on the design choices you made in the implementation of the LexiconTrie class. The program style grade will be based on code formatting and appropriate use of Java naming conventions.