CMPS 260 Spring 2024
Prog. Assg. #1: Aho/Corasick Pattern Matching Machine
Due: 11:59pm, April 1

Background

First described in a seminal paper (to which you have access via Brightspace) appearing in the June 1975 issue of CACM (Communications of the ACM), an Aho/Corasick pattern matching machine (ACPM) is a DFA-like structure whose purpose is to make efficient the task of finding, within a given text, all occurrences of the members of a static, finite set of strings. Each string in this set is referred to as a keyword, and the set as a whole can be referred to as a lexicon.

An ACPM is essentially a trie (more descriptively called a prefix tree or a digital search tree) augmented with a failure function.

A trie is an edge-labeled tree that represents a finite lexicon. Let u be a prefix of some member uv of that lexicon. Then the trie has a node such that the labels on the edges of the path from the root to that node "spell out" u.

The first phase of construction of an ACPM takes a lexicon as input and builds the trie that represents it. As an example, consider the lexicon comprising this set of keywords:

{at, ate, era, hat, hate, hats, here, red, ten, tot}

Below is the corresponding trie; for the reader's convenience, each node is labeled by the keyword-prefix that it represents. (Indeed, we will refer to each node by that string.) Nodes depicted by a double circle are those corresponding to keywords. We call those final nodes.
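The first phase can be sketched in Java as follows. This is only an illustrative sketch: the names TrieSketch, Node, build, and size are hypothetical and are not taken from the Java classes provided with this assignment, and representing child edges with a Map is an assumption made for brevity.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of phase one: building the trie that represents a lexicon. */
class TrieSketch {

    /** One trie node (illustrative; not one of the provided classes). */
    static class Node {
        final String prefix;                                    // keyword-prefix spelled out by the path from the root
        final Map<Character, Node> children = new TreeMap<>();  // outgoing child edges, keyed by label
        boolean isFinal;                                        // true iff prefix is itself a keyword

        Node(String prefix) { this.prefix = prefix; }
    }

    /** Builds the trie for the given keywords and returns its root (the node for the empty string). */
    static Node build(String... keywords) {
        Node root = new Node("");
        for (String keyword : keywords) {
            Node current = root;
            for (char c : keyword.toCharArray()) {
                Node child = current.children.get(c);
                if (child == null) {                            // no edge labeled c yet, so create one
                    child = new Node(current.prefix + c);
                    current.children.put(c, child);
                }
                current = child;
            }
            current.isFinal = true;                             // the whole keyword ends at this node
        }
        return root;
    }

    /** Counts the nodes in the subtree rooted at the given node. */
    static int size(Node node) {
        int count = 1;
        for (Node child : node.children.values()) {
            count += size(child);
        }
        return count;
    }
}
```

For the ten-keyword lexicon above, TrieSketch.build produces 23 nodes: the root plus one node for each of the 22 distinct nonempty keyword-prefixes.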

The trie edges define what Aho and Corasick refer to as the goto function, which maps each (node, symbol) pair to a node. For example, in the trie above goto(hat, e) = hate and goto(hat, s) = hats. For every symbol z other than e or s, goto(hat, z) is undefined. For technical reasons, Aho and Corasick define (as an exception to this rule) goto(λ, z) = λ for every symbol z that does not label any outgoing edge from λ (which is the root node, of course).
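The goto function, including the exception at the root, might be sketched as below. The names GotoSketch, gotoFunction, and find are hypothetical (goto is a reserved word in Java, so the method cannot be called that), and returning null to signal "undefined" is an assumption of this sketch, not something prescribed by the paper or by the provided classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the goto function defined by the trie edges. */
class GotoSketch {

    static class Node {
        final String prefix;
        final Map<Character, Node> children = new HashMap<>();

        Node(String prefix) { this.prefix = prefix; }
    }

    final Node root = new Node("");

    GotoSketch(String... keywords) {
        for (String keyword : keywords) {
            Node current = root;
            for (char c : keyword.toCharArray()) {
                Node child = current.children.get(c);
                if (child == null) {
                    child = new Node(current.prefix + c);
                    current.children.put(c, child);
                }
                current = child;
            }
        }
    }

    /** Returns goto(node, symbol), or null where the function is undefined (an assumption of this sketch). */
    Node gotoFunction(Node node, char symbol) {
        Node target = node.children.get(symbol);
        if (target == null && node == root) {
            return root;            // the exception: the root loops to itself on unmatched symbols
        }
        return target;
    }

    /** Returns the node representing the given keyword-prefix, or null if there is none. */
    Node find(String prefix) {
        Node current = root;
        for (char c : prefix.toCharArray()) {
            current = current.children.get(c);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```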

The second phase of construction is to compute the failure function, which maps each node x to its failure-node, failure(x). Let x be a node. Then failure(x) = v iff v is the longest proper suffix of x that is also a prefix of some keyword. (Note that the root node has no failure-node because λ has no proper suffix.)

Below is the same trie as shown above, but now each node's failure node is indicated, either explicitly or implicitly. Each red edge goes from a node to its failure node. All failure edges not shown go to the root node, λ. Also, each node's label now shows not only the keyword-prefix represented by the node but also (underneath it) the identity of its failure node. (λ is the failure node of every node representing a string of length one, so we omit those edges so as to avoid cluttering the diagram any further.)

One could employ a brute-force approach when computing each node's failure-node. Let x be a prefix of length n of some keyword. Then to determine the failure-node of node x, we can search in the trie for its suffix of length n-1, then for its suffix of length n-2, and so on, until finding one, say v, that is in the trie (and thus is a prefix of some keyword). Then failure(x) = v. If none of x's nonempty proper suffixes is in the trie, then failure(x) = λ.
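The brute-force approach just described could look something like the following sketch. The names BruteForceFailureSketch, contains, and failure are hypothetical, and for simplicity the sketch identifies each node by the string it represents (with the empty string standing for λ) rather than returning a node object.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the brute-force failure computation. */
class BruteForceFailureSketch {

    static class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    final Node root = new Node();

    BruteForceFailureSketch(String... keywords) {
        for (String keyword : keywords) {
            Node current = root;
            for (char c : keyword.toCharArray()) {
                current = current.children.computeIfAbsent(c, k -> new Node());
            }
        }
    }

    /** Returns true iff s is in the trie, i.e., s is a prefix of some keyword. */
    boolean contains(String s) {
        Node current = root;
        for (char c : s.toCharArray()) {
            current = current.children.get(c);
            if (current == null) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns the string represented by failure(x), where x is assumed to be a
     * nonempty keyword-prefix. The empty string stands for the root.
     */
    String failure(String x) {
        // Try the proper suffixes of x from longest (length n-1) to shortest (length 1).
        for (int start = 1; start < x.length(); start++) {
            String suffix = x.substring(start);
            if (contains(suffix)) {
                return suffix;
            }
        }
        return "";                  // no nonempty proper suffix is in the trie
    }
}
```

Each call may perform up to n-1 separate searches from the root, which is the repeated work that the more efficient approach avoids.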

Aho and Corasick describe a much more efficient way of computing the failure function, however. Their algorithm does a breadth-first traversal of the trie, computing each node's failure-node along the way. This approach exploits the fact that, if we already know failure(x) for every node x satisfying |x| ≤ n, then we can use that information to quickly compute failure(y), where |y| = n+1.

Recall that there is a one-to-one correspondence between the nodes in the trie and prefixes of keywords in the lexicon. Let y = xc, where x ∈ Σ* and c ∈ Σ. Then every nonempty suffix of y = xc is of the form uc, where u is some suffix of x. It follows that the longest proper suffix of y = xc in the trie is wc, where w is the longest proper suffix of x whose corresponding node has an outgoing edge labeled c. (It's possible that no such w exists, in which case the longest proper suffix of y = xc in the trie is λ.)

Because a string of length one has no proper suffix except λ, clearly failure(c) = λ for all c ∈ Σ that label an outgoing edge from the root. For longer strings, the reasoning offered in the previous paragraph leads us to this algorithm for computing failure(y) for |y| > 1:

Let y = xc, where |x| = n and c ∈ Σ. Then y's parent node is x, of course, and the child-edge from x to y is labeled c. Starting at node x, follow a nonempty sequence of failure edges until reaching a node w having an outgoing child-edge labeled c (to node wc, of course). Then failure(y) = wc. If no such node w exists, then failure(y) = λ.

What follows are applications of the algorithm described in the previous paragraph to the hate, here, and hats nodes in the ACPM shown above. The assumption is that the failure-nodes of all nodes x, where |x| ≤ 3, have been computed already.

Example 1: Calculation of failure(hate). We follow the failure-edge from hate's parent, hat, which goes to at. That node has an outgoing child-edge labeled e (the last character in hate) to ate. Hence, failure(hate) = ate.

Example 2: Calculation of failure(here). We follow the failure-edge from here's parent, her, which goes to er. Because er has no outgoing child edge labeled e, we follow its failure-edge to node r. That node has an outgoing child-edge labeled e (the last character in here) to re. Hence, failure(here) = re.

Example 3: Calculation of failure(hats). We follow the failure-edge from hats's parent, hat, which goes to at. Because at has no outgoing child edge labeled s, we follow its failure-edge to node t. It, too, lacks an outgoing child edge labeled s, so we follow its failure-edge to λ. But λ has no outgoing child edge labeled s, either. At this point, there are no more failure edges to follow, so we conclude that failure(hats) = λ.
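The breadth-first computation illustrated by the three examples can be sketched in Java as below. This is a sketch under stated assumptions, not the code you are asked to complete: the names BfsFailureSketch, Node, computeFailures, and find are hypothetical, and the root's failure field is simply left null to reflect that the root has no failure-node.

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

/** Hypothetical sketch of the breadth-first failure computation. */
class BfsFailureSketch {

    static class Node {
        final String prefix;
        final Map<Character, Node> children = new TreeMap<>();
        Node failure;               // stays null for the root, which has no failure-node

        Node(String prefix) { this.prefix = prefix; }
    }

    final Node root = new Node("");

    BfsFailureSketch(String... keywords) {
        for (String keyword : keywords) {
            Node current = root;
            for (char c : keyword.toCharArray()) {
                Node child = current.children.get(c);
                if (child == null) {
                    child = new Node(current.prefix + c);
                    current.children.put(c, child);
                }
                current = child;
            }
        }
        computeFailures();
    }

    private void computeFailures() {
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.children.values()) {
            child.failure = root;   // strings of length one fail to the root
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node x = queue.remove();                        // x's failure-node is already known
            for (Map.Entry<Character, Node> edge : x.children.entrySet()) {
                char c = edge.getKey();
                Node y = edge.getValue();                   // y = xc
                Node w = x.failure;
                // Follow failure edges from x until some node has a child edge labeled c.
                while (w != null && !w.children.containsKey(c)) {
                    w = w.failure;
                }
                y.failure = (w == null) ? root : w.children.get(c);
                queue.add(y);
            }
        }
    }

    /** Returns the node representing the given keyword-prefix, or null if there is none. */
    Node find(String prefix) {
        Node current = root;
        for (char c : prefix.toCharArray()) {
            current = current.children.get(c);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

Built from the example lexicon, this sketch agrees with the three worked examples: the failure-nodes of hate, here, and hats are ate, re, and the root, respectively.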


Your Task

Provided, in full, are the following Java artifacts:

The following are Java classes that have some missing pieces that are to be supplied by the student: