CMPS 260 Spring 2024
Prog. Assg. #1: Aho/Corasick Pattern Matching Machine
Due: 11:59pm, April 1

Background

First described in a seminal paper (to which you have access via Brightspace) appearing in the June 1975 issue of CACM (Communications of the ACM), an Aho/Corasick pattern matching machine (ACPM) is a DFA-like structure whose purpose is to make efficient the task of finding, within a given text, all occurrences of the members of a static, finite set of strings. Each string in this set is referred to as a keyword, and the set as a whole can be referred to as a lexicon.

An ACPM is essentially a trie (more descriptively called a prefix tree or a digital search tree) augmented with a failure function.

A trie is an edge-labeled tree that represents a finite lexicon. Let u be a prefix of some member uv of that lexicon. Then the trie has a node such that the labels on the edges of the path from the root to that node "spell out" u.

The first phase of construction of an ACPM takes a lexicon as input and builds the trie that represents it. As an example, consider the lexicon comprising this set of keywords:

{at, ate, era, hat, hate, hats, here, red, ten, tot}

Below is the corresponding trie; for the reader's convenience, each node is labeled by the keyword-prefix that it represents. (Indeed, we will refer to each node by that string.) Nodes depicted by a double circle are those corresponding to keywords. We call those final nodes.

The trie edges define what Aho and Corasick refer to as the goto function, which maps each (node, symbol) pair to a node. For example, in the trie above goto(hat, e) = hate and goto(hat, s) = hats. For every symbol z other than e or s, goto(hat, z) is undefined. For technical reasons, Aho and Corasick define (as an exception to this rule) goto(λ, z) = λ for every symbol z that does not label any outgoing edge from λ (which is the root node, of course).
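To make the goto function concrete, here is a minimal Java sketch of a trie built from the example lexicon. The names TrieSketch, Node, insert, find, and goTo are hypothetical, chosen only for illustration; they are not the provided ACPM_Node or ACPM_Lexicon_Builder classes, whose interfaces may differ. Representing "undefined" by null is likewise an assumption of this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not the course's ACPM_Node or ACPM_Lexicon_Builder.
final class TrieSketch {

    static final class Node {
        final String prefix;  // the keyword-prefix this node represents
        final Map<Character, Node> children = new TreeMap<>();  // outgoing edges, keyed by label
        boolean isFinal;      // true iff prefix is itself a keyword

        Node(String prefix) { this.prefix = prefix; }
    }

    final Node root = new Node("");  // the root represents the empty string (lambda)

    static TrieSketch of(List<String> keywords) {
        TrieSketch trie = new TrieSketch();
        for (String keyword : keywords) trie.insert(keyword);
        return trie;
    }

    // Walks down from the root, creating any missing nodes, then marks the last node final.
    void insert(String keyword) {
        Node current = root;
        for (char c : keyword.toCharArray()) {
            final Node parent = current;
            current = parent.children.computeIfAbsent(c, label -> new Node(parent.prefix + label));
        }
        current.isFinal = true;
    }

    // Returns the node whose root path spells out s, or null if s is not a keyword-prefix.
    Node find(String s) {
        Node current = root;
        for (char c : s.toCharArray()) {
            current = current.children.get(c);
            if (current == null) return null;
        }
        return current;
    }

    // The goto function: null means "undefined", except that the root loops to itself.
    Node goTo(Node node, char symbol) {
        Node child = node.children.get(symbol);
        return (child == null && node == root) ? root : child;
    }

    public static void main(String[] args) {
        TrieSketch trie = TrieSketch.of(List.of(
                "at", "ate", "era", "hat", "hate", "hats", "here", "red", "ten", "tot"));
        Node hat = trie.find("hat");
        System.out.println(trie.goTo(hat, 'e').prefix);            // hate
        System.out.println(trie.goTo(hat, 's').prefix);            // hats
        System.out.println(trie.goTo(hat, 'z') == null);           // true (undefined)
        System.out.println(trie.goTo(trie.root, 'z') == trie.root); // true (root self-loop)
    }
}
```

A sorted map per node keeps the sketch short; an array indexed by symbol would be a reasonable alternative when the alphabet is small and fixed.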

The second phase of construction is to compute the failure function, which maps each node x to its failure-node, failure(x). Let x be a node. Then failure(x) = v iff v is the longest proper suffix of x that is also a prefix of some keyword. (Note that the root node has no failure-node because λ has no proper suffix.)

Below is the same trie as shown above, but now each node's failure node is indicated, either explicitly or implicitly. Each red edge goes from a node to its failure node. All failure edges not shown go to the root node, λ. Also, each node's label now shows not only the keyword-prefix represented by the node but also (underneath it) the identity of its failure node. (λ is the failure node of every node representing a string of length one, so we omit those edges so as to avoid cluttering the diagram any further.)

One could employ a brute-force approach when computing each node's failure-node. Let x be a prefix of length n of some keyword. Then to determine the failure-node of node x, we can search in the trie for its suffix of length n-1, then for its suffix of length n-2, and so on, until finding one, say v, that is in the trie (and thus is a prefix of some keyword). Then failure(x) = v. If none of x's nonempty proper suffixes is in the trie, then failure(x) = λ.
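The brute-force search can be sketched in a few lines of Java. For brevity, this sketch stands in for the trie with the set of all keyword-prefixes (a string is "in the trie" exactly when it is a prefix of some keyword) rather than with linked node objects. The class and method names are hypothetical and are not part of the provided code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the brute-force approach; not the course's ACPM_FF_Builder.
final class BruteForceFailure {

    // Collects every prefix of every keyword, including the empty string (the root).
    static Set<String> prefixes(List<String> keywords) {
        Set<String> result = new HashSet<>();
        for (String keyword : keywords) {
            for (int len = 0; len <= keyword.length(); len++) {
                result.add(keyword.substring(0, len));
            }
        }
        return result;
    }

    // Tries the proper suffixes of x from longest (length n-1) to shortest (length 1).
    // Returns the first one that is a trie node, or "" (the root) if none is.
    static String failure(String x, Set<String> nodes) {
        if (x.isEmpty()) {
            throw new IllegalArgumentException("the root has no failure node");
        }
        for (int start = 1; start < x.length(); start++) {
            String suffix = x.substring(start);
            if (nodes.contains(suffix)) return suffix;
        }
        return "";
    }

    public static void main(String[] args) {
        Set<String> nodes = prefixes(List.of(
                "at", "ate", "era", "hat", "hate", "hats", "here", "red", "ten", "tot"));
        System.out.println(failure("hate", nodes));           // ate
        System.out.println(failure("here", nodes));           // re
        System.out.println(failure("hats", nodes).isEmpty()); // true (the root)
    }
}
```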

Aho and Corasick describe a much more efficient way of computing the failure function, however. Their algorithm does a breadth-first traversal of the trie, computing each node's failure-node along the way. This approach exploits the fact that, if we already know failure(x) for every node x satisfying |x| ≤ n, then we can use that information to quickly compute failure(y), where |y| = n+1.

Recall that there is a one-to-one correspondence between the nodes in the trie and the prefixes of keywords in the lexicon. Let y = xc, where x ∈ Σ^* and c ∈ Σ. Then every nonempty suffix of y is of the form uc, where u is some suffix of x. It follows that the longest proper suffix of y in the trie is wc, where w is the longest proper suffix of x that is in the trie and whose corresponding node has an outgoing edge labeled c. (It's possible that no such w exists, in which case the longest proper suffix of y in the trie is λ.)

Because a string of length one has no proper suffix except λ, clearly failure(c) = λ for all c ∈ Σ. For longer strings, the reasoning offered in the previous paragraph leads us to this algorithm for computing failure(y) for |y| > 1:

Let y = xc, where |x| = n and c ∈ Σ. Then y's parent node is x, of course, and the child-edge from x to y is labeled c. Starting at node x, follow a nonempty sequence of failure edges until reaching a node w having an outgoing child-edge labeled c (to node wc, of course). Then failure(y) = wc. If no such node w exists, then failure(y) = λ.

What follows are applications of the algorithm described in the previous paragraph to the hate, here, and hats nodes in the ACPM shown above. The assumption is that the failure-nodes of all nodes x, where |x| ≤ 3, have been computed already.

Example 1: Calculation of failure(hate). We follow the failure-edge from hate's parent, hat, which goes to at. That node has an outgoing child-edge labeled e (the last character in hate) to ate. Hence, failure(hate) = ate.

Example 2: Calculation of failure(here). We follow the failure-edge from here's parent, her, which goes to er. Because er has no outgoing child edge labeled e, we follow its failure-edge to node r. That node has an outgoing child-edge labeled e (the last character in here) to re. Hence, failure(here) = re.

Example 3: Calculation of failure(hats). We follow the failure-edge from hats's parent, hat, which goes to at. Because at has no outgoing child edge labeled s, we follow its failure-edge to node t. It, too, lacks an outgoing child edge labeled s, so we follow its failure-edge to λ. But λ has no outgoing child edge labeled s, either. At this point, there are no more failure edges to follow, so we conclude that failure(hats) = λ.
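The breadth-first computation and the three examples above can be checked with a short Java sketch. It is only one possible rendering of the algorithm under stated assumptions: the names FailureSketch, Node, buildTrie, computeFailures, and find are hypothetical, the root's failure field is left null to model "no failure-node", and none of this mirrors the structure of the provided ACPM_FF_Builder class.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

// Hypothetical sketch; not the course's ACPM_Node or ACPM_FF_Builder.
final class FailureSketch {

    static final class Node {
        final String prefix;  // the keyword-prefix this node represents
        final Map<Character, Node> children = new TreeMap<>();
        Node failure;         // stays null for the root, which has no failure node

        Node(String prefix) { this.prefix = prefix; }
    }

    // Phase one: build the trie for the lexicon and return its root.
    static Node buildTrie(List<String> keywords) {
        Node root = new Node("");
        for (String keyword : keywords) {
            Node current = root;
            for (char c : keyword.toCharArray()) {
                final Node parent = current;
                current = parent.children.computeIfAbsent(c, label -> new Node(parent.prefix + label));
            }
        }
        return root;
    }

    // Phase two: breadth-first traversal, so that shallower nodes are finished first.
    static void computeFailures(Node root) {
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node x = queue.remove();
            for (Map.Entry<Character, Node> edge : x.children.entrySet()) {
                char c = edge.getKey();
                Node y = edge.getValue();
                // Follow failure edges from x until some node w has a child-edge labeled c.
                Node w = x.failure;
                while (w != null && !w.children.containsKey(c)) {
                    w = w.failure;
                }
                // If the chain ran out (w == null), no such w exists and failure(y) is the root.
                y.failure = (w == null) ? root : w.children.get(c);
                queue.add(y);
            }
        }
    }

    // Returns the node representing s, or null if s is not a keyword-prefix.
    static Node find(Node root, String s) {
        Node current = root;
        for (char c : s.toCharArray()) {
            current = current.children.get(c);
            if (current == null) return null;
        }
        return current;
    }

    public static void main(String[] args) {
        Node root = buildTrie(List.of(
                "at", "ate", "era", "hat", "hate", "hats", "here", "red", "ten", "tot"));
        computeFailures(root);
        System.out.println(find(root, "hate").failure.prefix);   // ate
        System.out.println(find(root, "here").failure.prefix);   // re
        System.out.println(find(root, "hats").failure == root);  // true
    }
}
```

Because the root's failure field is null, the children of the root fall straight through the loop and receive the root as their failure node, which matches failure(c) = λ for every single-symbol node.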

Your Task

Provided, in full, are the following Java artifacts:

PatternMatcher: Java interface that specifies the methods that a "pattern matching machine" should include.
ACPM: Java class that implements PatternMatcher. Instances of this class are specifically Aho/Corasick pattern matching machines, as described above (and in their 1975 CACM paper).
ACPM_Tester: Java application whose purpose is to test the ACPM class and the others that support it.
keywords.txt: file containing a collection of words that could serve as the lexicon for an ACPM (and could be used as input by the ACPM_Tester program).
story.txt: file containing text that could be used by the ACPM_Tester program for the purpose of finding occurrences of keywords within it.
output.txt: file containing the output produced by the findMatches() method (in the ACPM_Tester application) when the lexicon is as described by the keywords.txt file and the text is that in the story.txt file.

The following are Java classes that have some missing pieces that are to be supplied by the student:

ACPM_Node: Instances of this class represent nodes within an ACPM. Three of its methods are stubbed, one of which is only to aid in debugging.
ACPM_Lexicon_Builder: An instance of this class is for the purpose of augmenting the structure of an under-construction ACPM to expand its lexicon (i.e., the set of keywords that it can match against subject text). Two of its methods are stubbed, one of which is only meant as a debugging aid.
ACPM_FF_Builder: This class has a method, to be completed by the student, that computes the failure function of an ACPM whose lexicon already has been established, thus completing its construction. The mentioned method includes comments that provide guidance to the student.
