CMPS 260 (Theoretical Foundations of CS)
The CYK Algorithm
The CYK algorithm (named for Cocke, Younger, and Kasami, each of whom developed it independently of the others in the mid-1960s) solves the membership problem for context-free grammars in Chomsky Normal Form (CNF). That is, given as input a CFG $G$ in CNF and a string $w$, the algorithm determines whether or not $w \in L(G)$. Because any context-free grammar can be transformed into CNF (with the possible loss of the empty string from the generated language), this gives us a way of solving the membership problem for all CFGs. Assume that CYK is a boolean function that, given a CNF grammar and a string, returns true iff the string is generated by the grammar. Then, for an arbitrary CFG $G$ and string $w$, membership can be decided as follows:
function Member_of( G : CFG; w : string ) return boolean is
begin
   if w = ε then   -- ε denotes the empty string
      if S is erasable then   -- S is the start symbol of G; erasable means S derives ε
         return true;
      else
         return false;
      end if;
   else   -- w /= ε
      G' := CNF grammar generating L(G) - {ε};
      return CYK(G',w);
   end if;
end Member_of;
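The test "S is erasable" can be decided by a simple fixed-point computation. The following Python sketch is one way to do it; the representation of a grammar as a list of (left-hand side, right-hand side) pairs, and the name `nullable_nonterminals`, are assumptions made for this illustration and are not part of the algorithm above.

```python
def nullable_nonterminals(rules):
    """Return the set of nonterminals that can derive the empty string.

    Assumed representation: `rules` is a list of (lhs, rhs) pairs, where
    rhs is a tuple of symbols and the empty tuple () stands for an
    epsilon rule.
    """
    nullable = set()
    changed = True
    while changed:                      # repeat until no new symbol is added
        changed = False
        for lhs, rhs in rules:
            # lhs is erasable if every symbol of rhs is erasable
            # (vacuously true when rhs is empty).
            if lhs not in nullable and all(sym in nullable for sym in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


# Hypothetical grammar: S -> A B | a,  A -> epsilon | a,  B -> A A | b
example_rules = [
    ("S", ("A", "B")), ("S", ("a",)),
    ("A", ()), ("A", ("a",)),
    ("B", ("A", "A")), ("B", ("b",)),
]
erasable = nullable_nonterminals(example_rules)
start_is_erasable = "S" in erasable
```

In this hypothetical grammar all three nonterminals are erasable, so Member_of would return true for the empty string.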
Recall that a context-free grammar is said to be in Chomsky normal form if every one of its rules has one of the two forms $A \to b$ or $A \to BC$, where $b$ is a terminal symbol and $B$ and $C$ are nonterminals.
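As a concrete reading of this definition, the following Python sketch checks whether a grammar is in CNF. The grammar encoding (a list of (lhs, rhs) pairs plus an explicit set of nonterminals) and the name `is_cnf` are assumptions chosen for illustration.

```python
def is_cnf(rules, nonterminals):
    """Check that every rule has the form A -> b or A -> B C.

    Assumed representation: `rules` is a list of (lhs, rhs) pairs with rhs
    a tuple of symbols; any symbol not in `nonterminals` is a terminal.
    """
    for lhs, rhs in rules:
        # Form A -> b: exactly one symbol, and it is a terminal.
        one_terminal = len(rhs) == 1 and rhs[0] not in nonterminals
        # Form A -> B C: exactly two symbols, both nonterminals.
        two_nonterminals = len(rhs) == 2 and all(s in nonterminals for s in rhs)
        if lhs not in nonterminals or not (one_terminal or two_nonterminals):
            return False
    return True


# Hypothetical grammar in CNF: S -> A B, A -> a, B -> b
cnf_rules = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",))]
cnf_ok = is_cnf(cnf_rules, {"S", "A", "B"})

# Adding S -> a B mixes a terminal with a nonterminal, which CNF forbids.
not_cnf_ok = is_cnf(cnf_rules + [("S", ("a", "B"))], {"S", "A", "B"})
```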
The CYK algorithm is based on the following. Let $G$ be a context-free grammar in Chomsky normal form, and let $w = a_1 a_2 \cdots a_n$ ($a_i \in \Sigma$) be a string over the terminal alphabet $\Sigma$ of $G$. For $i$ and $j$ satisfying $1 \le i \le j \le n$, let $w_{i,j}$ denote the substring $a_i a_{i+1} \cdots a_j$ of $w$ beginning with its $i$-th symbol and ending with its $j$-th symbol, and let $N_{i,j}$ denote the set of nonterminals in $G$ from which $w_{i,j}$ can be derived. That is,
$$N_{i,j} = \{\, A : A \Rightarrow^* w_{i,j} \,\}$$
Lemma 1: For all $i$, $N_{i,i} = \{\, A : A \to a_i \text{ is a rule in } G \,\}$.
Proof: Because $G$ is in CNF, the only way that a string of length one can be derived from a nonterminal symbol is via an application of a rule of the form $A \to b$.
Lemma 2: For all $i$ and $j$ satisfying $1 \le i < j \le n$, $A \in N_{i,j}$ if and only if there exist nonterminals $B$ and $C$ and a number $k$ satisfying $i \le k < j$ such that $A \to BC$ is a rule in $G$, $B \in N_{i,k}$, and $C \in N_{k+1,j}$.
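Proof: Sufficiency (if): Suppose that $A \to BC$ is a rule in $G$, $B \in N_{i,k}$, and $C \in N_{k+1,j}$, where $i \le k < j$. By the definition of these sets, $B \Rightarrow^* w_{i,k}$ and $C \Rightarrow^* w_{k+1,j}$. Hence
$$A \Rightarrow BC \Rightarrow^* w_{i,k}\,C \Rightarrow^* w_{i,k}\,w_{k+1,j} = w_{i,j},$$
so $A \in N_{i,j}$.
Necessity (only if): Suppose that $A \in N_{i,j}$, that is, $A \Rightarrow^* w_{i,j}$. Because $i < j$, the string $w_{i,j}$ has length at least two, so the first step of the derivation cannot apply a rule of the form $A \to b$; it must apply a rule of the form $A \to BC$. Thus $A \Rightarrow BC \Rightarrow^* w_{i,j}$, which means that $w_{i,j} = xy$ for some strings $x$ and $y$ with $B \Rightarrow^* x$ and $C \Rightarrow^* y$. No nonterminal of a CNF grammar derives the empty string, so $x$ and $y$ are both nonempty. Hence $x = w_{i,k}$ and $y = w_{k+1,j}$ for some $k$ satisfying $i \le k < j$, which gives $B \in N_{i,k}$ and $C \in N_{k+1,j}$.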
End of proof of Lemma 2.
Lemma 3: $w \in L(G)$ if and only if $S \in N_{1,n}$, where $S$ is the start symbol of $G$.
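Proof: By the definition of $L(G)$, $w \in L(G)$ if and only if $S \Rightarrow^* w$. Because $w = w_{1,n}$, this holds if and only if $S \Rightarrow^* w_{1,n}$, which, by the definition of $N_{1,n}$, holds if and only if $S \in N_{1,n}$.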
From Lemma 3, it follows that, in order to determine whether $w \in L(G)$, it suffices to compute the set $N_{1,n}$ and then to check whether $S$ is a member of that set. But how can we compute $N_{1,n}$? Lemmas 1 and 2 provide strong suggestions. Lemma 1 tells us that, for any $i$, in order to compute $N_{i,i}$ it suffices to examine each rule in $G$. Lemma 2 tells us that, for any $i$ and $m$ such that $1 \le i < i+m \le n$, in order to compute $N_{i,i+m}$ it suffices to examine each rule in $G$, as well as the sets $N_{i,k}$ and $N_{k+1,i+m}$ for $k$ satisfying $i \le k < i+m$. From this we conclude that the "correct" order in which to compute the $N_{i,i+m}$'s is in increasing order of $m$. That way, each time we are to compute a particular set $N_{i,i+m}$, all the sets $N_{i,k}$ and $N_{k+1,i+m}$ ($i \le k < i+m$) on which its value depends have been computed already.
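The ordering argument can be made concrete with a short Python sketch that lists, for a string of length n, each pair (i, i+m) in increasing order of m together with the pairs of sets that Lemma 2 says it depends on. The function name `fill_order` is hypothetical and serves only this illustration.

```python
def fill_order(n):
    """Yield ((i, j), splits) for every 1 <= i < j <= n, shortest spans first.

    `splits` lists the pairs ((i, k), (k+1, j)) for i <= k < j, i.e. the
    index pairs of the sets N[i,k] and N[k+1,j] that N[i,j] depends on.
    """
    for m in range(1, n):                 # m = j - i, in increasing order
        for i in range(1, n - m + 1):
            j = i + m
            yield (i, j), [((i, k), (k + 1, j)) for k in range(i, j)]


# For n = 3 the spans are visited as (1,2), (2,3), then (1,3).
order_for_3 = list(fill_order(3))
```

Every pair listed in `splits` covers a shorter substring than (i, j) itself, which is why processing spans by increasing m guarantees that the needed sets already exist.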
We arrive at the following algorithm. (For ease of typesetting, in the algorithm we enclose subscripts within square brackets.)
CYK Algorithm.
Input: CFG G in CNF, string w = a[1] a[2] a[3] ... a[n]
Output: YES if w in L(G), NO otherwise
for i in 1..n loop
   N[i,i] := empty set;
   for each nonterminal A in G loop
      if A --> a[i] is a production in G then   --note that a[i] = w[i,i]
         insert A into N[i,i];
      end if;
   end loop;
end loop;
for m in 1..n-1 loop
   --compute N[i,i+m] for i satisfying 1 <= i <= n-m
   for i in 1..n-m loop
      --compute N[i,i+m]
      N[i,i+m] := empty set;
      for k in i..i+m-1 loop
         for each production A --> BC in G loop
            if B is in N[i,k] and C is in N[k+1,i+m] then
               insert A into N[i,i+m];
            end if;
         end loop;
      end loop;
   end loop;
end loop;
if S is in N[1,n] then   --w in L(G) iff S is in N[1,n]
   return YES;
else
   return NO;
end if;
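For readers who want to experiment, here is a minimal Python sketch of the algorithm above. It keeps the 1-based indices of the pseudocode; the encoding of rules as (lhs, rhs) pairs with rhs a tuple of symbols, the function name `cyk`, and the example grammar are all assumptions made for this illustration.

```python
from collections import defaultdict


def cyk(rules, start, w):
    """Return True iff the CNF grammar `rules` with start symbol `start` generates w.

    Assumed representation: each rule is (lhs, rhs), where rhs is either a
    1-tuple holding a terminal character or a 2-tuple of nonterminals.
    """
    n = len(w)
    if n == 0:
        return False                      # a CNF grammar never generates the empty string
    N = defaultdict(set)                  # N[i, j] with 1 <= i <= j <= n

    # Lemma 1: N[i,i] holds every A with a rule A -> a[i].
    for i in range(1, n + 1):
        for lhs, rhs in rules:
            if rhs == (w[i - 1],):
                N[i, i].add(lhs)

    # Lemma 2: fill N[i,i+m] in increasing order of m.
    for m in range(1, n):
        for i in range(1, n - m + 1):
            j = i + m
            for k in range(i, j):
                for lhs, rhs in rules:
                    if len(rhs) == 2 and rhs[0] in N[i, k] and rhs[1] in N[k + 1, j]:
                        N[i, j].add(lhs)

    # Lemma 3: w is in L(G) iff S is in N[1,n].
    return start in N[1, n]


# Hypothetical example grammar:
#   S -> AB | BC,  A -> BA | a,  B -> CC | b,  C -> AB | a
example_grammar = [
    ("S", ("A", "B")), ("S", ("B", "C")),
    ("A", ("B", "A")), ("A", ("a",)),
    ("B", ("C", "C")), ("B", ("b",)),
    ("C", ("A", "B")), ("C", ("a",)),
]
accepted = cyk(example_grammar, "S", "baaba")   # True
rejected = cyk(example_grammar, "S", "bb")      # False
```

The sketch has the same three nested loops over m, i, and k as the pseudocode, with an inner scan of the rules, so its running time is proportional to n^3 times the number of rules in the grammar.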