Computing FIRST() and FOLLOW() for a CFG

Let G = (V, Σ, S, P) be a context-free grammar (CFG). V is the set of nonterminal symbols, Σ is the set of terminal symbols, S ∈ V is the "start" symbol, and P ⊆ V × (V ∪ Σ)* is the set of productions. (A production (A, α) ∈ P is usually expressed in the form A ⟶ α.) For the sake of brevity, we use Γ = V ∪ Σ to refer to the set containing all terminal and nonterminal symbols of G. Also, we use Σ' to refer to Σ ∪ {$}. The $ symbol, which is assumed not to be a member of Γ, serves as an end-of-input-string marker.

This document describes algorithms by which to compute FIRST(Z) for all Z ∈ Γ and FOLLOW(A) for all A ∈ V. (We are not concerned with the follow sets of terminal symbols.)

Recall the following definition:

FIRST'(Z) = { t ∈ Σ | Z ⟹* tα for some α ∈ Γ* }

That is, if t ∈ Σ appears as the first (i.e., leftmost) symbol in a string derivable from Z, then t ∈ FIRST'(Z). (Of course, if Z is a terminal symbol, then FIRST(Z) = {Z}.)

In the case that Z ⟹* λ (i.e., Z is "nullable", which obviously implies Z ∈ V), FIRST(Z) = FIRST'(Z) ∪ {λ}. If Z is not nullable, then FIRST(Z) = FIRST'(Z).

As for FOLLOW():

FOLLOW(A) = { t ∈ Σ' | S$ ⟹* αAtβ for some α ∈ Γ* and β ∈ (Γ ∪ {$})* }

That is, if t ∈ Σ' appears immediately after A in a string derivable from S$, then t ∈ FOLLOW(A).

Having the values of FIRST(A) and FOLLOW(A) for every nonterminal symbol A in grammar G is vital in devising a one-symbol-lookahead stack machine that accepts L(G) or, what is really the same thing, devising a parse table for G that guides the standard top-down parsing algorithm. (Each cell in a parse table identifies which (if any) production(s) of the grammar are viable candidates to be applied next, as a function of the next input symbol and the nonterminal currently on the top of the stack. A "viable" production is one that, if applied next, could possibly lead to a successful parse of the input string. Whether or not it can lead to a successful parse depends upon the suffix of the input string that follows the next input symbol.)

For a CFG to qualify as being LL(1), there can be at most one viable production for each pair (A, b) ∈ V×Σ'.

Step 1: Identify the nullable nonterminals.

q := empty queue
N := ∅  //  N is the set of variables known to be nullable
do for each λ-production A ⟶ λ
|  if A ∉ N
|  |  N := N ∪ {A}
|  |  q.enqueue(A)
|  fi
od

// At this point, all nonterminals that produce λ in one step
// (i.e., via the application of a single λ-production) are in N 
// and on the queue.  What follows is a loop to identify the 
// nonterminals from which λ is derivable by a sequence of 
// two or more applications of productions.

do while !q.isEmpty()
|  B := q.dequeue();
|  do for each production A ⟶ αBβ // in which B appears on RHS
|  |  if A ∉ N ∧ every symbol X in αβ satisfies X ∈ N
|  |  |  N := N ∪ {A}
|  |  |  q.enqueue(A)
|  |  fi
|  od
od
// At this point, N contains precisely the nullable nonterminals
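
As a rough illustration, Step 1 might be rendered in Python as follows. The representation of productions as (left-hand side, right-hand side tuple) pairs, the name nullable_nonterminals, and the sample grammar are all assumptions made for this sketch; they are not part of the algorithm above.

```python
from collections import deque


def nullable_nonterminals(productions):
    """Return the set of nonterminals A such that A =>* lambda.

    productions: list of (lhs, rhs) pairs, where rhs is a tuple of symbols
    and the empty tuple () stands for a lambda-production.
    """
    nullable = set()
    queue = deque()

    # Seed with the nonterminals that have a lambda-production.
    for lhs, rhs in productions:
        if not rhs and lhs not in nullable:
            nullable.add(lhs)
            queue.append(lhs)

    # Each newly found nullable B may make the left-hand side of a
    # production mentioning B nullable as well.
    while queue:
        b = queue.popleft()
        for lhs, rhs in productions:
            if b in rhs and lhs not in nullable and all(x in nullable for x in rhs):
                nullable.add(lhs)
                queue.append(lhs)
    return nullable


# Hypothetical grammar: S -> A B | c,   A -> lambda | a,   B -> A A | b
GRAMMAR = [
    ("S", ("A", "B")), ("S", ("c",)),
    ("A", ()), ("A", ("a",)),
    ("B", ("A", "A")), ("B", ("b",)),
]

print(sorted(nullable_nonterminals(GRAMMAR)))  # ['A', 'B', 'S']
```

In this grammar only A is nullable in one step; B and S are picked up by the second loop.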


Step 2: Calculation of FIRST()

As noted above, a nonterminal symbol A is said to be nullable iff A ⟹* λ. Generalizing that, a string α = X1X2···Xk ∈ Γ* is nullable iff X1X2···Xk ⟹* λ. (This condition is equivalent to each Xi, 1≤i≤k, itself being nullable. A special case of this occurs when α is λ, which corresponds to the case in which k=0.)

The algorithm below exploits the fact that, if A ⟶ X1X2···Xm is a production, 1≤k≤m, and X1X2···Xk-1 is nullable, then FIRST'(Xk) ⊆ FIRST'(A). To confirm that, suppose that Xk ⟹* tβ for some t ∈ Σ and β ∈ Γ*, so that t ∈ FIRST'(Xk). Then

A ⟹ X1X2···Xm ⟹* XkXk+1···Xm ⟹* tβXk+1···Xm

demonstrating that t ∈ FIRST'(A), too.

In the algorithm, variable first() is used as a proxy for the mathematical function FIRST(), the intent being that, upon completion of execution, first(A) = FIRST(A) for every A ∈ V.

// Let N be the set of nullable nonterminal symbols,
// as computed by the algorithm described above.
N := { A ∈ V  |  A ⟹* λ }

do for each t ∈ Σ
|  first(t) := {t}
od

do for each A ∈ V
|  first(A) := ∅
od

updateOccurred := true
do while updateOccurred
|  updateOccurred := false
|  do for each non-λ production A ⟶ X1X2···Xm
|  |  do for each k in [1..m] such that X1···Xk-1 is nullable
|  |  |  if first(Xk) - first(A) ≠ ∅ then
|  |  |  |  first(A) := first(A) ∪ first(Xk)
|  |  |  |  updateOccurred := true
|  |  |  fi   
|  |  od
|  od
od

// At this point, first(Z) = FIRST'(Z) for every Z ∈ Γ
do for each A ∈ N
|  first(A) := first(A) ∪ {λ}
od

// Now, first(Z) = FIRST(Z) for every Z ∈ Γ

The algorithm above describes a rather brute-force approach to computing FIRST(). An algorithm that would be more efficient (with respect to running time) appears in the appendix.
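
The brute-force procedure of Step 2 could be sketched in Python roughly as follows. The data representation, the name first_sets, the LAMBDA marker, and the example grammar (a familiar textbook expression grammar) are assumptions of this sketch. The nullable set is supplied by hand here; in practice it would come from Step 1.

```python
LAMBDA = "λ"  # marker for the empty string; assumed not to be a grammar symbol


def first_sets(productions, nonterminals, terminals, nullable):
    """Compute FIRST for every symbol by iterating until nothing changes."""
    first = {t: {t} for t in terminals}
    first.update({a: set() for a in nonterminals})

    update_occurred = True
    while update_occurred:
        update_occurred = False
        for lhs, rhs in productions:
            for x in rhs:
                # Reaching x means every symbol of rhs before it is nullable.
                if not first[x] <= first[lhs]:
                    first[lhs] |= first[x]
                    update_occurred = True
                if x not in nullable:
                    break

    # So far the table holds FIRST'; add lambda for the nullable nonterminals.
    for a in nullable:
        first[a].add(LAMBDA)
    return first


# Hypothetical example: E -> T E',  E' -> + T E' | lambda,
#                       T -> F T',  T' -> * F T' | lambda,  F -> ( E ) | id
PRODUCTIONS = [
    ("E", ("T", "E'")),
    ("E'", ("+", "T", "E'")), ("E'", ()),
    ("T", ("F", "T'")),
    ("T'", ("*", "F", "T'")), ("T'", ()),
    ("F", ("(", "E", ")")), ("F", ("id",)),
]
NONTERMINALS = {"E", "E'", "T", "T'", "F"}
TERMINALS = {"+", "*", "(", ")", "id"}
NULLABLE = {"E'", "T'"}  # supplied by hand for this example

FIRST = first_sets(PRODUCTIONS, NONTERMINALS, TERMINALS, NULLABLE)
print(sorted(FIRST["E"]))   # ['(', 'id']
print(sorted(FIRST["E'"]))  # ['+', 'λ']
```

The early break in the inner loop plays the role of the "such that X1···Xk-1 is nullable" condition in the pseudocode.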


Step 3: Calculation of FOLLOW()

Above (and below, in the appendix) is shown an algorithm to compute FIRST, which maps each symbol in Γ to a subset of Σ ∪ {λ}. For the purposes of computing FOLLOW(), it is convenient to have extended versions of the functions FIRST' and FIRST whose domains include not just Γ (individual symbols) but also strings of length two or more. The definitions of the extended functions are

FIRST'*(Z1Z2··Zr) = { t ∈ Σ | Z1Z2··Zr ⟹* tα for some α ∈ Γ* }

If every Zi (1≤i≤r) is nullable, then FIRST*(Z1Z2··Zr) = FIRST'*(Z1Z2··Zr) ∪ {λ}; otherwise, FIRST*(Z1Z2··Zr) = FIRST'*(Z1Z2··Zr).

Here is a method for computing FIRST*:

function first*(Z1··Zr) :
   // Let N be the set of nullable nonterminal symbols,
   // as computed by the algorithm described above.
   result := FIRST'(Z1)
   j := 1
   do while (j < r  ∧  Zj ∈ N)
   |  result := result ∪ FIRST'(Zj+1)
   |  j := j+1
   od
   if j = r  ∧  Zr ∈ N then
   |  result := result ∪ {λ}
   fi
   return result
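
A possible Python counterpart of first*() is sketched below. It assumes a precomputed table first that maps each symbol to its FIRST set (including the LAMBDA marker for nullable nonterminals), so FIRST' is obtained by removing LAMBDA. The name first_of_string and the small hand-written table are assumptions made for illustration.

```python
LAMBDA = "λ"  # marker for the empty string; assumed not to be a grammar symbol


def first_of_string(symbols, first, nullable):
    """FIRST* of a sequence of grammar symbols Z1..Zr.

    The pseudocode assumes r >= 1; for an empty sequence this sketch
    returns {LAMBDA}, matching the convention that lambda is nullable.
    """
    result = set()
    for z in symbols:
        result |= first[z] - {LAMBDA}   # contribute FIRST'(z)
        if z not in nullable:
            return result               # later symbols cannot come first
    result.add(LAMBDA)                  # every symbol was nullable
    return result


# Hand-written FIRST entries for a few symbols of a hypothetical grammar.
FIRST = {
    "T'": {"*", LAMBDA},
    "E'": {"+", LAMBDA},
    ")": {")"},
}
NULLABLE = {"T'", "E'"}

print(sorted(first_of_string(("T'", "E'"), FIRST, NULLABLE)))  # ['*', '+', 'λ']
print(sorted(first_of_string(("T'", ")"), FIRST, NULLABLE)))   # [')', '*']
```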

Making use of the first*() method, we can compute the FOLLOW() function:

// Let N be the set of nullable nonterminal symbols,
// as computed by the algorithm described above.
N := { A ∈ V  |  A ⟹* λ }

do for each A ∈ V
|  follow(A) := ∅
|  follow-1(A) := ∅
od

follow(S) := { $ }

do for each non-λ production A ⟶ X1··Xm
|  do for each k ∈ [1..m)
|  |  if Xk ∈ V
|  |  |  F := first*(Xk+1··Xm)
|  |  |  follow(Xk) := follow(Xk) ∪ (F - {λ})
|  |  |  if λ ∈ F then
|  |  |  |  follow-1(A) := follow-1(A) ∪ {Xk}
|  |  |  fi
|  |  fi
|  od
|  if Xm ∈ V
|  |  follow-1(A) := follow-1(A) ∪ {Xm}
|  fi
od
// At this point, for every B ∈ V, follow(B) includes 
// every t ∈ Σ such that there exists a production
// A ⟶ αBβ where β ⟹* tφ for some φ.
// Assuming (as we are) that every nonterminal symbol is 
// useful, this condition implies the existence of the
// derivation S ⟹* γAη ⟹ γαBβη ⟹* γαBtφη
// demonstrating that t ∈ FOLLOW(B).
// Meanwhile, for every A ∈ V, follow-1(A) includes every 
// nonterminal Xj such that for some production A ⟶ X1··Xm,
// Xj+1··Xm ⟹* λ, implying that FOLLOW(A) ⊆ FOLLOW(Xj).
// To demonstrate this, suppose that t ∈ FOLLOW(A).
// Then there is a derivation 
// S$ ⟹* αAtβ ⟹ αX1··Xmtβ ⟹* αX1··Xjtβ
// Hence, t ∈ FOLLOW(Xj), too.

// Now resolve all the FOLLOW(A) ⊆ FOLLOW(B) relationships
// indicated by follow-1:
q := empty queue
do for each A ∈ V 
|  if follow-1(A) ≠ ∅
|  |  q.enqueue(A)
|  fi
od

do while !q.isEmpty()
|  A := q.dequeue()
|  do for each B ∈ follow-1(A)
|  |   if follow(A) - follow(B) ≠ ∅ then
|  |   |  follow(B) := follow(B) ∪ follow(A)
|  |   |  if !q.inQueue(B)  then
|  |   |  |  q.enqueue(B)
|  |   |  fi
|  |   fi
|  od
od
// At this point, follow(A) = FOLLOW(A) for all A ∈ V.
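
The two phases of Step 3 might look like this in Python. This is a sketch under stated assumptions: the FIRST table and nullable set for the example grammar are written out by hand, the first*() computation is inlined as a loop over the suffix, follow_inv stands in for follow-1, and all names are hypothetical.

```python
from collections import deque

LAMBDA, END = "λ", "$"  # assumed markers for the empty string and end of input


def follow_sets(productions, nonterminals, start, first, nullable):
    follow = {a: set() for a in nonterminals}
    follow_inv = {a: set() for a in nonterminals}  # the text's follow-1
    follow[start].add(END)

    # Phase 1: direct contributions, plus the FOLLOW(A) ⊆ FOLLOW(Xk) links.
    for lhs, rhs in productions:
        for k, x in enumerate(rhs):
            if x not in nonterminals:
                continue
            suffix_nullable = True
            for z in rhs[k + 1:]:
                follow[x] |= first[z] - {LAMBDA}
                if z not in nullable:
                    suffix_nullable = False
                    break
            if suffix_nullable:
                follow_inv[lhs].add(x)

    # Phase 2: propagate along the recorded links until nothing changes.
    queue = deque(a for a in sorted(nonterminals) if follow_inv[a])
    while queue:
        a = queue.popleft()
        for b in follow_inv[a]:
            if not follow[a] <= follow[b]:
                follow[b] |= follow[a]
                if b not in queue:
                    queue.append(b)
    return follow


# Hypothetical example: E -> T E',  E' -> + T E' | lambda,
#                       T -> F T',  T' -> * F T' | lambda,  F -> ( E ) | id
PRODUCTIONS = [
    ("E", ("T", "E'")), ("E'", ("+", "T", "E'")), ("E'", ()),
    ("T", ("F", "T'")), ("T'", ("*", "F", "T'")), ("T'", ()),
    ("F", ("(", "E", ")")), ("F", ("id",)),
]
NONTERMINALS = {"E", "E'", "T", "T'", "F"}
NULLABLE = {"E'", "T'"}
FIRST = {  # written out by hand for this example
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", LAMBDA}, "T'": {"*", LAMBDA},
    "+": {"+"}, "*": {"*"}, "(": {"("}, ")": {")"}, "id": {"id"},
}

FOLLOW = follow_sets(PRODUCTIONS, NONTERMINALS, "E", FIRST, NULLABLE)
print(sorted(FOLLOW["F"]))  # ['$', ')', '*', '+']
```

Because the last symbol of a right-hand side has an empty suffix, the same loop covers the separate "if Xm ∈ V" case of the pseudocode.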


Appendix: A Better Way to Compute FIRST()

The algorithm described earlier for computing the FIRST() function aimlessly iterates through every production in G "hoping" to find one whose left-hand side's first() value should be updated to include one or more new terminal symbols. Only after an unproductive iteration through all the productions does it recognize that there is nothing more to be done.

A better algorithm, upon identifying a B ∈ V whose first() value needs to be updated, would make that update and then direct its attention to those nonterminals A ∈ V such that FIRST'(B) ⊆ FIRST'(A) by virtue of the fact that there is a production A ⟶ αBβ where α is nullable. (If first(A) does not include all the members of first(B), then first(A) needs to absorb all those members!) The algorithm below does this.

// Let N be the set of nullable nonterminal symbols,
// as computed by the algorithm described above.
N := { A ∈ V  |  A ⟹* λ }

do for each t ∈ Σ
|  first(t) := {t}
|  first-1(t) := ∅
od

do for each A ∈ V
|  first(A) := ∅
|  first-1(A) := ∅
od

do for each non-λ production A ⟶ X1X2···Xm
|  do for each k in [1..m] such that X1X2···Xk-1 is nullable
|  |  first-1(Xk) := first-1(Xk) ∪ {A}
|  od
od
// At this point, first-1(X) = { A ∈ V  |  A ⟶ αXβ for some nullable α }.
// Significance: for each A ∈ first-1(X), FIRST'(X) ⊆ FIRST'(A)

do for each t ∈ Σ
|  do for each A ∈ first-1(t)
|  |  first(A) := first(A) ∪ {t}
|  od
od
// At this point, for every A ∈ V, first(A) includes every t ∈ Σ
// such that A ⟶ αtβ is a production and α is nullable.

q := empty queue
do for each B ∈ V 
|  if first-1(B) ≠ ∅
|  |  q.enqueue(B)
|  fi
od
// At this point, the queue includes every B ∈ V for which there 
// exists some A ∈ V such that FIRST'(B) ⊆ FIRST'(A) by virtue
// of there being a production A ⟶ αBβ, where α is nullable.

do while !q.isEmpty()
|  B := q.dequeue()
|  do for each A ∈ first-1(B)
|  |  if first(B) - first(A) ≠ ∅
|  |  |  first(A) := first(A) ∪ first(B)
|  |  |  if first-1(A) ≠ ∅  ∧  !q.inQueue(A)
|  |  |  |  q.enqueue(A)
|  |  |  fi
|  |  fi
|  od
od

// At this point, first(A) = FIRST'(A) for every A ∈ V
do for each A ∈ N
|  first(A) := first(A) ∪ {λ}
od

// Now, first(A) = FIRST(A) for every A ∈ V
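
For comparison with the brute-force version, the worklist approach of this appendix could be sketched in Python as follows. The names (first_sets_worklist, first_inv for first-1, LAMBDA), the data representation, and the example grammar are assumptions of the sketch, and the nullable set is again supplied by hand.

```python
from collections import deque

LAMBDA = "λ"  # marker for the empty string; assumed not to be a grammar symbol


def first_sets_worklist(productions, nonterminals, terminals, nullable):
    first = {t: {t} for t in terminals}
    first.update({a: set() for a in nonterminals})
    first_inv = {x: set() for x in first}  # the text's first-1

    # first_inv[X] = left-hand sides A of productions A -> αXβ, α nullable.
    for lhs, rhs in productions:
        for x in rhs:
            first_inv[x].add(lhs)
            if x not in nullable:
                break

    # Terminals that can begin a right-hand side contribute directly.
    for t in terminals:
        for a in first_inv[t]:
            first[a].add(t)

    # Propagate from each nonterminal B to the A's that must absorb first(B).
    queue = deque(b for b in sorted(nonterminals) if first_inv[b])
    while queue:
        b = queue.popleft()
        for a in first_inv[b]:
            if not first[b] <= first[a]:
                first[a] |= first[b]
                if first_inv[a] and a not in queue:
                    queue.append(a)

    for a in nullable:
        first[a].add(LAMBDA)
    return first


# Hypothetical example: E -> T E',  E' -> + T E' | lambda,
#                       T -> F T',  T' -> * F T' | lambda,  F -> ( E ) | id
PRODUCTIONS = [
    ("E", ("T", "E'")), ("E'", ("+", "T", "E'")), ("E'", ()),
    ("T", ("F", "T'")), ("T'", ("*", "F", "T'")), ("T'", ()),
    ("F", ("(", "E", ")")), ("F", ("id",)),
]
FIRST = first_sets_worklist(
    PRODUCTIONS,
    nonterminals={"E", "E'", "T", "T'", "F"},
    terminals={"+", "*", "(", ")", "id"},
    nullable={"E'", "T'"},  # supplied by hand for this example
)
print(sorted(FIRST["E"]))   # ['(', 'id']
print(sorted(FIRST["T'"]))  # ['*', 'λ']
```

On this grammar the queue starts with only F and T, and each is processed once, whereas the brute-force loop makes several full passes over the productions.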