Computing FIRST() and FOLLOW() for a CFG

Let G = (V, Σ, S, P) be a context-free grammar (CFG). V is the set of nonterminal symbols, Σ is the set of terminal symbols, S ∈ V is the "start" symbol, and P ⊆ V × (V ∪ Σ)^* is the set of productions. (A production (A, α) ∈ P is usually expressed in the form A ⟶ α.) For the sake of brevity, we use Γ = V ∪ Σ to refer to the set containing all terminal and nonterminal symbols of G. Also, we use Σ' to refer to Σ ∪ {$}. The $ symbol, which is assumed not to be a member of Γ, serves as an end-of-input-string marker.
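As a concrete illustration of this notation, here is one possible way to write a grammar down in Python. The representation (a list of (lhs, rhs) pairs, with the empty tuple standing for a λ-production) and all of the names are assumptions made for this sketch; the example grammar is the familiar expression grammar, chosen only because it is small.

```python
# Productions P as (lhs, rhs) pairs; an empty tuple encodes A -> lambda.
# This encoding is an assumption of the sketch, not part of the definitions.
PRODUCTIONS = [
    ("E", ("T", "E'")),
    ("E'", ("+", "T", "E'")),
    ("E'", ()),
    ("T", ("F", "T'")),
    ("T'", ("*", "F", "T'")),
    ("T'", ()),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
START = "E"  # S
END = "$"    # end-of-input marker, not a member of Gamma

# V: every symbol that appears on some left-hand side.
NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS}
# Gamma: every symbol mentioned anywhere in the grammar.
SYMBOLS = NONTERMINALS | {x for _, rhs in PRODUCTIONS for x in rhs}
# Sigma: the symbols that are not nonterminals.
TERMINALS = SYMBOLS - NONTERMINALS

assert START in NONTERMINALS and END not in SYMBOLS
print(sorted(TERMINALS))  # ['(', ')', '*', '+', 'id']
```

Deriving V from the left-hand sides presumes that every nonterminal has at least one production.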

This document describes algorithms by which to compute FIRST(Z) for all Z ∈ Γ and FOLLOW(A) for all A ∈ V. (We are not concerned with the follow sets of terminal symbols.)

Recall the following definition:

FIRST'(Z) = { t ∈ Σ | Z ⟹^* tα for some α ∈ Γ^* }

That is, if t ∈ Σ appears as the first (i.e., leftmost) symbol in a string derivable from Z, then t ∈ FIRST'(Z). (Of course, if Z is a terminal symbol, then FIRST(Z) = {Z}.)

In the case that Z ⟹^* λ (i.e., Z is "nullable", which obviously implies Z ∈ V), FIRST(Z) = FIRST'(Z) ∪ {λ}. If Z is not nullable, then FIRST(Z) = FIRST'(Z).

As for FOLLOW():

FOLLOW(A) = { t ∈ Σ' | S$ ⟹^* αAtβ for some α ∈ Γ^* and β ∈ (Γ ∪ {$})^* }

That is, if t ∈ Σ' appears immediately after A in a string derivable from S$, then t ∈ FOLLOW(A).

Having the values of FIRST(A) and FOLLOW(A) for every nonterminal symbol A in grammar G is vital in devising a one-symbol-lookahead stack machine that accepts L(G) or, what is really the same thing, devising a parse table for G that guides the standard top-down parsing algorithm. Each cell in a parse table identifies which production(s) of the grammar, if any, are viable candidates to be applied next, as a function of the next input symbol and the nonterminal currently on top of the stack. (A "viable" production is one that, if applied next, could possibly lead to a successful parse of the input string. Whether or not it can lead to a successful parse depends upon the suffix of the input string that follows the next input symbol.)

For a CFG to qualify as being LL(1), there can be at most one viable production for each pair (A, b) ∈ V×Σ'.

Step 1: Identify the nullable nonterminals.

q := empty queue
N := ∅      // N is the set of variables known to be nullable
do for each λ-production A ⟶ λ
|  if A ∉ N then
|  |  N := N ∪ {A}
|  |  q.enqueue(A)
|  fi
od
// At this point, all nonterminals that produce λ in one step
// (i.e., via the application of a single λ-production) are in N
// and on the queue. What follows is a loop to identify the
// nonterminals from which λ is derivable by a sequence of
// two or more applications of productions.
do while !q.isEmpty()
|  B := q.dequeue()
|  do for each production A ⟶ αBβ   // in which B appears on RHS
|  |  if A ∉ N ∧ every symbol X in αβ satisfies X ∈ N then
|  |  |  N := N ∪ {A}
|  |  |  q.enqueue(A)
|  |  fi
|  od
od
// At this point, N contains precisely the nullable nonterminals
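The queue-driven nullable computation can be sketched in Python as follows. The function name and the grammar encoding (a list of (lhs, rhs) pairs, with the empty tuple as a λ-production) are assumptions of this sketch rather than anything fixed by the pseudocode.

```python
from collections import deque

def nullable_nonterminals(productions):
    """Return the set of nonterminals A such that A =>* lambda.

    productions: list of (lhs, rhs) pairs, where rhs is a tuple of
    symbols and the empty tuple stands for a lambda-production.
    """
    # Nonterminals that yield lambda in a single step.
    nullable = {lhs for lhs, rhs in productions if not rhs}
    queue = deque(nullable)
    while queue:
        b = queue.popleft()
        # Revisit only the productions in which b occurs on the RHS.
        for lhs, rhs in productions:
            if b in rhs and lhs not in nullable \
                    and all(x in nullable for x in rhs):
                nullable.add(lhs)
                queue.append(lhs)
    return nullable

# Hypothetical example grammar: S -> A B | a,  A -> lambda,  B -> A A
EXAMPLE = [
    ("S", ("A", "B")),
    ("S", ("a",)),
    ("A", ()),
    ("B", ("A", "A")),
]
print(sorted(nullable_nonterminals(EXAMPLE)))  # ['A', 'B', 'S']
```

Here A is found directly, B becomes nullable once A is dequeued, and S becomes nullable once B is dequeued. Testing every symbol of the right-hand side (rather than only αβ) is equivalent, because the dequeued symbol is already known to be nullable.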

Step 2: Calculation of FIRST()

As noted above, a nonterminal symbol A is said to be nullable iff A ⟹^* λ. Generalizing that, a string α = X₁X₂···X_k ∈ Γ^* is nullable iff X₁X₂···X_k ⟹^* λ. (This condition is equivalent to each X_i, 1≤i≤k, itself being nullable. A special case of this occurs when α is λ, which corresponds to the case in which k=0.)

The algorithm below exploits the fact that, if A ⟶ X₁X₂···X_m is a production, 1≤k≤m, and X₁X₂···X_{k-1} is nullable, then FIRST'(X_k) ⊆ FIRST'(A). To confirm that, suppose that X_k ⟹^* tβ for some t ∈ Σ and β ∈ Γ^*, so that t ∈ FIRST'(X_k). Then

A ⟹ X₁X₂···X_m ⟹^* X_k X_{k+1}···X_m ⟹^* tβ X_{k+1}···X_m

demonstrating that t ∈ FIRST'(A), too.

In the algorithm, variable first() is used as a proxy for the mathematical function FIRST(), the intent being that, upon completion of execution, first(A) = FIRST(A) for every A ∈ V.

// Let N be the set of nullable nonterminal symbols,
// as computed by the algorithm described above.
N := { A ∈ V | A ⟹^* λ }
do for each t ∈ Σ
|  first(t) := {t}
od
do for each A ∈ V
|  first(A) := ∅
od
updateOccurred := true
do while updateOccurred
|  updateOccurred := false
|  do for each non-λ production A ⟶ X₁X₂···X_m
|  |  do for each k in [1..m] such that X₁···X_{k-1} is nullable
|  |  |  if first(X_k) - first(A) ≠ ∅ then
|  |  |  |  first(A) := first(A) ∪ first(X_k)
|  |  |  |  updateOccurred := true
|  |  |  fi
|  |  od
|  od
od
// At this point, first(Z) = FIRST'(Z) for every Z ∈ Γ
do for each A ∈ N
|  first(A) := first(A) ∪ {λ}
od
// Now, first(Z) = FIRST(Z) for every Z ∈ Γ
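A minimal Python sketch of this fixed-point iteration is given below. It assumes the set of nullable nonterminals has already been computed (it is simply written out by hand here), and the names first_sets, EPS, GRAMMAR, TERMINALS and NULLABLE are all hypothetical.

```python
EPS = "λ"  # marker for the empty string inside FIRST sets

def first_sets(productions, terminals, nullable):
    """Iterate over all productions until no first() set changes."""
    nonterminals = {lhs for lhs, _ in productions}
    first = {t: {t} for t in terminals}
    first.update({a: set() for a in nonterminals})
    update_occurred = True
    while update_occurred:
        update_occurred = False
        for lhs, rhs in productions:
            # Visit X_k only while the prefix X_1..X_{k-1} is nullable.
            for x in rhs:
                if not first[x] <= first[lhs]:
                    first[lhs] |= first[x]
                    update_occurred = True
                if x not in nullable:
                    break
    # Up to here first(Z) = FIRST'(Z); now add lambda where appropriate.
    for a in nullable:
        first[a].add(EPS)
    return first

# Example: the usual expression grammar (empty tuple = lambda-production).
GRAMMAR = [
    ("E", ("T", "E'")),
    ("E'", ("+", "T", "E'")), ("E'", ()),
    ("T", ("F", "T'")),
    ("T'", ("*", "F", "T'")), ("T'", ()),
    ("F", ("(", "E", ")")), ("F", ("id",)),
]
TERMINALS = {"+", "*", "(", ")", "id"}
NULLABLE = {"E'", "T'"}  # assumed to have been computed in Step 1

FIRST = first_sets(GRAMMAR, TERMINALS, NULLABLE)
for symbol in ("E", "E'", "T", "T'", "F"):
    print(symbol, sorted(FIRST[symbol]))
```

For this grammar, E, T and F all end up with the set {(, id}, while E' and T' get {+, λ} and {*, λ} respectively.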

The algorithm above describes a rather brute-force approach to computing FIRST(). An algorithm that would be more efficient (with respect to running time) appears in the appendix.

Step 3: Calculation of FOLLOW()

Above (and below, in the appendix) is shown an algorithm to compute FIRST : Γ ⟶ 𝒫(Σ ∪ {λ}), which maps each symbol to a subset of Σ ∪ {λ}. For the purposes of computing FOLLOW(), it is convenient to have extended versions of the functions FIRST' and FIRST whose domains include not just Γ (individual symbols) but also strings of length two or more. The definitions of the extended functions are

FIRST'^*(Z₁Z₂··Z_r) = { t ∈ Σ | Z₁Z₂··Z_r ⟹^* tα for some α }

If every Z_i (1≤i≤r) is nullable, then FIRST^*(Z₁Z₂··Z_r) = FIRST'^*(Z₁Z₂··Z_r) ∪ {λ}. Otherwise, FIRST^*(Z₁Z₂··Z_r) = FIRST'^*(Z₁Z₂··Z_r).

Here is a method for computing FIRST^*:

function first^*(Z₁··Z_r) :
// Let N be the set of nullable nonterminal symbols,
// as computed by the algorithm described above.
result := FIRST'(Z₁)
j := 1
do while (j < r ∧ Z_j ∈ N)
|  result := result ∪ FIRST'(Z_{j+1})
|  j := j+1
od
if j = r ∧ Z_r ∈ N then
|  result := result ∪ {λ}
fi
return result
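The same computation might look like this in Python. This is only a sketch: the function name first_of_string is hypothetical, the FIRST table is written out by hand for the usual expression grammar, and, unlike the pseudocode (which assumes r ≥ 1), the empty string is treated as yielding {λ}.

```python
EPS = "λ"  # marker for the empty string inside FIRST sets

def first_of_string(symbols, first, nullable):
    """FIRST of a string of grammar symbols, given FIRST of each symbol."""
    result = set()
    for z in symbols:
        result |= first[z] - {EPS}   # contribute FIRST'(z)
        if z not in nullable:
            return result            # later symbols cannot contribute
    result.add(EPS)                  # every symbol was nullable
    return result

# Hand-written FIRST table for the expression grammar
#   E -> T E',  E' -> + T E' | λ,  T -> F T',  T' -> * F T' | λ,
#   F -> ( E ) | id
FIRST = {
    "+": {"+"}, "*": {"*"}, "(": {"("}, ")": {")"}, "id": {"id"},
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}
NULLABLE = {"E'", "T'"}

print(sorted(first_of_string(("E'", ")"), FIRST, NULLABLE)))   # [')', '+']
print(sorted(first_of_string(("T'", "E'"), FIRST, NULLABLE)))  # ['*', '+', 'λ']
```

The first call stops at ")" because it is not nullable, so λ is not included. In the second call both symbols are nullable, so λ is added at the end.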

Making use of the first^*() method, we can compute the FOLLOW() function:

// Let N be the set of nullable nonterminal symbols,
// as computed by the algorithm described above.
N := { A ∈ V | A ⟹^* λ }
do for each A ∈ V
|  follow(A) := ∅
|  follow^-1(A) := ∅
od
follow(S) := { $ }
do for each non-λ production A ⟶ X₁··X_m
|  do for each k ∈ [1..m)
|  |  if X_k ∈ V then
|  |  |  F := first^*(X_{k+1}··X_m)
|  |  |  follow(X_k) := follow(X_k) ∪ (F - {λ})
|  |  |  if λ ∈ F then
|  |  |  |  follow^-1(A) := follow^-1(A) ∪ {X_k}
|  |  |  fi
|  |  fi
|  od
|  if X_m ∈ V then
|  |  follow^-1(A) := follow^-1(A) ∪ {X_m}
|  fi
od
// At this point, for every B ∈ V, follow(B) includes
// every t ∈ Σ such that there exists a production
// A ⟶ αBβ where β ⟹^* tφ for some φ.
// Assuming (as we are) that every nonterminal symbol is
// useful, this condition implies the existence of the
// derivation S ⟹^* γAη ⟹ γαBβη ⟹^* γαBtφη
// demonstrating that t ∈ FOLLOW(B).
// Meanwhile, for every A ∈ V, follow^-1(A) includes every
// nonterminal X_j such that for some production A ⟶ X₁··X_m,
// X_{j+1}··X_m ⟹^* λ, implying that FOLLOW(A) ⊆ FOLLOW(X_j).
// To demonstrate this, suppose that t ∈ FOLLOW(A).
// Then there is a derivation
// S$ ⟹^* αAtβ ⟹ αX₁··X_m tβ ⟹^* αX₁··X_j tβ
// Hence, t ∈ FOLLOW(X_j), too.

// Now resolve all the FOLLOW(A) ⊆ FOLLOW(B) relationships
// indicated by follow^-1:
q := empty queue
do for each A ∈ V
|  if follow^-1(A) ≠ ∅ then
|  |  q.enqueue(A)
|  fi
od
do while !q.isEmpty()
|  A := q.dequeue()
|  do for each B ∈ follow^-1(A)
|  |  if follow(A) - follow(B) ≠ ∅ then
|  |  |  follow(B) := follow(B) ∪ follow(A)
|  |  |  if !q.inQueue(B) then
|  |  |  |  q.enqueue(B)
|  |  |  fi
|  |  fi
|  od
od
// At this point, follow(A) = FOLLOW(A) for all A ∈ V.
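The FOLLOW() computation can be sketched in Python along the same lines. Everything named here (follow_sets, first_of_string, inherits, the grammar encoding) is an assumption of the sketch; in particular, inherits plays the role of follow^-1, and the FIRST table is supplied by hand rather than computed. Because the helper returns {λ} for an empty string, the last symbol of a right-hand side needs no separate case.

```python
from collections import deque

EPS, END = "λ", "$"

def first_of_string(symbols, first):
    """FIRST of a symbol string; the empty string yields {λ}."""
    result = set()
    for z in symbols:
        result |= first[z] - {EPS}
        if EPS not in first[z]:
            return result
    result.add(EPS)
    return result

def follow_sets(productions, start, first):
    nonterminals = {lhs for lhs, _ in productions}
    follow = {a: set() for a in nonterminals}
    # inherits[A] holds every B with FOLLOW(A) ⊆ FOLLOW(B).
    inherits = {a: set() for a in nonterminals}
    follow[start].add(END)
    for lhs, rhs in productions:
        for k, x in enumerate(rhs):
            if x in nonterminals:
                f = first_of_string(rhs[k + 1:], first)
                follow[x] |= f - {EPS}
                if EPS in f:          # the rest of the RHS is nullable
                    inherits[lhs].add(x)
    # Propagate along the recorded subset relationships.
    queue = deque(a for a in nonterminals if inherits[a])
    while queue:
        a = queue.popleft()
        for b in inherits[a]:
            if not follow[a] <= follow[b]:
                follow[b] |= follow[a]
                if b not in queue:
                    queue.append(b)
    return follow

# Expression grammar; the empty tuple encodes a lambda-production.
GRAMMAR = [
    ("E", ("T", "E'")),
    ("E'", ("+", "T", "E'")), ("E'", ()),
    ("T", ("F", "T'")),
    ("T'", ("*", "F", "T'")), ("T'", ()),
    ("F", ("(", "E", ")")), ("F", ("id",)),
]
# FIRST table written out by hand for this grammar.
FIRST = {
    "+": {"+"}, "*": {"*"}, "(": {"("}, ")": {")"}, "id": {"id"},
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}
FOLLOW = follow_sets(GRAMMAR, "E", FIRST)
for symbol in ("E", "E'", "T", "T'", "F"):
    print(symbol, sorted(FOLLOW[symbol]))
```

For this grammar the first pass gives follow(E) = {$, )}, follow(T) = {+} and follow(F) = {*}; the propagation phase then copies follow(E) into E' and T, and follow(T) into T' and F.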

Appendix: A better way to compute FIRST()

The algorithm described earlier for computing the FIRST() function aimlessly iterates through every production in G "hoping" to find one whose left-hand side's first() value should be updated to include one or more new terminal symbols. Only after an unproductive iteration through all the productions does it recognize that there is nothing more to be done.

A better algorithm, upon identifying a B ∈ V whose first() value needs to be updated, would make that update and then direct its attention to those nonterminals A ∈ V such that FIRST'(B) ⊆ FIRST'(A) by virtue of the fact that there is a production A ⟶ αBβ where α is nullable. (If first(A) does not include all the members of first(B), then first(A) needs to absorb all those members!) The algorithm below does this.

// Let N be the set of nullable nonterminal symbols,
// as computed by the algorithm described above.
N := { A ∈ V | A ⟹^* λ }
do for each t ∈ Σ
|  first(t) := {t}
|  first^-1(t) := ∅
od
do for each A ∈ V
|  first(A) := ∅
|  first^-1(A) := ∅
od
do for each non-λ production A ⟶ X₁X₂···X_m
|  do for each k in [1..m] such that X₁X₂···X_{k-1} is nullable
|  |  first^-1(X_k) := first^-1(X_k) ∪ {A}
|  od
od
// At this point, first^-1(X) = { A ∈ V | A ⟶ αXβ for some nullable α }.
// Significance: for each A ∈ first^-1(X), FIRST'(X) ⊆ FIRST'(A)
do for each t ∈ Σ
|  do for each A ∈ first^-1(t)
|  |  first(A) := first(A) ∪ {t}
|  od
od
// At this point, for every A ∈ V, first(A) includes every t ∈ Σ
// such that A ⟶ αtβ is a production and α is nullable.
q := empty queue
do for each B ∈ V
|  if first^-1(B) ≠ ∅ then
|  |  q.enqueue(B)
|  fi
od
// At this point, the queue includes every B ∈ V for which there
// exists some A ∈ V such that FIRST'(B) ⊆ FIRST'(A) by virtue
// of there being a production A ⟶ αBβ, where α is nullable.
do while !q.isEmpty()
|  B := q.dequeue()
|  do for each A ∈ first^-1(B)
|  |  if first(B) - first(A) ≠ ∅ then
|  |  |  first(A) := first(A) ∪ first(B)
|  |  |  if first^-1(A) ≠ ∅ ∧ !q.inQueue(A) then
|  |  |  |  q.enqueue(A)
|  |  |  fi
|  |  fi
|  od
od
// At this point, first(A) = FIRST'(A) for every A ∈ V
do for each A ∈ N
|  first(A) := first(A) ∪ {λ}
od
// Now, first(A) = FIRST(A) for every A ∈ V
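A Python sketch of this queue-driven version follows. As before, the names (first_sets_worklist, dependents, which plays the role of first^-1) and the grammar encoding are assumptions made for illustration, and the nullable set is supplied by hand.

```python
from collections import deque

EPS = "λ"  # marker for the empty string inside FIRST sets

def first_sets_worklist(productions, terminals, nullable):
    nonterminals = {lhs for lhs, _ in productions}
    first = {t: {t} for t in terminals}
    first.update({a: set() for a in nonterminals})
    # dependents[X] = { A | A -> alpha X beta with alpha nullable }
    dependents = {z: set() for z in first}
    for lhs, rhs in productions:
        for x in rhs:
            dependents[x].add(lhs)
            if x not in nullable:
                break
    # Seed each nonterminal with the terminals it can start with directly.
    for t in terminals:
        for a in dependents[t]:
            first[a].add(t)
    # Propagate FIRST'(B) into FIRST'(A) only where B actually feeds A.
    queue = deque(b for b in nonterminals if dependents[b])
    while queue:
        b = queue.popleft()
        for a in dependents[b]:
            if not first[b] <= first[a]:
                first[a] |= first[b]
                if dependents[a] and a not in queue:
                    queue.append(a)
    for a in nullable:
        first[a].add(EPS)
    return first

# Expression grammar; the empty tuple encodes a lambda-production.
GRAMMAR = [
    ("E", ("T", "E'")),
    ("E'", ("+", "T", "E'")), ("E'", ()),
    ("T", ("F", "T'")),
    ("T'", ("*", "F", "T'")), ("T'", ()),
    ("F", ("(", "E", ")")), ("F", ("id",)),
]
TERMINALS = {"+", "*", "(", ")", "id"}
NULLABLE = {"E'", "T'"}  # assumed to have been computed in Step 1

FIRST = first_sets_worklist(GRAMMAR, TERMINALS, NULLABLE)
for symbol in ("E", "E'", "T", "T'", "F"):
    print(symbol, sorted(FIRST[symbol]))
```

In this grammar only T and F have dependents, so only they ever enter the queue; the update to first(T) caused by F is what triggers T being revisited so that first(E) receives the same terminals.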