SE 507
Notes on Bertrand Meyer's "On Formalism in Specifications", from the January 1985 issue of IEEE Software (pp. 6-26)

Note that the full text of this paper is available at the IEEE Computer Society Digital Library, a hyperlink to which can be found on the U of Scranton Library's list of databases.

Overview

Specification is the phase of the software lifecycle concerned with precise definition of the tasks to be performed by the system. (Notice on page 7 the figure depicting Royce's waterfall model of the software life cycle, the phases of which are requirements, specification, design ("global" and then "detailed"), implementation, validation, distribution, and operation.)

Although SE textbooks emphasize its importance, in practice the specification phase is often overlooked, being confused with either the preceding phase, definition of system objectives (during which a natural-language requirements document is produced), or the following phase, design.

In the former case, the requirements document is deemed sufficient to proceed to system design without further specification activity.

Meyer's paper emphasizes the drawbacks of such an informal approach and attempts to show the usefulness of formal specifications as a complement to, not a replacement for, natural-language requirements. It also attempts to show how a formal specification can be used to improve and clarify the natural-language descriptions of requirements.

Seven Sins of the Specifier

In natural-language requirements, one can find recurring patterns/classes of deficiencies, or "sins". Among the most common and damaging are

  1. Noise: Presence in the text of an element that fails to carry information relevant to any feature of the problem. Variants include redundancy and remorse.
  2. Silence: Existence of a feature of the problem that is not covered by any element of the text.
  3. Overspecification: Presence in the text of an element that corresponds not to a feature of the problem but to features of a possible solution.
  4. Contradiction: Presence in the text of two or more elements that define a feature of the system in incompatible ways.
  5. Ambiguity: Presence in the text of an element making it possible to interpret a feature of the problem in at least two different ways.
  6. Forward Reference: Presence in the text of an element that uses features of the problem not defined until later in the text.
  7. Wishful Thinking: Presence in the text of an element that defines a feature of the problem in such a way that a candidate solution cannot realistically be validated with respect to this feature.

This classification is interesting for at least two reasons:

Question: Suppose that writers of natural-language requirements were to stop committing such sins and write only requirements documents of very high quality. Would this solve the problem?

Answer: Meyer thinks not! In his view, a natural-language description of any significant system, even a description of good quality, exhibits deficiencies making it unacceptable for rigorous software development.

Illustration of a Particular Requirements Document

To illustrate the point, Meyer chooses a very simple text-formatting problem that was described (in natural language) in a 1969 paper by Peter Naur ("Programming by Action Clusters", BIT, Vol. 9, No. 3, 1969, pp. 250-258). The main point of Naur's paper was to present an algorithm that solves the problem and to prove the algorithm's correctness.

Naur's description was as follows:

Given a text consisting of words separated by BLANK or NEWLINE (newline) characters, convert it to a line-by-line form in accordance with the following rules:
  1. line breaks must be made only where the given text has BLANK or NEWLINE;
  2. each line is filled as far as possible, as long as
  3. no line will contain more than MAXPOS characters.

Goodenough and Gerhart (henceforth, G&G) subsequently wrote two papers about program testing that addressed Naur's problem, criticizing not only his description but also his (very flawed) solution.

The historical backdrop is that G&G were defending the notion of testing as a useful technique, in opposition to those (e.g., Dijkstra) who put more emphasis on proving the correctness of programs. (Dijkstra famously said in his 1972 ACM Turing Award lecture, "Testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence.")

G&G not only found several deficiencies in Naur's problem description, but also found that his solution had major flaws, including that it would terminate only if the input data were invalid (in a particular way)! This demonstrated that even a program that had been "proved correct" could be incorrect! (This is not a contradiction; rather, it reminds us that proofs can contain errors, too.)

G&G offered an improved (but much longer) description of the problem in their first paper ("Toward a Theory of Test Data Selection", by J.B. Goodenough and S. Gerhart, IEEE Transactions on SE, Vol. SE-1, No. 2, June 1975, pp. 156-173). In their second paper ("Toward a Theory of Testing: Data Selection Criteria", in Current Trends in Programming Methodology, Vol. 2, edited by R.T. Yeh, Prentice-Hall, 1977, pp. 44-79), they acknowledge that their improvement of Naur's description still left something to be desired, so they gave yet another one, which appears in Figure 2, page 11, of Meyer's paper.

Analysis of Goodenough's & Gerhart's Specification

Meyer's first observation is that G&G's specification is four times as long as Naur's (resulting, no doubt, from their efforts to "leave no stone unturned" and to eliminate all ambiguity), and seems inappropriately lengthy for such a simple problem.

Meyer then goes on to point out several examples of where G&G commit six of the seven "sins" described earlier.

Noise: Noise isn't always bad; sometimes it can play the same role in a specification as comments do in a program. But often, noise elements obscure the text in that a reader, upon first encountering such an element, thinks it brings new information, but upon closer examination realizes that it only repeats known information in a new way.

Remorse: This is a variant of noise in which a term is used, but qualified in order to restrict or modify its earlier definition, as though the author suddenly regretted the initial definition.

Silence: Often a specifier will fail to address some vital features of the problem, or address them inadequately.

Contradictions: These arise when elements of the text define a feature in incompatible ways, so that the text admits incompatible interpretations.

Overspecification: The reader is told too much about a possible solution. Programmers, understandably, tend to make this mistake in writing requirements documents.

Ambiguities:

Forward References:

So what??

If a description of such a simple problem, written with great care, still turns out this badly, imagine how much more difficult it is to give a good description of a complicated problem, possibly one that puts lives and/or property at stake, like nuclear reactor control, or missile guidance, or even payroll.

In Meyer's opinion, the situation can be improved significantly by a reasoned use of more formal specifications, which would serve as a complement to (but not a replacement for) natural-language documents. Indeed, one can often use a formalized specification (and the insights that arose during its development) to formulate a better natural language version.


Elements for a formal specification

Most languages/notations for expressing specifications formally are based upon well-known mathematical concepts such as sets, functions, relations, and sequences. So, rather than choose any particular formal specification language (e.g., Z, B, Larch), Meyer uses traditional mathematical notation (for sets, functions, etc.) to develop a formal specification for Naur's text formatting problem.

There are essentially three aspects to solving this problem:

  1. reducing each break (in the input text) to a break of length one (in the output text)
  2. ensuring that no "line" (in the output text) exceeds MAXPOS characters
  3. filling each "line" (in the output text) as much as possible

It will simplify matters to think of these three semi-independently.

Note: Meyer seems to be guilty here of using language that is suggestive of a method of solution, when he should be describing (only) the desired relationship between input text and output text, as well as any additional conditions that the output text must meet. End of note.

As for the first item, (informally) define the binary relation short_breaks ⊆ seq[CHAR] × seq[CHAR] by

short_breaks ::= { (x,y) | y can be obtained from x by removing break characters until each break has length one }

Recall that a break, within a sequence of characters (i.e., a value of type seq[CHAR]), is a maximal (contiguous) subsequence of characters in the set BREAK_CHAR = { BLANK, NEWLINE }.
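The definition of a break can be illustrated with a short Python sketch. It assumes that texts are modeled as Python strings, that BLANK is " " and NEWLINE is "\n", and it uses 0-based, end-exclusive index pairs; the function name breaks is made up for this illustration and does not come from Meyer's paper.

```python
BLANK, NEWLINE = " ", "\n"          # assumed concrete break characters
BREAK_CHAR = {BLANK, NEWLINE}

def breaks(text):
    """Return (start, end) pairs (0-based, end exclusive), one per break.

    A break is a maximal run of consecutive break characters.
    """
    result = []
    i = 0
    while i < len(text):
        if text[i] in BREAK_CHAR:
            j = i
            # Extend the run as far as it goes, which makes it maximal.
            while j < len(text) and text[j] in BREAK_CHAR:
                j += 1
            result.append((i, j))
            i = j
        else:
            i += 1
    return result

print(breaks("ab  \ncd e"))   # one break of length three, one of length one
```

Because each run is extended as far as possible before it is recorded, no reported break is ever adjacent to another break character.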

As for the second item, (informally) define the binary relation limited_length ⊆ seq[CHAR] × seq[CHAR] by

limited_length ::= { (x,y) | no "line" in y exceeds length MAXPOS   ∧   y can be obtained from x by replacing zero or more occurrences of NEWLINE by BLANK and zero or more occurrences of BLANK by NEWLINE }

If we take the relation product/composition limited_length º short_breaks, we get the set

{ (x,z) | z can be obtained from x by replacing each break in x by a break of length one   ∧   no "line" in z exceeds length MAXPOS }

Note that the first conjunct says not only that each break in z is of length one but also that the sequence of "words" (those contiguous subsequences of characters appearing between breaks!) in z corresponds to those in x. And the latter is the relationship that we want between input text and output text!

Note: Different authors use different notations for the relation product/composition operation. Here I have used notation consistent with Meyer. However, Gries & Schneider would have written it with the two operands in the opposite order. (See Chapter 14 of their book.) End of note.
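To make the operand order concrete, here is a minimal Python sketch that models a relation as a set of pairs. The function name compose and the sample relations are assumptions made for illustration only; the operand order follows the convention used in these notes, where the right operand is applied first.

```python
def compose(second, first):
    """Relation product in the order used in these notes.

    compose(second, first) contains (x, z) whenever some y exists with
    (x, y) in first and (y, z) in second.
    """
    return {(x, z) for (x, y1) in first for (y2, z) in second if y1 == y2}

# Hypothetical sample relations, chosen only to show the mechanics.
first = {(1, "a"), (2, "b")}
second = {("a", "X"), ("a", "Y"), ("c", "Z")}
print(compose(second, first))   # pairs reachable from 1 via "a"
```

Under the opposite convention the same set would be written with the two operands swapped.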

Note: Had we strengthened the definition of short_breaks to say that y is obtained by replacing every break in x by a single occurrence of BLANK, then we could have simplified the description of limited_length by omitting the part that allows NEWLINEs to be replaced by BLANKs. End of note.

For a relation R⊆A×B and x∈A, define R.x = { y | (x,y)∈R }.

Note: Using a slightly generalized definition of relation composition (from that usually found in textbooks), what we are here calling R.x is just R º {x}. End of note.
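As a small illustration of R.x, the following Python sketch again models a relation as a set of pairs. The name image and the sample relation R are hypothetical and serve only to show the idea.

```python
def image(relation, x):
    """R.x: the set of all y such that (x, y) belongs to the relation."""
    return {y for (a, y) in relation if a == x}

# Hypothetical sample relation.
R = {(1, "a"), (1, "b"), (2, "c")}
print(image(R, 1))   # everything related to 1
print(image(R, 3))   # empty, since 3 is related to nothing
```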

Let's give the name ll_sb to the relation limited_length º short_breaks. Then, for x∈seq[CHAR],

ll_sb.x = { z | (x,z) ∈ ll_sb }

is the set of texts that can be obtained by replacing breaks in x by breaks of length one and ensuring that no "line" in the resulting text has a length exceeding MAXPOS.

Now consider the function FEWEST_LINES: P(seq[CHAR]) → P(seq[CHAR]), which, given a set X of texts, yields the subset of X containing precisely those texts having the fewest number of "lines". (The number of lines in a text could be defined as being one more than the number of occurrences of NEWLINE in that text.) Note that P(A) denotes the set of all subsets of A. Hence, a function having domain P(A) is one that "expects" to be given an argument that is a set whose members are elements of A, and a function having P(A) as its range is one that, when applied to an element of its domain, yields as a result a set whose members are elements of A.
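A minimal Python sketch of FEWEST_LINES follows, assuming texts are Python strings, NEWLINE is "\n", and a set of texts is a Python set. The helper name line_count is an assumption; it uses the "one more than the number of NEWLINEs" definition suggested above.

```python
NEWLINE = "\n"   # assumed concrete newline character

def line_count(text):
    """Number of lines: one more than the number of NEWLINE occurrences."""
    return text.count(NEWLINE) + 1

def fewest_lines(texts):
    """Subset of texts whose members have the smallest line count."""
    if not texts:
        return set()          # nothing to choose from
    best = min(line_count(t) for t in texts)
    return {t for t in texts if line_count(t) == best}

sample = {"a b\nc", "a\nb\nc", "a\nb c"}
print(fewest_lines(sample))   # the two-line texts survive
```

Note that the result is a set: several texts can tie for the fewest lines.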

Then FEWEST_LINES(ll_sb.x) is the set of texts that

  1. can be obtained by replacing breaks in x by breaks of length one,
  2. have no lines exceeding length MAXPOS, and
  3. have the fewest number of lines among all texts satisfying the previous two conditions.

According to Meyer, for any input text x, any member of FEWEST_LINES(ll_sb.x) is an acceptable output text. (Do you agree? If not, where is the flaw in Meyer's thinking?)

That is, the relation goal ⊆ seq[CHAR] × seq[CHAR] that contains precisely those pairs (x,y) such that y is an acceptable output for input x is

goal ::= { (x,y) | y ∈ FEWEST_LINES(ll_sb.x) }

What we have done so far is to give a semi-formal specification of the problem.


A More Formal Specification

Sequences and Subsequences

Meyer's formal specification relies very much upon the concepts of sequence and subsequence, which he defines on page 19 in a manner similar to how they are defined in the specification language Z.

Definition: A sequence S of elements of type A having length n is a function with domain 1..n (the set of natural numbers {1,2,...,n}) and range A. For example, the sequence

S = < spock, kirk, gorn, spock, uhura, mccoy >

is a function with domain 1..6 and range

star_trek_character = { kirk, spock, mccoy, uhura, gorn, sulu, chekov, ... }

such that S(1) = spock, S(2) = kirk, ..., S(6) = mccoy.
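The view of a sequence as a function on 1..n can be mimicked in Python with a dictionary keyed by 1-based positions. This is only a sketch; the helper name as_function is an assumption, not notation from Meyer or Z.

```python
def as_function(items):
    """Model a sequence as a mapping from 1..n to its elements."""
    return {i: item for i, item in enumerate(items, start=1)}

S = as_function(["spock", "kirk", "gorn", "spock", "uhura", "mccoy"])

print(sorted(S))      # the domain 1..6
print(S[1], S[6])     # S(1) and S(6)
```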

The set of sequences of elements of type A is denoted by seq[A]. Our problem is with respect to an input text and an output text, each of type seq[CHAR].

Informally, if x,y ∈ seq[A] and y can be obtained by "erasing" zero or more elements from x, we say that y is a subsequence of x. For example,

T = < kirk, gorn, spock, mccoy >

is a subsequence of S that was obtained by erasing the first and fifth elements of S.

Formally, y is a subsequence of x if there exists an increasing function f : 1..m → 1..n (i.e., a sequence of natural numbers!), where length(y) = m and length(x) = n, such that, for all i in 1..m, y(i) = x(f(i)). That is,

isSubsequenceOf(y,x) ::= (∃f : 1..length(y) → 1..length(x) |: (∀i | 1<i≤length(y) : f(i-1) < f(i)) ∧ (∀i | 0<i≤length(y) : y(i) = x(f(i))))

In our S/T example, such a function is f(1) = 2, f(2) = 3, f(3) = 4, f(4) = 6.
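The existential definition can be checked mechanically. The sketch below searches for a witness function f greedily, always matching each element of y at the earliest usable position in x; the names find_embedding and is_subsequence_of are assumptions, and the greedy strategy is one convenient way to find a witness, not something prescribed by the paper.

```python
def find_embedding(y, x):
    """Return an increasing list f of 1-based indices with y(i) = x(f(i)),
    or None if y is not a subsequence of x."""
    f = []
    pos = 0                         # 0-based position in x to search from
    for item in y:
        while pos < len(x) and x[pos] != item:
            pos += 1
        if pos == len(x):
            return None             # ran out of x before matching item
        f.append(pos + 1)           # convert to the notes' 1-based indexing
        pos += 1
    return f

def is_subsequence_of(y, x):
    return find_embedding(y, x) is not None

S = ["spock", "kirk", "gorn", "spock", "uhura", "mccoy"]
T = ["kirk", "gorn", "spock", "mccoy"]
print(find_embedding(T, S))
print(is_subsequence_of(["gorn", "kirk"], S))
```

If any witness exists, the greedy choice finds one, because matching an element earlier never rules out later matches.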

Following Meyer's approach, define the (family of) function(s) SUBSEQUENCE : seq[A] → P(seq[A]) by

SUBSEQUENCE(x) ::= { y |: isSubsequenceOf(y,x) }

Defining short_breaks Formally

Define SINGLE_BREAKS : seq[CHAR] → P(seq[CHAR]) as follows:

SINGLE_BREAKS(x) ::= { y ∈ SUBSEQUENCE(x) | (∀i | 1<i≤length(y) : y(i-1) ∈ BREAK_CHAR  ⇒  y(i) ∉ BREAK_CHAR) }

In other words, y ∈ SINGLE_BREAKS(x) iff y is a subsequence of x having breaks of length one.

The trouble with SINGLE_BREAKS(x) is that it includes subsequences of x obtained by removing not just "extra" break characters, but also non-break characters or entire breaks (so that adjacent words have joined to become one word)! What we really want (in order to realize the short_breaks relation) are all members of SINGLE_BREAKS(x) having maximum length. (Such texts must have been obtained without erasing any non-break characters and without erasing entire breaks.)
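The trouble just described can be observed on a tiny input. The sketch below enumerates SUBSEQUENCE(x) by brute force (exponential in the length of x, so suitable only for very short strings) and filters it down to SINGLE_BREAKS(x). It assumes texts are Python strings with BLANK as " " and NEWLINE as "\n"; the function names are made up for this illustration.

```python
from itertools import combinations

BLANK, NEWLINE = " ", "\n"          # assumed concrete break characters
BREAK_CHAR = {BLANK, NEWLINE}

def subsequences(x):
    """SUBSEQUENCE(x): all strings obtained by erasing zero or more characters."""
    return {
        "".join(x[i] for i in keep)
        for r in range(len(x) + 1)
        for keep in combinations(range(len(x)), r)
    }

def has_single_breaks(y):
    """True iff no two adjacent characters of y are both break characters."""
    return not any(a in BREAK_CHAR and b in BREAK_CHAR for a, b in zip(y, y[1:]))

def single_breaks(x):
    return {y for y in subsequences(x) if has_single_breaks(y)}

candidates = single_breaks("a  b")
print(sorted(candidates))   # includes "a b", but also "ab", "a", ""
```

For the input "a  b" the set contains the intended "a b" alongside unwanted members such as "ab", where the two words have been joined.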

To achieve this, Meyer uses the function MAX_SET, which takes as arguments a set A of texts and a function f (where f maps texts to numbers) and yields that subset of A containing precisely those members having maximum value when f is applied to them. That is,

MAX_SET(A,f) ::= { x∈A | f(x) = (max z | z∈A : f(z)) }

This gives rise to the definition

COMPACTED(x) ::= MAX_SET( SINGLE_BREAKS(x), length)

which yields the set containing all subsequences of x obtained by erasing from x only "extra" break characters (so that no break is erased entirely and no non-break characters are erased at all).
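A small Python sketch of MAX_SET follows. To keep it self-contained, the set SINGLE_BREAKS("a  b") is written out by hand rather than computed; the names max_set and single_breaks_of_x are assumptions for illustration, and len plays the role of the length function.

```python
def max_set(items, f):
    """Members of items on which f attains its maximum value."""
    items = set(items)
    if not items:
        return set()
    best = max(f(x) for x in items)
    return {x for x in items if f(x) == best}

# SINGLE_BREAKS("a  b"), listed by hand for the two-blank input "a  b".
single_breaks_of_x = {"", "a", " ", "b", "a ", " b", "ab", "a b"}

compacted = max_set(single_breaks_of_x, len)
print(compacted)   # only the longest candidate survives
```

Taking the maximum length discards every candidate that lost a non-break character or an entire break, since each such loss shortens the text.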

Now we can define the relation short_breaks:

short_breaks(x,y) ::= y ∈ COMPACTED(x)

Defining limited_length Formally

Define EQUIVALENT ⊆ seq[CHAR] × seq[CHAR] by

EQUIVALENT ::= { (u,v) | length(u) = length(v) ∧ (∀i | 1≤i≤length(u) : u(i) ≠ v(i)  ⇒  u(i)∈BREAK_CHAR ∧ v(i)∈BREAK_CHAR) }

This says that (u,v) ∈ EQUIVALENT holds iff u and v have the same length and are identical, except that at a position where one has a break character (BLANK or NEWLINE), the other may have the other break character.
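As a sketch, EQUIVALENT can be expressed as a Python predicate on two strings, assuming BLANK is " " and NEWLINE is "\n". The function name equivalent is an assumption.

```python
BLANK, NEWLINE = " ", "\n"          # assumed concrete break characters
BREAK_CHAR = {BLANK, NEWLINE}

def equivalent(u, v):
    """True iff u and v have equal length and differ only at positions
    where both have a break character."""
    return len(u) == len(v) and all(
        a == b or (a in BREAK_CHAR and b in BREAK_CHAR)
        for a, b in zip(u, v)
    )

print(equivalent("a b\nc", "a\nb c"))   # same words, different break characters
print(equivalent("a b", "a-b"))         # differs at a non-break position
```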

Define noLinesLongerThan : seq[CHAR] × Z → BOOLEAN informally by

noLinesLongerThan(u,k) ::= true if every substring of u of length k+1 includes at least one occurrence of NEWLINE, false otherwise.

It is left to the reader to provide a formal definition.
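While the formal definition remains an exercise, the informal one can be tried out in Python. The sketch below implements the "every substring of length k+1" wording directly and compares it with a line-splitting check; it assumes k is nonnegative, texts are Python strings, and NEWLINE is "\n". Both function names are assumptions.

```python
NEWLINE = "\n"   # assumed concrete newline character

def no_lines_longer_than(u, k):
    """True iff every substring of u of length k+1 contains a NEWLINE.

    Assumes k >= 0. If u has at most k characters there are no such
    substrings, so the result is True.
    """
    return all(NEWLINE in u[i:i + k + 1] for i in range(len(u) - k))

def no_lines_longer_than_by_split(u, k):
    """Alternative check: split at NEWLINE and measure each piece."""
    return all(len(line) <= k for line in u.split(NEWLINE))

print(no_lines_longer_than("abc\nde", 3))
print(no_lines_longer_than("abc\nde", 2))
```

The two functions are intended to agree for nonnegative k, which can serve as a sanity check on any formal definition one writes down.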

Now we can formally define limited_length:

limited_length ::= { (u,v) ∈ EQUIVALENT | noLinesLongerThan(v,MAXPOS) }

What mostly remains is to give a formal definition of FEWEST_LINES:

FEWEST_LINES(A) ::= MIN_SET(A, #new_lines)

where MIN_SET is defined analogously to MAX_SET (with min in place of max) and

#new_lines(u) ::= (#i | 1≤i≤length(u) : u(i) = NEWLINE)
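Putting the pieces together, the following Python sketch computes FEWEST_LINES(ll_sb.x) by brute force for small inputs. It rests on several assumptions: texts are Python strings, BLANK is " " and NEWLINE is "\n", the default MAXPOS value is purely illustrative, and instead of enumerating COMPACTED(x) it takes the shortcut of collapsing each break to a single BLANK (one member of COMPACTED(x), whose EQUIVALENT texts are the same as those of every other member). All function names are made up for this sketch.

```python
import re
from itertools import product

BLANK, NEWLINE = " ", "\n"   # assumed concrete break characters
MAXPOS = 5                   # illustrative value, not taken from the paper

def compact(x):
    """One member of COMPACTED(x): each break collapsed to a single BLANK."""
    return re.sub(r"[ \n]+", BLANK, x)

def equivalents(y):
    """All texts EQUIVALENT to y: each break character is BLANK or NEWLINE."""
    choices = [(BLANK, NEWLINE) if c in (BLANK, NEWLINE) else (c,) for c in y]
    return {"".join(chars) for chars in product(*choices)}

def no_lines_longer_than(u, k):
    return all(len(line) <= k for line in u.split(NEWLINE))

def goal(x, maxpos=MAXPOS):
    """FEWEST_LINES(ll_sb.x), computed by exhaustive enumeration."""
    ll_sb_x = {z for z in equivalents(compact(x)) if no_lines_longer_than(z, maxpos)}
    if not ll_sb_x:
        return set()         # some word is longer than maxpos
    fewest = min(z.count(NEWLINE) for z in ll_sb_x)
    return {z for z in ll_sb_x if z.count(NEWLINE) == fewest}

for text in sorted(goal("aa  bb \n cc")):
    print(repr(text))
```

The enumeration doubles with every break, so this is useful only for experimenting with short inputs, for example when examining which texts the specification accepts.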

Improved Natural Language Specification

Using the insights gained from developing the formal specification, Meyer offers an improved informal specification in Figure 5 (modified slightly here by the instructor):

Given are a nonnegative integer MAXPOS and a character set including two "break" characters, BLANK and NEWLINE. A substring is defined to be a contiguous subsequence of a sequence of characters. A break is defined to be a maximal substring containing only break characters. (Here, "maximal" means that any character preceding or following the substring is a non-break character.)

The program shall accept as input a finite sequence of characters and produce as output a sequence of characters satisfying the following conditions: