A relational database is composed of two-dimensional tables. (A table can also be called a relation, although relational "purists" would argue that there is a subtle distinction between the two.) Each "row" of a table is called a tuple. Each tuple is composed of fields, one for each attribute of the table. (The attributes are the concepts, or the names thereof, that we associate with the fields/columns.)
Here is an example of a table in which each tuple describes a student in a hypothetical high school or 4-year college.
Student ID Name Sex Class +------+----------+-----+-------+ | 2 | Mary | F | 2 | +------+----------+-----+-------+ | 6 | John | M | 4 | +------+----------+-----+-------+ | 9 | Carol | F | 3 | +------+----------+-----+-------+ | 7 | Mary | F | 4 | +------+----------+-----+-------+ | 1 | Ann | F | 1 | +------+----------+-----+-------+ | 13 | Mike | M | 2 | +------+----------+-----+-------+ | 8 | Helen | F | 4 | +------+----------+-----+-------+ | 5 | Jim | M | 1 | +------+----------+-----+-------+ | 16 | Mike | M | 3 | +------+----------+-----+-------+ |
Following the usual convention, we have written the name of the table, Student, immediately to the left of the attributes, which serve as column headings. As the reader has probably surmised, we have chosen to encode the values corresponding to male and female, respectively, by the characters M and F. Also, the values 1, 2, 3, and 4 occurring in the Class attribute correspond to freshman, sophomore, junior, and senior, respectively.
The reader should keep in mind that the order in which we list the tuples occurring in a table is irrelevant. That is, a table is viewed as being a set of tuples, not a sequence of tuples.
One of the central functions of a DBMS is to answer questions (about the data in the database) posed to it by users. (Indeed, there would be no reason to store the data in the first place if there were no desire to retrieve bits and pieces of it later for the purpose of answering such questions.) Such a question is referred to as a query. For example, with respect to the table above, a user might ask What are the names of the female students? Or, to state it in the form of a command: List the names of the female students. More examples: List the names of female freshmen; List the name and sex of each senior. In the relational database model, the answers to queries such as these are themselves tables (although such tables do not become a "permanent" part of the database). For example, the answer to the last query mentioned would be:
Name Sex +----------+-----+ | John | M | List the name and sex +----------+-----+ of each senior. | Mary | F | +----------+-----+ | Helen | F | +----------+-----+ |
Notice that this table is obtained from the Student table by first eliminating all but those tuples corresponding to seniors (i.e., tuples whose Class attribute has value 4) and then eliminating all but the Name and Sex attributes. (You will see shortly that the elimination of tuples and attributes correspond to two of the fundamental operations used for answering queries.)
Because computers' abilities to "understand" natural language (such as English) is very limited (at least to this point in time), a user must, in specifying a query, employ a more precise and formal notation. Most relational DBMS's support the use of some variant of SQL (Standard Query Language) for specifying queries. Here we shall use a slightly lower-level language, relational algebra, for that purpose. Conceptually, one can view the first phase of query processing —in which a DBMS translates an SQL query submitted to it into a program that produces the answer to that query— as having as its purpose to translate the SQL query into an equivalent relational algebra query.
We shall present two different forms of syntax for relational algebra (RA) queries. One is the traditional syntax, which has a mathematical flavor. The other has a flavor more similar to programming languages, or even SQL. The latter we will refer to as sugared relational algebra (SRA).
Queries in relational algebra are based upon the use of three elementary operations on tables: project, restrict, and join. (The "restrict" operation is usually called "select", but here we use the terminology of C.J. Date (prolific author on the subject of the relational model), in part because the SELECT verb in SQL has an entirely different meaning.)
Because the result of applying an operation is itself a table, we can compose operations in sequence. The obvious analogy is with functions mapping reals to reals, where the function (f o g), read "f of g", is defined by (f o g)(x) = f(g(x)). Indeed, the operators in relational algebra are functions, with tables as both domain and range.
Applying project to a table yields a copy of that table, but (possibly) with some of its attributes (i.e., columns) excluded. (Presumably, the excluded attributes are not of interest.) SRA's syntax for an application of the project operation is as follows:
In RA, this is written
For example, the SRA expression
Name Class +--------+-----+ | Mary | 2 | +--------+-----+ | John | 4 | PROJECT Name, Class +--------+-----+ FROM Student | Carol | 3 | +--------+-----+ | Mary | 4 | +--------+-----+ | Ann | 1 | +--------+-----+ | Helen | 4 | +--------+-----+ | Jim | 1 | +--------+-----+ | Mike | 3 | +--------+-----+ |
The query PROJECT Name, Class FROM Student, then, is nothing but a more formal way of stating the (English) query
(Note that Elmasri & Navathe (and almost everyone else, with the exception
of C.J. Date) calls this select, but we refrain from doing so
because the meaning of that term in SQL is quite different.)
Applying restrict to a table yields a copy of that table, but
(possibly) with some of its tuples (i.e., rows) excluded, namely those
tuples that fail to satisfy a specified condition.
(Presumably, any tuple failing to satisfy the condition is not of interest
to the user.)
In keeping with the syntactic style used for project, an application
of the restrict operation will be written in SRA in the following
way:
The condition is simply a boolean expression to be evaluated with respect
to a tuple. (Hence, any sensible condition will mention one or more
attributes of the table.)
For example, the expression
ID Name Sex Class +------+----------+-----+-------+ | 6 | John | M | 4 | +------+----------+-----+-------+ | 13 | Mike | M | 2 | RESTRICT Student +------+----------+-----+-------+ WHERE Sex = 'M' | 5 | Jim | M | 1 | +------+----------+-----+-------+ | 16 | Mike | M | 3 | +------+----------+-----+-------+ |
By using boolean (or logical, if you prefer)
operators (such as AND, OR, and NOT)
in the condition, we can express more complicated queries.
For example, if we wanted to form a table like Student,
except listing only students who are either (a) sophomore and male or
(b) female, we could write
With only the project and restrict operations, we have
the ability to describe many non-trivial queries. For example, suppose
that we wanted a list containing the name and sex of every senior.
From the examples above, it should be clear that the expression
In RA, this would be
This example illustrates that one query can be "nested" inside another. Indeed, as the RA syntax suggests, nesting one query inside another is nothing more than function composition, as alluded to earlier.
As another example, suppose we wanted a list containing the name and sex of every student who is either a male sophomore or else a senior. An appropriate SRA query is
PROJECT Name, Sex FROM (RESTRICT Student WHERE (Sex = 'M' AND Class = 2) OR (Class = 4) ) |
As queries get more complicated, requiring the use of two, three, or more nested applications of project and/or restrict, they become difficult to understand. Hence, we allow "temporary" tables to be given names. For example, the query above describing the list of names and sexes of students who are either male sophomores or seniors (of either sex) can be written as
(1) Temp ← RESTRICT Student WHERE (Sex = 'M' AND Class = 2) OR (Class = 4) (2) Result ← PROJECT Name, Sex FROM Temp
Here, in line (1) we specify that the table obtained from doing the restrict is to be named (rather unimaginatively) Temp, and then we apply project to that table in line (2) in order to obtain the desired result, to which we give the name Result!
In any but the most trivial relational databases there will be two, three, or possibly many more tables. In such a setting, to answer most interesting queries will require the use of two or more tables. This is where the join operation becomes useful.
Suppose that, in addition to Student (as illustrated above), our database also includes a table whose purpose is to keep a record of which students are currently enrolled in which courses. (Such a table would arise from there being a many-to-many relationship type involving student and course entities.) Let us call this the Enrolled-In table; for the purposes of doing examples, suppose that its current contents are as follows:
Enrolled-In StuID CourseID +-----+------------+ | 2 | CIL 102 | +-----+------------+ | 2 | MATH 2 | +-----+------------+ | 1 | HIST 7 | +-----+------------+ | 7 | MATH 2 | +-----+------------+ | 1 | ENGL 4 | +-----+------------+ | 1 | MATH 2 | +-----+------------+ | 13 | HIST 7 | +-----+------------+ | 5 | HIST 7 | +-----+------------+ | 5 | CIL 102 | +-----+------------+ |
Note justifying the addition of such a table: (This can be skipped by the uninterested reader.) It is true that the information represented by the Enrolled-In table could be embedded within the Student table, assuming that we added an attribute for CourseID to that table. However, to do so would result in the duplication of much data, because, for each course a particular student was enrolled in, her ID, Name, Sex, and Class values would have to be repeated. For example, the fact that Mary was enrolled in both CIL 102 and MATH 2 would require that both of these rows/tuples appear in our expanded Student table:
ID Name Sex Class Enrolled-In +------+----------+-----+-------+-------------+ | 2 | Mary | F | 2 | CIL 102 | | 2 | Mary | F | 2 | MATH 2 | +------+----------+-----+-------+-------------+ |
Duplication of data is undesirable for at least two reasons. One is that it takes more storage space than necessary. Another is that it makes the problem of maintaining "data integrity/consistency" much more difficult. Suppose, for example, that Mary moves on to Class 3. Then it becomes necessary to modify every tuple storing data regarding Mary. If, for some reason, some tuples are modified but others are not, we have an inconsistency in the data.
To avoid the need to maintain multiple tuples representing Mary, you might suggest that the Enrolled-In field be of type set of string, rather than just string. That way, the value { "CIL 102", "MATH 2" } could be assigned to the Enrolled-In field of the tuple representing Mary, indicating that the two courses in which Mary is enrolled are CIL 102 and MATH 2
This is a reasonable suggestion; however, for technical reasons it has
traditionally been a rule of the relational model that all attributes
must be viewed as being atomic, meaning, for example, that if an attribute
has a value that is a set, its individual members cannot be
extracted/identified by any operation of the relational model.
In the case of Mary's courses, this would not be acceptable, because it
would prevent us, for example, from using relational operations
that could help us answer a query such as
"List the names of all students enrolled in the course with ID
C/IL 102."
End of note.
Consider the following informal query:
A somewhat more formal way of saying this is
Produce a list of all ordered pairs (x,y) satisfying the condition that x is (the name of) a student and y is (the course ID of) a course in which x is enrolled.
The correct result would be a table containing the tuples in Enrolled-In, except with each student ID replaced by the corresponding student's name. Clearly, such a table cannot be constructed from the Student and Enrolled-In tables using only the project and restrict operations. (Indeed, neither of those operations allows us to construct a table whose tuples are formed by combining tuples from two (or more) different tables. But that's exactly what we need here, because all information about student names is in the Student table whereas all information about course ID's is in the Enrolled-In table.)
Consider how you might go about constructing an answer to this query. Most likely, you would scan through the tuples of Enrolled-In; for each such tuple e you would note the value of e[StuID] (i.e., the value in its StuID field) and then scan through the tuples of Student in search of the tuple s satisfying s[ID] = e[StuID] (i.e., having a matching value in its ID field). Upon finding it, you would place into the table under construction the tuple <s[Name],e[CourseID]>.
The crucial concept here is that of combining a tuple in one table with a matching tuple in another table. (In our example, what made one tuple match another was having the same value in their ID and StuID attributes, respectively. In a different query, the matching criterion would be different.)
This capability of combining matching tuples from two different tables
is precisely what the join operation provides. Specifically,
the SRA expression
yields as its value the table obtained by combining every pair of tuples (from the two indicated tables) that satisfy the join condition. (Each atomic sub-expression within a join condition must compare an attribute in one table to an attribute in the other table.)
In RA, the join operator is a bowtie symbol, but HTML has no similar symbol. So we will use the × operator, and put the join condition as a subscript to its right. Hence, we will write a join in RA as
To illustrate this idea, consider these two tables:
Table A Table B V W X Y Z +---+---+ +---+---+---+ | r | 2 | | 5 | g | p | | t | 4 | | 4 | d | e | | p | 6 | | 2 | m | q | +---+---+ | 4 | t | f | +---+---+---+ |
The result of the expression JOIN A WITH B WHERE W = X (or, in the RA notation, A ×W=X B) is the table
V W X Y Z +---+---+---+---+---+ | r | 2 | 2 | m | q | | t | 4 | 4 | d | e | | t | 4 | 4 | t | f | +---+---+---+---+---+ |
Here we combined a tuple t in A with a tuple u in B iff t[W] = u[X]. This (i.e., comparing two attributes for equality) is the most common kind of join condition. However, it's not the only kind. Consider
The resulting table consists of all those combinations of tuples t and u from A and B, respectively, in which t[W] < u[X]:
V W X Y Z +---+---+---+---+---+ | r | 2 | 5 | g | p | | r | 2 | 4 | d | e | | r | 2 | 4 | t | f | | t | 4 | 5 | g | p | +---+---+---+---+---+ |
In case it is necessary to qualify an attribute by specifying in which table it is found (such as would be the case if the join condition involved an attribute name that occurred in both tables being joined), we can qualify it by its table name, as in JOIN A WITH B WHERE A.W = B.X (or, in the RA notation, A ×A.W=B.X B).
Returning to our problem of devising a query that asks for the names of students and the ID's of courses in which they are enrolled, our first step is to employ the join operation as follows:
JOIN Enrolled-In WITH Student WHERE StuID = IDThe resulting table would be
StuID CourseID ID Name Sex Class +-----+------------+------+----------+-----+-------+ | 2 | CIL 102 | 2 | Mary | F | 2 | +-----+------------+------+----------+-----+-------+ | 2 | MATH 2 | 2 | Mary | F | 2 | +-----+------------+------+----------+-----+-------+ | 1 | HIST 7 | 1 | Ann | F | 1 | +-----+------------+------+----------+-----+-------+ | 7 | MATH 2 | 7 | Mary | F | 4 | +-----+------------+------+----------+-----+-------+ | 1 | ENGL 4 | 1 | Ann | F | 1 | +-----+------------+------+----------+-----+-------+ | 1 | MATH 2 | 1 | Ann | F | 1 | +-----+------------+------+----------+-----+-------+ | 13 | HIST 7 | 13 | Mike | M | 2 | +-----+------------+------+----------+-----+-------+ | 5 | HIST 7 | 5 | Jim | M | 1 | +-----+------------+------+----------+-----+-------+ | 5 | CIL 102 | 5 | Jim | M | 1 | +-----+------------+------+----------+-----+-------+ |
Note that each tuple's first two columns come from a tuple in Enrolled-In and its last four columns come from a matching tuple in Student.
From the table immediately above (resulting from the application of join), we get the desired result simply by omitting every column except those corresponding to Name and CourseID. To do this, we apply project. Hence, the SRA query that we want is
PROJECT Name, CourseID FROM (JOIN Enrolled-In WITH Student WHERE StuID = ID )
Suppose that the ID attribute in the Student table had been named StuID instead, corresponding to the name of the attribute in Enrolled-In. Then the join operation performed above would have been written (in SRA)
JOIN Enrolled-In WITH Student WHERE StuID = StuID
This raises a rather sticky issue: In the WHERE clause, one occurrence of StuID refers to the attribute in Enrolled-In whereas the other refers to the same-named attribute in Student. But how can we know that? One answer is to require that, in any comparison of attributes in a join condition, the first (respectively, second) one mentioned must be from the first (resp., second) table mentioned (i.e., the one preceding (resp., following) WITH). (Since every join condition should be composed of such comparisions, possibly combined using boolean operators such as AND and OR) this rule makes sense.)
Another way to resolve the issue is to require that attribute names be qualified where necessary by prefixing them with the name of the table. Using that approach, our join operation would have been written like this:
JOIN Enrolled-In WITH Student WHERE Enrolled-In.StuID = Student.StuID
This resolves the naming issue as it pertains to forming/interpreting this particular query, but it raises a new naming issue with respect to the table that results from evaluating the query: that table will have two different attributes named StuID.
Ah, but this is easily remedied by omitting one of those attributes! After all, in each tuple, the values in the two same-named attributes will be the same. (The join condition guarantees this.)
Indeed, this situation arises so often in doing join operations that a special version of join has been defined, called the natural join, and the corresponding operator is the asterisk, *. So the result of evaluating
Does this completely resolve the issue of conflicting (or duplicate) attribute names? No!
Suppose that we wanted to join Student with itself, with the join condition stipulating that two tuples should be combined in the case that their StuID values were different but their Name attributes were the same. That is, we want to say something like
JOIN Student WITH Student WHERE StuID != StuID AND Name = Name
Even if we ignore any potential ambiguities with respect to attribute names in the query, the problem is that the resulting table must have two attributes named StuID because, in each tuple, their values will be different! (And hence we cannot simply merge the two columns into one.)
To resolve this, we introduce the notions of aliasing and (locally) renaming attributes.
Returning to the problem of finding students who are either female or else both male and in the sophomore class, we could have written the solution as follows:
Females ← RESTRICT Student WHERE Sex = 'F' Males ← RESTRICT Student WHERE Sex = 'M' Sophomores ← RESTRICT Student WHERE Class = 2 Result ← Females ∪ (Males ∩ Sophomores)
For another example, suppose that we wanted to produce the list of students who are not enrolled in CIL 102. Your first inclination might be to produce a table containing the ID's of students enrolled in a course other than CIL 102 and then to join that with Student in order to obtain the corresponding names:
(1) Non_CIL102 ← PROJECT StuID FROM (RESTRICT EnrolledIn WHERE CourseID ≠ 'CIL 102') (2) Result ← PROJECT Name FROM (JOIN Student WITH Non_CIL102 WHERE Student.ID = Non_CIL102.StuID)But this is incorrect, as it will yield the name of every student who is enrolled in some course other than CIL 102. In particular, students not enrolled in any courses will be (incorrectly) omitted and students enrolled in CIL 102 as well as at least one other course will be (incorrectly) included.
A correct approach would be to produce a table containing the ID's of all students and another table containing the ID's of students enrolled in CIL 102, and then to take the difference of the two, which will be a table including precisely the ID's of students not enrolled in CIL 102. This idea leads to the query
(1) All_ID ← PROJECT ID FROM Student (2) CIL102_ID ← PROJECT StuID AS ID FROM (RESTRICT EnrolledIn WHERE CourseID = 'CIL 102') (3) Non_CIL102 ← All_ID - CIL102_ID (4) Result ← PROJECT Name FROM (JOIN Student WITH Non_CIL102 WHERE Student.ID = Non_CIL102.ID)
Notice that in step (2) we projected on the StuID attribute but we specified (via the clause AS ID) that that attribute should be named ID in the resulting table. Having done that, we are free in step (3) to take the difference of the two specified tables because their attribute names (and domains, of course) coincide.
In order to avoid the JOIN at the end (which was simply for the purpose of recovering the names associated to the ID's in the Non_CIL102 table, we could have included the names in the intermediate results:
(1) All_ID ← PROJECT ID, Name FROM Student (2) CIL102_ID ← PROJECT StuID AS ID, Name FROM (RESTRICT EnrolledIn WHERE CourseID = 'CIL 102') (3) Non_CIL102 ← All_ID - CIL102_ID (4) Result ← PROJECT Name FROM Non_CIL102
As an example, suppose that our Student table included the attribute SAT_Score and that we wanted to find the average, minimum, and maximum values in that column. In SRA, we would write this query as
AGGREGATE AVG(SAT_Score), MIN(SAT_Score), MAX(SAT_Score) FROM StudentThe result will be a one-tuple table. Deriving its attribute names from the functions applied and the attributes to which they were applied, the table might look like this:
AVG_SAT_Score MIN_SAT_Score MAX_SAT_Score +--------------+---------------+-------------+ | 1040.75 | 913 | 1430 | +--------------+---------------+-------------+If we wanted this data only for seniors, say, we would have nested an application of RESTRICT in the FROM clause, as follows:
AGGREGATE AVG(SAT_Score), MIN(SAT_Score), MAX(SAT_Score) FROM (RESTRICT Student WHERE Class = 4)
Suppose that we want to know how many female seniors have SAT scores above 1100, and the average SAT score among those students. We could write this query like this:
AGGREGATE COUNT(*), AVG(SAT_Score) FROM (RESTRICT Student WHERE Sex = 'F' AND SAT_Score > 1100)
The result might look something like this:
COUNT* AVG_SAT_Score +--------+---------------+ | 57 | 1240.75 | +--------+---------------+
When, as above, an asterisk is specified as the argument of COUNT, it means that the resulting value should be a count of all the tuples in the "target" table (i.e., the table to which the operation is being applied). Had we used, say, COUNT(SAT_Score) instead, the resulting value would be the number of distinct values in the SAT_Score column (of the target table, which, in our example, includes only tuples corresponding to female seniors). If the target table has a key attribute, say K, then COUNT(*) and COUNT(K) are equivalent.
Suppose that we wanted to know how many male students there were, and the average of their SAT scores, and similarly for females. From the examples above, it should be obvious that we could devise two separate queries to ascertain this information, one for males and another for females. It would be nice if we could do it all with a single query. And indeed we can, because we can use a grouping feature to partition the tuples of the target table into separate groups (according to the values in one or more attributes), in which case the aggregate functions are applied to each group of tuples separately.
For example, the SRA query
AGGREGATE Sex, COUNT(*), AVG(SAT_Score) FROM Student GROUP BY Sexwould produce a table such as the following:
Sex COUNT* AVG_SAT_Score +-----+--------+---------------+ | 'F' | 145 | 1040.5 | | 'M' | 137 | 1038.2 | +-----+--------+---------------+
What the first tuple of this table says (literally) is that the Student table has 145 tuples having value 'F' in the Sex column, and the average of the SAT_Score values in these tuples is 1040.5. The second tuple's interpretation is analogous, of course.
Had we omitted the mention of the Sex attribute in the first line of the query, we would have obtained the same table, except without the Sex attribute. Hence, we would have been able to conclude that there were 145 students of one sex and 137 of the other (each having the reported SAT score averages), but we wouldn't know which was which. For this reason, we usually include the grouping attribute(s) in the result. On the other hand, it would not make sense to mention a non-grouping attribute on the first line of an AGGREGATE query, except as the argument of one of the aggregate functions. For example, the following query makes no sense, because there is no way to supply a meaningful value for Name in each tuple of the result.
AGGREGATE Name, Sex, COUNT(*), AVG(SAT_Score) FROM Student GROUP BY Sex
As suggested by the phrase "grouping attribute(s)" above, there can be more than one grouping attribute. Thus, for example, if we wanted to group students by sex and class, we could have said
AGGREGATE Sex, Class, COUNT(*), AVG(SAT_Score) FROM Student GROUP BY Sex, ClassThe result would then have looked like this:
Sex Class COUNT* AVG_SAT_Score +-----+-------+--------+---------------+ | 'F' | 1 | 37 | 1023.5 | | 'F' | 2 | 40 | 1054.1 | | 'F' | 3 | 29 | 1033.7 | | 'F' | 4 | 35 | 1019.5 | | 'M' | 1 | 40 | 1044.7 | | . | . | . | . | | . | . | . | . | +-----+-------+--------+---------------+
As for the RA notation for aggregates, here are what the (valid)
queries above (which were given in SRA notation) would look like
in RA notation:
ℑAVG(SAT_Score), MIN(SAT_Score), MAX(SAT_Score)(Student)
ℑAVG(SAT_Score), MIN(SAT_Score), MAX(SAT_Score)
(σClass=4(Student))
ℑCOUNT(*), AVG(SAT_Score)
(σSex = 'F' AND SAT_Score > 1100(Student))
SexℑSex, COUNT(*), AVG(SAT_Score)(Student)
Sex,ClassℑSex, Class, COUNT(*), AVG(SAT_Score)(Student)
1. List the sex and class of every student named "Chris".
2. List the course IDs of all courses in which a student
named "Chris" is enrolled.
3. List the names of all students enrolled in the course
CIL 102.
4. List the ID's of students enrolled in a course taught by
a faculty member whose ID is "Sarek".
5. List the name and class of any student enrolled in a course
taught by a faculty member whose ID is "Sarek".
6. List the names of all students enrolled in a course
in which a student named "Chris" is enrolled.
7. List the ID's of all courses in which there are enrolled
students from different classes. (In our example, CIL 102 would
be such a course, because both Mary (class 2) and Jim (class 1) are
enrolled in it.)
8. List the number of students enrolled in the course CIL 102.
9. For each student, list her/his name and the number of courses
in which (s)he is enrolled.
10. For each course, list its ID and the number of students enrolled
in it.
11. For each course, list its ID, the ID of the faculty member
teaching it, and the number of students enrolled in it.
12. For each course taught by Knuth, list its ID together with the
number of students who are enrolled in it.
13. Find the average # of students enrolled in each course.
14. For each faculty member, find the average # of students
enrolled in the courses (s)he is teaching.