Acknowledgement: Much of this web page is based upon Elmasri's and Navathe's text, Fundamentals of Database Systems (3rd ed.), published by Addison-Wesley, 2000.
Database systems have become an essential component of life in modern society, in that many frequently occurring events trigger the accessing of at least one database: bibliographic library searches, bank transactions, hotel/airline reservations, grocery store purchases, etc., etc.
These are all "traditional" database applications, for which the use of rigidly-structured textual and numeric data suffices. Recent advances have led to the application of database technology to a wider class of data. Examples include multimedia databases (involving pictures, video clips, and sound messages) and geographic databases (involving maps, satellite images).
Also, database search techniques are applied by some WWW search engines.
The term database is often used, too loosely, to refer to just about any collection of data. For us, a database is a collection of related data having the following properties:
To summarize: a database has some source from which data are derived, some degree of interaction with events in the represented miniworld, and an audience that is interested in using it.
Obviously, our concern will be with computerized database systems, as opposed to manual ones, such as the card catalog-based systems that were once indispensible for finding items in libraries.
As for size and complexity, databases run the range from being small and simple to being huge and complex.
A database management system (DBMS) is a general purpose software system facilitating each of the following (with respect to a database):
There are logical boundaries between users, application programs, DBMS software, and the data itself. Note that the database catalog contains "meta-data" (i.e., data about data) describing the way that data is to be organized.
The DBMS provides a conceptual or logical view of the data to application programs (and users), so that the underlying implementation may be changed without the programs being modified. This is referred to as program-data independence and is one of the benefits of information hiding (a better term for which might be implementation hiding), a design principle saying that a user/client of a virtual machine should have no way of directly accessing or exploiting the underlying details of its (the virtual machine's) representation/implementation.
Most database management systems (DBMS's) you are likely to encounter these days are based upon the relational database model, which was introduced by the late E.F. Codd in a seminal paper published in 1970. This model is based upon the mathematical concepts of relation (as you should have learned about in MATH 142/CMPS 260), set theory, and (first order) predicate logic. By the late 1980's, the relational model had become dominant, surpassing the older hierarchical and network models. Recent years have seen the development of the object-oriented database model, which is an extension of the relational model. Commercial products based upon the relational database model include Microsoft Access and products from Corel, Oracle, and FoxPro.
A relational database is comprised of two-dimensional tables. (A table can also be called a relation, although relational "purists" would argue that there is a subtle distinction between the two.) Each "row" of a table is called a tuple. Each tuple is made up of fields, one for each attribute of the table. (The attributes are the names we associate to the fields/columns.)
Here is an example of a table in which each tuple describes a student in a hypothetical high school or 4-year college.
Student ID Name Sex Class
+------+----------+-----+-------+
| 2 | Mary | F | 2 |
+------+----------+-----+-------+
| 6 | John | M | 4 |
+------+----------+-----+-------+
| 9 | Carol | F | 3 |
+------+----------+-----+-------+
| 7 | Mary | F | 4 |
+------+----------+-----+-------+
| 1 | Ann | F | 1 |
+------+----------+-----+-------+
| 13 | Mike | M | 2 |
+------+----------+-----+-------+
| 8 | Helen | F | 4 |
+------+----------+-----+-------+
| 5 | Jim | M | 1 |
+------+----------+-----+-------+
| 16 | Mike | M | 3 |
+------+----------+-----+-------+
Following the usual convention, we have written the name of the table, Student, immediately to the left of the attributes, which serve as column headings. As the reader has probably surmised, we have chosen to encode the values corresponding to male and female, respectively, by M and F. Also, the values 1, 2, 3, and 4 occurring in the Class attribute correspond to freshman, sophomore, junior, and senior, respectively.
The reader should keep in mind that the order in which we list the tuples occurring in a table is irrelevant. That is, a table is viewed as being a set of tuples, not a sequence of tuples.
One of the central functions of a DBMS is to answer questions (about the data in the database) posed to it by users. (Indeed, there would be no reason to store the data in the first place if there were no desire to retrieve bits and pieces of it later for the purpose of answering such questions.) Such a question is referred to as a query. For example, with respect to the table above, a user might ask What are the names of the female students? Or, to state it in the form of a command: List the names of the female students. More examples: List the names of female freshmen; List the name and sex of each senior. In the relational database model, the answers to queries such as these are themselves tables (although such tables do not become a "permanent" part of the database). For example, the answer to the last query mentioned would be:
Name Sex
+----------+-----+
| John | M | List the name and sex
+----------+-----+ of each senior.
| Mary | F |
+----------+-----+
| Helen | F |
+----------+-----+
Notice that this table is obtained from the Student table by first eliminating all but those tuples corresponding to seniors (i.e., tuples whose Class attribute has value 4) and then eliminating all but the Name and Sex attributes. (You will see shortly that the elimination of tuples and attributes correspond to two of the fundamental operations used for answering queries.
Because computers' abilities to "understand" natural language (such as English) is very limited (at least to this point in time), a user must, in specifying a query, employ a more precise and formal notation. One standard mathematical notation used for expressing queries is called relational algebra (RA). A more user-friendly notation (based upon the same principles) is SQL (Standard Query Language). We will use a hybrid of the two, which we shall call sugared relational algebra (SRA).
Queries in RA are based upon the use of three elementary operations on tables: project, restrict, and join. (The "restrict" operation is usually called "select", but here we use the terminology of C.J. Date (prolific author on the subject of the relational model), in part because the SELECT verb in SQL has an entirely different meaning.)
Applying project to a table yields a copy of that table, but (possibly) with some of its attributes (i.e., columns) excluded. (Presumably, the excluded attributes are not of interest.) SRA's syntax for an application of the project operation is as follows:
For example, the SRA expression
Name Class
+--------+-----+
| Mary | 2 |
+--------+-----+
| John | 4 | PROJECT Name, Class
+--------+-----+ FROM Student
| Carol | 3 |
+--------+-----+
| Mary | 4 |
+--------+-----+
| Ann | 1 |
+--------+-----+
| Mike | 2 |
+--------+-----+
| Helen | 4 |
+--------+-----+
| Jim | 1 |
+--------+-----+
| Mike | 3 |
+--------+-----+
The query PROJECT Name, Class FROM Student, then, is nothing but a more formal way of stating the (English) query
(Note that Elmasri & Navathe (and almost everyone else) calls this
select, but we refrain from doing so because the meaning of
that term in SQL is quite different.)
Applying restrict to a table yields a copy of that table, but
(possibly) with some of its tuples (i.e., rows) excluded, namely those
tuples that fail to satisfy a specified condition.
(Presumably, any tuple failing to satisfy the condition is not of interest
to the user.)
In keeping with the syntactic style used for project, an application
of the restrict operation will be written in SRA in the following
way:
The condition is simply a boolean expression to be evaluated with respect
to a tuple. (Hence, any sensible condition will mention one or more
attributes of the table.)
For example, the expression
ID Name Sex Class
+------+----------+-----+-------+
| 6 | John | M | 4 |
+------+----------+-----+-------+
| 13 | Mike | M | 2 | RESTRICT Student
+------+----------+-----+-------+ WHERE Sex = 'M'
| 5 | Jim | M | 1 |
+------+----------+-----+-------+
| 16 | Mike | M | 3 |
+------+----------+-----+-------+
By using boolean (or logical, if you prefer)
operators (such as AND, OR, and NOT)
in the condition, we can express more complicated queries.
For example, if we wanted to form a table like Student,
except listing only students who are either (a) sophomore and male or
(b) female, we could write
With only the project and restrict operations, we have
the ability to describe many non-trivial queries. For example, suppose
that we wanted a list containing the name and sex of every senior.
From the examples above, it should be clear that the expression
In RA, this would be
This example illustrates that one query can be "nested" inside another. Indeed, as the RA syntax suggests, nesting one query inside another is nothing more than function composition.
As another example, suppose we wanted a list containing the name and sex of every student who is either a male sophomore or else a senior. An appropriate SRA query is
PROJECT Name, Sex
FROM (RESTRICT FROM Student
WHERE (Sex = 'M' AND Class = 2) OR (Class = 4)
)
The query is split over multiple lines simply because it is too long to
fit on a single line. In the more concise RA notation,
this would be
As queries get more complicated, requiring the use of two, three, or more nested applications of project and/or restrict, they become difficult to understand. Hence, we allow "temporary" tables to be given names. For example, the query above describing the list of names and sexes of students who are either male sophomores or seniors (of either sex) can be written as
(1) Temp ← RESTRICT FROM Student WHERE (Sex = 'M' AND Class = 2) OR (Class = 4) (2) Result ← PROJECT Name, Sex FROM TempHere, in line (1) we specify that the table obtained from doing the restrict is to be named (rather unimaginatively) Temp, and then we apply project to that table in line (2) in order to obtain the desired result, to which we give the name Result!
In any but the most trivial relational databases there will be two, three, or possibly many more tables. In such a setting, to answer most interesting queries will require the use of two or more tables. This is where the join operation becomes useful.
Suppose that, in addition to Student (as illustrated above), our database also includes a table whose purpose is to keep a record of which students are currently enrolled in which courses. (Such a table would arise from there being a many-to-many relationship type involving student and course entities.) Let us call this the Enrolled-In table; for the purposes of doing examples, suppose that its current contents are as follows:
Enrolled-In StuID CourseID
+-----+------------+
| 2 | CIL 102 |
+-----+------------+
| 2 | MATH 2 |
+-----+------------+
| 1 | HIST 7 |
+-----+------------+
| 7 | MATH 2 |
+-----+------------+
| 1 | ENGL 4 |
+-----+------------+
| 1 | MATH 2 |
+-----+------------+
| 13 | HIST 7 |
+-----+------------+
| 5 | HIST 7 |
+-----+------------+
| 5 | CIL 102 |
+-----+------------+
Note justifying the addition of such a table: (This can be skipped by the uninterested reader.) It is true that the information represented by the Enrolled_In table could be embedded within the Student table, assuming that we added an attribute for CourseID to that table. However, to do so would result in the duplication of much data, because, for each course a particular student was enrolled in, her ID, Name, Sex, and Class values would have to be repeated. For example, the fact that Mary was enrolled in both CIL 102 and MATH 2 would require that both of these rows/tuples appear in our expanded Student table:
ID Name Sex Class Enrolled-In
+------+----------+-----+-------+-------------+
| 2 | Mary | F | 2 | CIL 102 |
| 2 | Mary | F | 2 | MATH 2 |
+------+----------+-----+-------+-------------+
Duplication of data is undesirable for at least two reasons. One is
that it takes more storage space than necessary. Another is that it
makes the problem of maintaining "data integrity/consistency" much more
difficult. Suppose, for example, that Mary moves on to Class 3.
Then it becomes necessary to modify every tuple
storing data regarding Mary. If, for some reason, some tuples are
modified but others are not, we have an inconsistency in the data.
To avoid the need to maintain multiple tuples representing Mary, you might suggest that the Enrolled-In field be of type set of string, rather than just string. That way, the value { "CIL 102", "MATH 2" } could be assigned to the Enrolled-In field of the tuple representing Mary, indicating that the two courses in which Mary is enrolled are CIL 102 and MATH 2
This is a reasonable suggestion; however, for technical reasons it has
traditionally been a rule of the relational model that all attributes
must be viewed as being atomic, meaning, for example, that if an attribute
has a value that is a set, its individual members cannot be
extracted/identified by any operation of the relational model.
In the case of Mary's courses, this would not be acceptable, because it
would prevent us, for example, from using relational operations (such as
join described below) that could help us answer a query such as
"List the names of the courses in which Mary is enrolled." (Here we're
assuming that there is a table Courses in which each tuple
identifies a course by its ID and also gives its name.)
End of note.
Consider the following informal query:
A somewhat more formal way of saying this is
Produce a list of all ordered pairs (x,y) satisfying the condition that x is (the name of) a student and y is (the course ID of) a course in which x is enrolled.
The correct result would be a table containing the tuples in Enrolled-In, except with each student ID replaced by the corresponding student's name. Clearly, such a table cannot be constructed from the Student and Enrolled-In tables using only the project and restrict operations. (Indeed, neither of those operations allows us to construct a table whose tuples are formed by combining tuples from two (or more) different tables. But that's exactly what we need here, because all information about student names is in the Student table whereas all information about course ID's is in the Enrolled-In table.)
Consider how you might go about constructing an answer to this query. Most likely, you would scan through the tuples of Enrolled-In; for each such tuple e you would note the value of e[StuID] (i.e., the value in its StuID field) and then scan through the tuples of Student in search of the tuple s satisfying s[ID] = e[StuID] (i.e., having a matching value in its ID field). Upon finding it, you would place into the table under construction the tuple <s[Name],e[CourseID]>.
The crucial concept here is that of combining a tuple in one table with a matching tuple in another table. (In our example, what made one tuple match another was having the same value in their ID and StuID attributes, respectively. In a different query, the matching criterion would be different.)
This capability of combining matching tuples from two different tables
is precisely what the join operation provides. Specifically,
the SRA expression
yields as its value the table obtained by combining every pair of tuples (from the two indicated tables) that satisfy the matching condition.
In RA, the join operator is a bowtie symbol, but HTML has no similar symbol. So we will use the × operator, and put the join condition as a subscript to its right. Hence, we will write a join in RA as
To illustrate this idea, consider these two tables:
Table A Table B
V W X Y Z
+---+---+ +---+---+---+
| r | 2 | | 5 | g | p |
| t | 4 | | 4 | d | e |
| p | 6 | | 2 | m | q |
+---+---+ | 4 | t | f |
+---+---+---+
The result of the expression
JOIN A WITH B WHERE W = X is the table
V W X Y Z
+---+---+---+---+---+
| r | 2 | 2 | m | q |
| t | 4 | 4 | d | e |
| p | 6 | 4 | t | f |
+---+---+---+---+---+
Here we combined a tuple t in A with a tuple u in B iff t[W] = u[X]. This (i.e., comparing two attributes for equality) is the most common kind of join condition. However, it's not the only kind. Consider
The resulting table consists of all those combinations of tuples t and u from A and B, respectively, in which t[W] < u[X]:
V W X Y Z
+---+---+---+---+---+
| r | 2 | 5 | g | p |
| r | 2 | 4 | d | e |
| r | 2 | 4 | t | f |
| t | 4 | 5 | g | p |
+---+---+---+---+---+
Returning to our problem of devising a query that asks for the names of students and the ID's of courses in which they are enrolled, our first step is to employ the join operation as follows:
JOIN Enrolled-In WITH Student WHERE StuID = IDThe resulting table would be
StuID CourseID ID Name Sex Class +-----+------------+------+----------+-----+-------+ | 2 | CIL 102 | 2 | Mary | F | 2 | +-----+------------+------+----------+-----+-------+ | 2 | MATH 2 | 2 | Mary | F | 2 | +-----+------------+------+----------+-----+-------+ | 1 | HIST 7 | 1 | Ann | F | 1 | +-----+------------+------+----------+-----+-------+ | 7 | MATH 2 | 7 | Mary | F | 4 | +-----+------------+------+----------+-----+-------+ | 1 | ENGL 4 | 1 | Ann | F | 1 | +-----+------------+------+----------+-----+-------+ | 1 | MATH 2 | 1 | Ann | F | 1 | +-----+------------+------+----------+-----+-------+ | 13 | HIST 7 | 13 | Mike | M | 2 | +-----+------------+------+----------+-----+-------+ | 5 | HIST 7 | 5 | Jim | M | 1 | +-----+------------+------+----------+-----+-------+ | 5 | CIL 102 | 5 | Jim | M | 1 | +-----+------------+------+----------+-----+-------+Note that each tuple's first two columns come from a tuple in Enrolled-In and its last four columns come from a matching tuple in Student.
From the table immediately above (resulting from the application of join), we get the desired result simply by omitting every column except those corresponding to Name and CourseID. To do this, we apply project. Hence, the SRA query that we want is
PROJECT Name, CourseID
FROM (JOIN Enrolled-In WITH Student
WHERE StuID = ID
)
Returning to the problem of finding students who are either female or else both male and in the sophomore class, we could have written the solution as follows:
Females ← RESTRICT Student WHERE Sex = 'F' Males ← RESTRICT Student WHERE Sex = 'M' Sophomores ← RESTRICT Student WHERE Class = 2 Result ← Females ∪ (Males ∩ Sophomores)
For another example, suppose that we wanted to produce the list of students who are not enrolled in CIL 102. Your first inclination might be to produce a table containing the ID's of students enrolled in a course other than CIL 102 and then to join that with Student in order to obtain the corresponding names:
(1) Non_CIL102 ← PROJECT StuID FROM (RESTRICT EnrolledIn WHERE CourseID ≠ 'CIL 102')
(2) Result ← PROJECT Name
FROM (JOIN Student WITH Non_CIL102 WHERE Student.ID = Non_CIL102.StuID)
But this is incorrect, as it will yield the name of every student who
is enrolled in some course other than CIL 102. In particular,
students not enrolled in any courses will be (incorrectly) omitted and
students enrolled in CIL 102 as well as at least
one other course will be (incorrectly) included.
A correct approach would be to produce a table containing the ID's of all students and another table containing the ID's of students enrolled in CIL 102, and then to take the difference of the two, which will be a table including precisely the ID's of students not enrolled in CIL 102. This idea leads to the query
(1) All_ID ← PROJECT ID FROM Student
(2) CIL102_ID ← PROJECT StuID FROM (RESTRICT EnrolledIn WHERE CourseID = 'CIL 102')
(3) Non_CIL102(ID) ← All_ID - CIL102_ID
(4) Result ← PROJECT Name
FROM (JOIN Student WITH Non_CIL102 WHERE Student.ID = Non_CIL102.ID)
Notice that in step (3) above we took the difference of two tables each having a single attribute (ID and StuID, respectively), and we explicitly renamed the result's attribute ID.
As an example, suppose that our Student table included the attribute SAT_Score and that we wanted to find the average, minimum, and maximum values in that column. In SRA, we would write this query as
AGGREGATE AVG(SAT_Score), MIN(SAT_Score), MAX(SAT_Score) FROM StudentThe result will be a one-tuple table. Deriving its attribute names from the functions applied and the attributes to which they were applied, the table might look like this:
AVG_SAT_Score MIN_SAT_Score MAX_SAT_Score +--------------+---------------+-------------+ | 1040.75 | 913 | 1430 | +--------------+---------------+-------------+If we wanted this data only for seniors, say, we would have nested an application of RESTRICT in the FROM clause, as follows:
AGGREGATE AVG(SAT_Score), MIN(SAT_Score), MAX(SAT_Score) FROM (RESTRICT Student WHERE Class = 4)
Suppose that we want to know how many female seniors have SAT scores above 1100, and the average SAT score among those students. We could write this query like this:
AGGREGATE COUNT(*), AVG(SAT_Score)
FROM (RESTRICT Student
WHERE Sex = 'F' AND SAT_Score > 1100)
The result might look something like this:
COUNT* AVG_SAT_Score
+--------+---------------+
| 57 | 1240.75 |
+--------+---------------+
When, as above, an asterisk is specified as the argument of COUNT, it means that the resulting value should be a count of all the tuples in the "target" table (i.e., the table to which the operation is being applied). Had we used, say, COUNT(SAT_Score) instead, the resulting value would be the number of distinct values in the SAT_Score column (of the target table, which, in our example, includes only tuples corresponding to female seniors). If the target table has a key attribute, say K, COUNT(*) and COUNT(K) are equivalent.
Suppose that we wanted to know how many male students there were, and the average of their SAT scores, and similarly for females. From the examples above, it should be obvious that we could devise two separate queries to ascertain this information, one for males and another for females. It would be nice if we could do it all with a single query. And indeed we can, because we can use a grouping feature to partition the tuples of the target table into separate groups (according to the values in one or more attributes), in which case the aggregate functions are applied to each group of tuples separately.
For example, the SRA query
AGGREGATE Sex, COUNT(*), AVG(SAT_Score) FROM Student GROUP BY Sexwould produce a table such as the following:
Sex COUNT* AVG_SAT_Score +-----+--------+---------------+ | 'F' | 145 | 1040.5 | | 'M' | 137 | 1038.2 | +-----+--------+---------------+
What the first tuple of this table says (literally) is that the Student table has 145 tuples having value 'F' in the Sex column, and the average of the SAT_Score values in these tuples is 1040.5. The second tuple's interpretation is analogous, of course.
Had we omitted the mention of the Sex attribute in the first line of the query, we would have obtained the same table, except without the Sex attribute. Hence, we would have been able to conclude that there were 145 students of one sex and 137 of the other (each having the reported SAT score averages), but we wouldn't know which was which. For this reason, we usually include the grouping attribute(s) in the result. On the other hand, it would not make sense to mention a non-grouping attribute on the first line of an AGGREGATE query, except as the argument of one of the aggregate functions. For example, the following query makes no sense, because there is no way to supply a meaningful value for Name in each tuple of the result.
AGGREGATE Name, Sex, COUNT(*), AVG(SAT_Score) FROM Student GROUP BY Sex
As suggested by the phrase "grouping attribute(s)" above, there can be more than one grouping attribute. Thus, for example, if we wanted to group students by sex and class, we could have said
AGGREGATE Sex, Class, COUNT(*), AVG(SAT_Score) FROM Student GROUP BY Sex, ClassThe result would then have looked like this:
Sex Class COUNT* AVG_SAT_Score +-----+-------+--------+---------------+ | 'F' | 1 | 37 | 1023.5 | | 'F' | 2 | 40 | 1054.1 | | 'F' | 3 | 29 | 1033.7 | | 'F' | 4 | 35 | 1019.5 | | 'M' | 1 | 40 | 1044.7 | | . | . | . | . | | . | . | . | . | +-----+-------+--------+---------------+
As for the RA notation for aggregates, here are what the (valid)
queries above (where there were given in SRA notation) would like like
in RA notation:
ℑAVG(SAT_Score), MIN(SAT_Score), MAX(SAT_Score)(Student)
ℑAVG(SAT_Score), MIN(SAT_Score), MAX(SAT_Score)
(σClass=4(Student))
ℑCOUNT(*), AVG(SAT_Score)
(σSex = 'F' AND SAT_Score > 1100(Student))
SexℑSex, COUNT(*), AVG(SAT_Score)(Student)
Sex,ClassℑSex, Class, COUNT(*), AVG(SAT_Score)(Student)
1. List the sex and class of every student named "George".
2. List the names of all students enrolled in the course
CIL 102.
3. List the ID's of all courses in which there are enrolled
students from different classes. (In our example, CIL 102 would
be such a course, because both Mary (class 2) and Jim (class 1) are
enrolled in it.)
4. List each course ID together with the number of students
that are enrolled in it.
For the following exercises, suppose that Teaches was a third table in the database, with attributes Course_ID and Faculty_ID.
5. List the ID's of students enrolled in a course taught by
a faculty member whose ID is "Sarek".
6. List the name and class of any student enrolled in a course
taught by a faculty member whose ID is "Sarek".