Importance: Database systems have become an essential component of life in modern society, in that many frequently occurring events trigger the accessing of at least one database: bibliographic library searches, bank transactions, hotel/airline reservations, grocery store purchases, online (Web) purchases, etc., etc.
Traditional vs. more recent applications of databases:
The applications mentioned above are all "traditional" ones for which the use of rigidly-structured textual and numeric data suffices. Recent advances have led to the application of database technology to a wider class of data. Examples include multimedia databases (involving pictures, video clips, and sound messages) and geographic databases (involving maps, satellite images).
Also, database search techniques are applied by some WWW search engines.
The term database is often used, rather loosely, to refer to just about any collection of related data. E&N say that, in addition to being a collection of related data, a database must have the following properties:
To summarize: a database has some source (i.e., the miniworld) from which data are derived, some degree of interaction with events in the represented miniworld (at least insofar as the data is updated when the state of the miniworld changes), and an audience that is interested in using it.
An Aside: data vs. information vs. knowledge: Data is the representation of "facts" or "observations" whereas information refers to the meaning thereof (according to some interpretation). Knowledge, on the other hand, refers to the ability to use information to achieve intended ends.
Computerized vs. manual: Not surprisingly (this being a CS course), our concern will be with computerized database systems, as opposed to manual ones, such as the card catalog-based systems that were used in libraries in ancient times (i.e., before the year 2000). (Some authors wouldn't even recognize a non-computerized collection of data as a database, but E&N do.)
Size/Complexity: Databases run the range from being small/simple (e.g., one person's recipe database) to being huge/complex (e.g., Amazon's database that keeps track of all its products, customers, and suppliers).
Definition: A database management system (DBMS) is a collection of programs enabling users to create and maintain a database.
More specifically, a DBMS is a general purpose software system facilitating each of the following (with respect to a database):
Given all its responsibilities, it is not surprising that a typical DBMS is a complex piece of software.
A database together with the DBMS software is referred to as a database system. (See Figure 1.1, page 7.)
Among the main ideas illustrated in this example is that each file/relation/table has a set of named fields/attributes/columns, each of which is specified to be of some data type. (In addition to a data type, we might put further restrictions upon a field, e.g., GRADE_REPORT must have a value from the set {'A', 'B', ..., 'F'}.)
The idea is that, of course, each table will be populated with data in the form of records/tuples/rows, each of which represents some entity (in the miniworld) or some relationship between entities.
For example, each record in the STUDENT table represents a ---surprise!--- student. Similarly for the COURSE and SECTION tables.
On the other hand, each record in GRADE_REPORT represents a relationship between a student and a section of a course. And each record in PREREQUISITE represents a relationship between two courses.
Database manipulation involves querying and updating.
Examples of (informal) queries:
Examples of (informal) updates:
Of course, a query/update must be conveyed to the DBMS in a precise way (via the query language of the DBMS) in order to be processed.
As with software in general, developing a new database (or a new application for an existing database) proceeds in phases, including requirements analysis and various levels of design (conceptual (e.g., Entity-Relationship Modeling), logical (e.g., relational), and physical (file structures)).
Database approach vs. File Processing approach: Consider an organization/enterprise that is organized as a collection of departments/offices. Each department has certain data processing "needs", many of which are unique to it. In the file processing approach, each department would control a collection of relevant data files and software applications to manipulate that data.
For example, a university's Registrar's Office would maintain data (and programs) relevant to student grades and course enrollments. The Bursar's Office would maintain data (and programs) pertaining to fees owed by students for tuition, room and board, etc. (Most likely, the people in these offices would not be in direct possession of their data and programs, but rather the university's Information Technology Department would be responsible for providing services such as data storage, report generation, and programming.)
One result of this approach is, typically, data redundancy, which not only wastes storage space but also makes it more difficult to keep changing data items consistent with one another, as a change to one copy of a data item must be made to all of them (called duplication-of-effort). Inconsistency results when one (or more) copies of a datum are changed but not others. (E.g., If you change your address, informing the Registrar's Office should suffice to ensure that your grades are sent to the right place, but does not guarantee that your next bill will be, as the copy of your address "owned" by the Bursar's Office might not have been changed.)
In the database approach, a single repository of data is maintained that is used by all the departments in the organization. (Note that "single repository" is used in the logical sense. In physical terms, the data may be distributed among various sites, and possibly mirrored.)
Main Characteristics of database approach:
See Figures 1.1 and 1.3.
The system catalog is used not only by users (e.g., who need to know the names of tables and attributes, and sometimes data type information and other things), but also by the DBMS software, which certainly needs to "know" how the data is structured/organized in order to interpret it in a manner consistent with that structure. Recall that a DBMS is general purpose, as opposed to being a specific database application. Hence, the structure of the data cannot be "hard-coded" in its programs (such as is the case in typical file processing approaches), but rather must be treated as a "parameter" in some sense.
Program-Data Independence: In traditional file processing, the structure of the data files accessed by an application is "hard-coded" in its source code. (E.g., Consider a file descriptor in a COBOL program: it gives a detailed description of the layout of the records in a file by describing, for each field, how many bytes it occupies.)
If, for some reason, we decide to change the structure of the data (e.g., by adding the first two digits to the YEAR field, in order to make the program Y2K compliant!), every application in which a description of that file's structure is hard-coded must be changed!
In contrast, DBMS access programs, in most cases, do not require such changes, because the structure of the data is described (in the system catalog) separately from the programs that access it and those programs consult the catalog in order to ascertain the structure of the data (i.e., providing a means by which to determine boundaries between records and between fields within records) so that they interpret that data properly.
See Figure 1.4.
In other words, the DBMS provides a conceptual or logical view of the data to application programs, so that the underlying implementation may be changed without the programs being modified. (This is referred to as program-data independence.)
Also, which access paths (e.g., indexes) exist are listed in the catalog, helping the DBMS to determine the most efficient way to search for items in response to a query.
Note: In fairness to COBOL, it should be pointed out that it has a COPY feature that allows different application programs to make use of the same file descriptor stored in a "library". This provides some degree of program-data independence, but not nearly as much as a good DBMS does. End of note.
Example by which to illustrate this concept: Suppose that you are given the task of developing a program that displays the contents of a particular data file. Specifically, each record should be displayed as follows:
Record #i:
value of first field
value of second field
...
...
value of last field
To keep things very simple, suppose that the file in question has
fixed-length records of 57 bytes with six fixed-length fields of
lengths 12, 4, 17, 2, 15, and 7 bytes, respectively, all of which
are ASCII strings.
Developing such a program would not be difficult. However, the
obvious solution would be tailored specifically for a file having
the particular structure described here and would be of no use for
a file with a different structure.
Now suppose that the problem is generalized to say that the program you are to develop must be able to display any file having fixed-length records with fixed-length fields that are ASCII strings. Impossible, you say? Well, yes, unless the program has the ability to access a description of the file's structure (i.e., lengths of its records and the fields therein), in which case the problem is not hard at all. This illustrates the power of metadata, i.e., data describing other data.
As another example, a Registrar's Office employee might think that GPA is a field of data in each student's record. In reality, the underlying database might calculate that value each time it is needed. This is called virtual (or derived) data.
A view designed for an academic advisor might give the appearance that the data is structured to point out the prerequisites of each course.
(See Figure 1.5, page 14.)
A good DBMS has facilities for defining multiple views. This is not only convenient for users, but also addresses security issues of data access. (E.g., The Registrar's Office view should not provide any means to access financial data.)
Arising from this is the need for concurrency control, which is supposed to ensure that several users trying to update the same data do so in a "controlled" manner so that the results of the updates are as though they were done in some sequential order (rather than interleaved, which could result in data being incorrect).
This gives rise to the concept of a transaction, which is a process that makes one or more accesses to a database and which must have the appearance of executing in isolation from all other transactions (even ones that access the same data at the "same time") and of being atomic (in the sense that, if the system crashes in the middle of its execution, the database contents must be as though it did not execute at all).
Applications such as airline reservation systems are known as online transaction processing applications.
These apply to "large" databases, not "personal" databases that are defined, constructed, and used by a single person via, say, Microsoft Access.
On the other hand, redundancy can be used to improve performance of queries. Indexes, for example, are entirely redundant, but help the DBMS in processing queries more quickly.
Another example of using redundancy to improve performance is to store an "extra" field in order to avoid the need to access other tables (as when doing a JOIN, for example). See Figure 1.6 (page 18): the StudentName and CourseNumber fields need not be there.
A DBMS should provide the capability to automatically enforce the rule that no inconsistencies are introduced when data is updated. (Figure 1.6 again, in which Student_name does not match Student_number.)
The query processing and optimization module is responsible for choosing an efficient query execution plan for each query submitted to the system. (See Chapter 15.)
Another kind of constraint is referential integrity, which says that if the database includes an entity that refers to another one, the latter entity must exist in the database. For example, if (R56547, CIL102) is a tuple in the Enrolled_In relation, indicating that a student with ID R56547 is taking a course with ID CIL102, there must be a tuple in the Student relation corresponding to a student with that ID.
Active database systems go one step further by allowing "active rules" that can be used to initiate actions automatically.