Recall the most important operations in SQL/Relational Algebra:
Example:
In effect, Restrict acts as a "filter" on the tuples of a relation, keeping some and omitting others.
Example:
In effect, a PROJECT omits some of the attributes (columns) of a relation.
Example:
Join WORKS_ON With DEPARTMENT
Where WORKS_ON.ESSN = DEPARTMENT.MGRSSN
The queries submitted by users to a (relational) DBMS are composed of such operations (often nested two or three levels deep).
Question: When a query is submitted to a DBMS, how does it "figure out" how to respond?
At a high level of abstraction, we can describe it as follows:
+----------------+ +-----------+
SQL query ----> | translator |---> intermediate ---> | query |--->+
| (scan, parse) | form | optimizer | |
+----------------+ +-----------+ |
v
+--------<---------------------------------<--------------------------+
|
v +---------------+
execution ---> | query code |---> code to execute
plan | generator | the query
+---------------+
Among the ways to evaluate it are these:
for each tuple t in R {
if t satisfies C
{ write t into result; }
}
Can also be used if the query asks for a range of values in the
ordering field. Example:
Restrict EMPLOYEE
Where NAME >= 'Rumplestiltskin' & NAME <= 'Smith'
Here you would use binary search to find the first record in the
range and then sequentially access the rest.
This can work for a range query, too (in the cases when the index is not based on hashing), as in
Restrict EMPLOYEE
Where (30000 < SALARY) & (SALARY < 50000)
When restriction conditions are compound:
Project: Consider the query Project A1, A2, ..., An From R
Assuming that we want the result not to contain any duplicate tuples, among the ways to evaluate it are:
for each tuple r in R {
Let r' be tuple obtained by omitting all unwanted attributes;
insert r' into (the appropriate bucket of) hash table T;
}
for each bucket B in T {
sort B;
write to result the tuples in B, omitting duplicates;
}
What's the point of using the hash table? First, because any
collection of duplicate tuples will end up in the same bucket together,
it guarantees that such duplicates can be identified (making it
easy to omit all but one of them from the result).
Secondly, it is cheaper
to sort lots of little collections of records than to sort a
single collection containing all of these records. Also, as the
tuples in the resulting relation/table need not be in any particular
order, we don't mind that this solution does not produce a sorted
relation.
Among the ways of evaluating it are these:
Let T = R×S; (i.e., { r×s | r in R and s in S } )
Restrict T Where R.A = S.B
The approach is horrible, because the size of T will be
m×n, where m and n are the sizes,
respectively, of R and S, and hence is not unlikely to be huge.
for each tuple r in R {
for each tuple s in S {
if (r.A = s.B) { write r×s into result; }
}
}
Approaches (1) and (2) are the same, except that in (2) we don't explicitly construct all of R×S (which could be very big) before filtering out the tuples that fail to satisfy the R.A = S.B condition. Both approaches take time O(m × n), where m is the size of R and n is the size of S, but Ultra Brute Force always takes O(m × n) space, whereas Brute Force is not likely to.
for each tuple r in R {
Let S' = { s in S | r.A = S.B }
for each s in S'
{ write r×s into result; }
}
The idea here is that the tuples in S' are obtained using the existing index on the B attribute of S.
If the set R' = { r.A | r in R } (consisting of all the distinct values found in attribute A of tuples in R) is significantly smaller than the number of tuples in R, then pre-sorting R with respect to A could save time, because then the number of times that S need be searched to find tuples matching the current tuple of R is reduced from the cardinality of R to the cardinality of R', as follows:
sort R with respect to A;
r = first tuple of R;
while (r != null) // r == null means that no more tuples exist
crrntA = r.A;
S' = { s in S | crrntA = S.B }
while (r.A = crrntA) {
for each s in S'
{ write r×s into result; }
r = next tuple of R;
}
}
Notice that the outer loop iterates once for each distinct
A-value occurring among the tuples in R,
rather than once for each tuple in R, and S'
is computed once during each iteration of the outer loop.
sort R wrt A;
sort S wrt B;
// now do coseqential processing similar to Balanced Line algorithm
r = first tuple in R;
s = first tuple in S;
while (r.A != +infinity) && (s.B != +infinity) {
if (r.A < s.B)
{ r = next tuple in R; }
else if (r.A > s.B)
{ s = next tuple in S; }
else { // r.A == s.B
let z = r.A
// next loop is tricky to implement
for every r' in R and s' in S such that r'.A = z = s'.B
{ write r'×s' to result }
}
// at this point r and s are first tuples in R and S for which
// r.A > A value from above and s.B > B value from above
}
Running time is O(m*log m + n*log n + max(m+n,k)), where m and n
are the lengths of R and S and k is the length of the result.
The first term comes from sorting R, the second from sorting S, and the
third comes from performing the cosequential processing.
If secondary indexes exist, could use them (rather than sorting) but this could be very inefficient as the records could be scattered all over the place.
Let h be a hash function
for each tuple r in R
{ place r into bucket h(r.A) of a hash table; }
for each tuple s in S {
for each tuple r in bucket h(s.B) that matches s {
write r×s to result;
}
}
This approach is analogous to nested loop join, but
its running time is better by a constant factor approximating
the number of distinct addresses in the hash table.
Let h be a hash function;
for each tuple r in R
{ place r into bucket h(r.A) of hash table T1 }
for each tuple s in S
{ place s into bucket h(s.B) of hash table T2 }
// at this point, any pair of matching tuples is in
// corresponding buckets of T1 and T2
for each i in 0..M-1 { // M = # of buckets in each hash table
process bucket i of T1 with bucket i of T2 to find matching tuples
and write them to result;
}
Again, this approach is analogous to nested loop join, but
its running time is better by a constant factor approximating
the number of distinct addresses in the hash table.