CMPS 340 File Processing
(SQL) Query Processing
based on Chapter 15 of Elmasri & Navathe

Recall the most important operations in SQL/Relational Algebra:


The queries submitted by users to a (relational) DBMS are composed of such operations (often nested two or three levels deep).

Question: When a query is submitted to a DBMS, how does it "figure out" how to respond?

At a high level of abstraction, we can describe it as follows:

                  +----------------+                       +-----------+
  SQL query ----> |  translator    |---> intermediate ---> | query     |--->+
                  | (scan, parse)  |         form          | optimizer |    |
                  +----------------+                       +-----------+    |
                                                                            v
      +--------<---------------------------------<--------------------------+
      |
      v           +---------------+
  execution --->  |  query code   |--->  code to execute 
    plan          |   generator   |        the query
                  +---------------+

Basic Algorithms for Executing (single) Query Operations

Restrict: Consider the query   Restrict R Where C

Among the ways to evaluate it are these:

  1. Brute Force:
          for each tuple t in R {
             if t satisfies C
                { write t into result; }
          }
        
  2. Binary Search: can be used if the restriction condition (C) is an equality comparison on the ordering field of the file containing relation R. An example would be
    Restrict EMPLOYEE Where NAME = 'Rumplestiltskin'

    Can also be used if the query asks for a range of values in the ordering field. Example:

               Restrict EMPLOYEE 
               Where NAME >= 'Rumplestiltskin'  &  NAME <= 'Smith'
    Here you would use binary search to find the first record in the range and then sequentially access the rest.

  3. index or hash table: if the restriction condition involves an equality comparison on a field for which an index (flat, B+-tree, or hash (scatter) table) exists, use it to obtain pointers to the relevant record(s), then fetch them.

    This can work for a range query, too (in the cases when the index is not based on hashing), as in

               Restrict EMPLOYEE 
               Where (30000 < SALARY) & (SALARY < 50000)

When restriction conditions are compound:


Project: Consider the query Project A1, A2, ..., An From R

Assuming that we want the result not to contain any duplicate tuples, among the ways to evaluate it are:

  1. Sort-based: Let R' be the relation obtained from R by omitting all attributes except A1, A2, ..., An. Sort R' and, while (or after) doing so, omit duplicate tuples. Note that the resulting table will be sorted, which is not required.

  2. Hashing-based:
        for each tuple r in R {
           Let r' be tuple obtained by omitting all unwanted attributes;
           insert r' into (the appropriate bucket of) hash table T;
        }
        for each bucket B in T {
           sort B;
           write to result the tuples in B, omitting duplicates;
        }
    What's the point of using the hash table? First, because any collection of duplicate tuples will end up in the same bucket together, it guarantees that such duplicates can be identified (making it easy to omit all but one of them from the result). Secondly, it is cheaper to sort lots of little collections of records than to sort a single collection containing all of these records. Also, as the tuples in the resulting relation/table need not be in any particular order, we don't mind that this solution does not produce a sorted relation.

Join: Consider the query Join R With S Where R.A = S.B

Among the ways of evaluating it are these:

  1. Ultra-Brute Force:
           Let T = R×S;  (i.e., { r×s | r in R  and  s in S } )
           Restrict T Where R.A = S.B
    The approach is horrible, because the size of T will be m×n, where m and n are the sizes, respectively, of R and S, and hence is not unlikely to be huge.

  2. Brute force (nested loop join):
           for each tuple r in R {
              for each tuple s in S {
                 if (r.A = s.B) { write r×s into result; }
              }
           }

    Approaches (1) and (2) are the same, except that in (2) we don't explicitly construct all of R×S (which could be very big) before filtering out the tuples that fail to satisfy the R.A = S.B condition. Both approaches take time O(m × n), where m is the size of R and n is the size of S, but Ultra Brute Force always takes O(m × n) space, whereas Brute Force is not likely to.

  3. Single loop join (assumes existence of an access path (i.e., index) for S.B):
           for each tuple r in R {
              Let S' = { s in S | r.A = S.B }
              for each s in S'
                 { write r×s into result; }
           }

    The idea here is that the tuples in S' are obtained using the existing index on the B attribute of S.

    If the set R' = { r.A | r in R } (consisting of all the distinct values found in attribute A of tuples in R) is significantly smaller than the number of tuples in R, then pre-sorting R with respect to A could save time, because then the number of times that S need be searched to find tuples matching the current tuple of R is reduced from the cardinality of R to the cardinality of R', as follows:

           sort R with respect to A;
           r = first tuple of R;
           while (r != null)    // r == null means that no more tuples exist
              crrntA = r.A;
              S' = { s in S | crrntA = S.B }
              while (r.A = crrntA) {
                 for each s in S'
                    { write r×s into result; }
                 r = next tuple of R;
              }
           }  
        
    Notice that the outer loop iterates once for each distinct A-value occurring among the tuples in R, rather than once for each tuple in R, and S' is computed once during each iteration of the outer loop.

  4. Sort-merge join: Sort the tuples of R and S with respect to attributes A and B, resp., and then use cosequential processing to combine the matching tuples. (Of course, if either relation is already ordered in the way we want, we need not sort it.)
        sort R wrt A;
        sort S wrt B;
        // now do cosequential processing similar to Balanced Line algorithm
        r = first tuple in R;
        s = first tuple in S;
        while (r.A != +∞) && (s.B != +∞) {
           if (r.A < s.B)
              { r = next tuple in R; }
           else if (r.A > s.B)
              { s = next tuple in S; }
           else {  // r.A == s.B 
              let z = r.A
              // next loop is tricky to implement
              for every r' in R and s' in S such that r'.A = z = s'.B
                 { write r'×s' to result }
           }
           // at this point r and s are first tuples in R and S for which
           // r.A > A value from above and s.B > B value from above
        } 
    Running time is O(m*log m + n*log n + max(m+n,k)), where m and n are the lengths of R and S and k is the length of the result. The first term comes from sorting R, the second from sorting S, and the third comes from performing the cosequential processing.

    If secondary indexes exist, could use them (rather than sorting) but this could be very inefficient as the records could be scattered all over the place.

  5. Hash-join (for the case in which the smaller of the two relations (here assumed to be R) fits in RAM):
         Let h be a hash function
         for each tuple r in R
            { place r into bucket h(r.A) of a hash table; }
         for each tuple s in S { 
            for each tuple r in bucket h(s.B) that matches s {
               write r×s to result;
            }
         } 
    This approach is analogous to nested loop join, but its running time is better by a constant factor approximating the number of distinct addresses in the hash table.

  6. Partition Hash Join:
         Let h be a hash function;
         for each tuple r in R
            { place r into bucket h(r.A) of hash table T1 }
         for each tuple s in S
            { place s into bucket h(s.B) of hash table T2 }
    
         // at this point, any pair of matching tuples is in
         // corresponding buckets of T1 and T2
         for each i in 0..M-1 {  // M = # of buckets in each hash table
            process bucket i of T1 with bucket i of T2 to find matching tuples
            and write them to result;
         }
    Again, this approach is analogous to nested loop join, but its running time is better by a constant factor approximating the number of distinct addresses in the hash table.