At one level of abstraction, a file is simply a stored sequence of bytes, a view that is directly supported in some programming languages. The programs (and users) that access a file typically view it from a somewhat higher level of abstraction, however. A file containing text, such as source code or an e-mail message, is often viewed as a sequence of lines, each of which is a sequence of characters. A file used for storing business-oriented data or part of a database is typically viewed as being comprised of records, each of which is comprised of fields. A record describes an entity (e.g., an employee, a business transaction, a kind of product sold by a retail store, a section of an academic course), while a field describes one attribute of such an entity (e.g., an employee's SSN, the amount of money exchanged in a transaction, the number of units of a product in stock, the name of a course).
In order to read data from a file and to interpret it correctly, there must be some way to find the boundaries between adjacent records and to find, within each record, the boundaries between adjacent fields. (To keep things simple, for the moment we assume that all records in the file have the same basic structure in the sense that they all have the same collection of fields occurring in the same order.) How can this be accomplished?
To illustrate the problem, suppose that we have a file in which each record describes a person, including his/her name and address. Further suppose that two adjacent records describe John Ames and Mary Johnson, as follows:
Name:                          | Name:
   First_Name: John            |    First_Name: Mary
   Last_Name: Ames             |    Last_Name: Johnson
Address:                       | Address:
   Street_Addr: 234 Elm St.    |    Street_Addr: RD #4
   City: Stillwater            |    City: Moscow
   State: OK                   |    State: PA
   Zip_Code: 74051             |    Zip_Code: 18444
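Now consider precisely that segment of the file in which these records are stored. If the records are stored in consecutive bytes, as are the fields within each record (and all data is assumed to be encoded as character strings), this portion of the file will be as follows:
+-----------------------------------------------------------------+
|JohnAmes234 Elm St.StillwaterOK74051MaryJohnsonRD #4MoscowPA18444|
+-----------------------------------------------------------------+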
This clearly presents a problem for a programmer attempting to write software to interpret this data, because there is no way to determine where one record ends and another begins. Nor is there any way to tell, within a record, where one field ends and another begins.
First let's consider how to address this problem with respect to fields, and for the moment assume that, somehow, we can isolate records from one another.
Fixed-length Fields: One way to enable a program to distinguish one field from another is to fix the length of each field. That is, for each field appearing in the records of the file (e.g., First_Name, Last_Name, Street_Addr), we format the file so that, in all of its records, the number of bytes used for storing the value of that field is the same.
COBOL (as well as a number of other languages) supports this approach directly with its data declaration mechanism, in which each field's length is explicitly specified. If we choose the lengths of the six atomic fields in our example to be 12, 15, 16, 13, 2, and 5, respectively, the records would be laid out as follows:
          1         2         3         4         5         6
 123456789012345678901234567890123456789012345678901234567890123
+---------------------------------------------------------------+
|John        Ames           234 Elm St.     Stillwater   OK74051|
+---------------------------------------------------------------+
          1         2         3         4         5         6
 123456789012345678901234567890123456789012345678901234567890123
+---------------------------------------------------------------+
|Mary        Johnson        RD #4           Moscow       PA18444|
+---------------------------------------------------------------+
The bytes following that which stores the last meaningful character in a field are typically filled with spaces or null characters, although these are not the only possibilities. We refer to this as padding. Clearly, with this approach, it is easy to find the beginning (and end) of each field within a record, assuming that we can find the beginning of the record itself.
On the negative side, this approach tends to result in storage space being wasted, at least in those fields containing values that vary widely in terms of how much space is "needed" to encode them. For example, some last names are much longer than others. Hence, we decided to allocate fifteen bytes to the Last_Name field, even though most values we would expect to occupy that field can be stored in several fewer bytes. If we were to allocate fewer bytes to that field, we would waste less space, but only at the possible cost of having to truncate (or otherwise abbreviate) the values occupying that field in one or more records. Whether or not a truncated value is acceptable in a particular field depends upon the application, of course.
Notice that, in our sample records, approximately half of the 126 bytes allocated for storing the two records are occupied by spaces serving as padding between fields.
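The fixed-length layout can be sketched in a few lines of Python. This is only an illustration: the field widths are the ones chosen above, but the helper names pack_record and unpack_record, and the choice of spaces as padding, are assumptions of the sketch.

```python
# Field widths from the text: First_Name, Last_Name, Street_Addr,
# City, State, Zip_Code.
WIDTHS = [12, 15, 16, 13, 2, 5]
RECORD_LEN = sum(WIDTHS)  # 63 bytes per record

def pack_record(values):
    """Truncate or space-pad each value to its fixed width, then concatenate."""
    return "".join(v[:w].ljust(w) for v, w in zip(values, WIDTHS))

def unpack_record(record):
    """Cut the record at the fixed field boundaries and strip the padding."""
    fields, pos = [], 0
    for w in WIDTHS:
        fields.append(record[pos:pos + w].rstrip())
        pos += w
    return fields

people = [
    ["John", "Ames", "234 Elm St.", "Stillwater", "OK", "74051"],
    ["Mary", "Johnson", "RD #4", "Moscow", "PA", "18444"],
]
records = [pack_record(p) for p in people]

# Padding = bytes allocated minus bytes actually needed by the values.
data_bytes = sum(len(v) for p in people for v in p)
padding = len(records) * RECORD_LEN - data_bytes
print(padding, "of", len(records) * RECORD_LEN, "bytes are padding")
```

Note that stripping trailing spaces on the way out cannot distinguish padding from spaces that were genuinely part of a value, which is one reason null characters are sometimes preferred as padding.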
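Variable-length Fields: To save space, as well as to avoid the occasional need for truncating field values, we would prefer to allocate to each field precisely as much space as is needed to store its value, no more and no less. Figure 1 suggests that it may not be possible to achieve this without sacrificing the ability to find the boundaries between fields, however. Luckily, it turns out that, if we are willing to use a little extra space for each field, we need not make this sacrifice after all. Two approaches, delimiters and length indicators, are as follows:
Delimiters: We follow each variable-length field with a special character, called a delimiter, that marks where the field ends. Using '$' as the delimiter, the record describing John Ames would be like this:
+----------------------------------------+
|John$Ames$234 Elm St.$Stillwater$OK74051|
+----------------------------------------+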
Notice that we did not place a delimiter between the State and Zip_Code fields. This illustrates that, when a field is naturally fixed in length, we can safely omit a trailing delimiter, assuming that any software that interprets the record "knows" the appropriate field length.
The main problem with using delimiters (or separators, as they are sometimes called) is that we must identify some character (or, more generally, some byte value, or byte sequence) that cannot occur as part of the value of a field. In some cases, this may not be possible.
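Here is a small Python sketch of the delimiter approach, including the check that the main problem calls for. The choice of '$' as the delimiter and the function names encode and decode are assumptions made for illustration.

```python
DELIM = "$"  # assumed delimiter; it must never occur inside a field value

def encode(first, last, street, city, state, zip_code):
    """Delimit the variable-length fields; State (2 bytes) and
    Zip_Code (5 bytes) are naturally fixed in length, so no
    delimiter is placed between them."""
    fields = (first, last, street, city, state, zip_code)
    for value in fields:
        if DELIM in value:
            raise ValueError(f"delimiter {DELIM!r} occurs in value {value!r}")
    return DELIM.join([first, last, street, city, state + zip_code])

def decode(record):
    """Split on the delimiter, then cut the fixed-length tail in two."""
    first, last, street, city, tail = record.split(DELIM)
    return [first, last, street, city, tail[:2], tail[2:]]

rec = encode("John", "Ames", "234 Elm St.", "Stillwater", "OK", "74051")
print(rec)  # John$Ames$234 Elm St.$Stillwater$OK74051
```

Rejecting a value that contains the delimiter is only one possible policy; an application could instead escape the delimiter, at the cost of a more complicated decoder.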
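Length indicators: Rather than marking where each field ends, we record how long each field is. The length indicators can be stored either all together at the beginning of the record or individually, each immediately preceding the field it describes. Following the former approach, the record would be like this:
+---------------------------------------------+
|4 4 11 10JohnAmes234 Elm St.StillwaterOK74051|
+---------------------------------------------+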
Note that the spaces between length indicators are there only for the reader's benefit; they are not really part of the stored data.
Following the latter approach, the record would be like this:
+------------------------------------------+
|4John4Ames11234 Elm St.10StillwaterOK74051|
+------------------------------------------+
The astute reader may be asking how we can find the boundaries between length indicators (in the former case) or between the length indicator and the field it describes (in the latter). We could, of course, use delimiters, but then the length indicators themselves would be superfluous! A more likely answer is that, for each length indicator, we would fix its length. Furthermore, their values would be encoded in binary mode rather than text. For example, the value eleven (giving the length of Street_Addr) would be stored not in two bytes as the string "11" (with each byte holding the value 00110001, which is the extended-ASCII representation of the character '1') but rather in a single byte with value 00001011 (which is the 8-bit binary representation of eleven). By using binary representation, a single byte is capable of encoding any integer value in the range 0..255, which covers the range of lengths of most fields. If we anticipate field values exceeding a length of 255, two bytes could be allocated to hold that field's length indicator, yielding a possible range of 0..65535.
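The following Python sketch illustrates one-byte binary length indicators, each placed immediately before the field it describes. The function names and the decision to leave State and Zip_Code without indicators (because they are naturally fixed in length) are assumptions of the sketch.

```python
def encode(first, last, street, city, state, zip_code):
    """Prefix each variable-length field with a one-byte binary length."""
    out = bytearray()
    for value in (first, last, street, city):
        data = value.encode("ascii")
        if len(data) > 255:
            raise ValueError("field too long for a one-byte length indicator")
        out.append(len(data))  # e.g., eleven is stored as the byte 00001011
        out += data
    out += (state + zip_code).encode("ascii")  # fixed length: no indicator
    return bytes(out)

def decode(record):
    """Read a length byte, then that many bytes of field data, four times."""
    fields, pos = [], 0
    for _ in range(4):
        n = record[pos]
        fields.append(record[pos + 1:pos + 1 + n].decode("ascii"))
        pos += 1 + n
    tail = record[pos:].decode("ascii")
    return fields + [tail[:2], tail[2:]]

rec = encode("John", "Ames", "234 Elm St.", "Stillwater", "OK", "74051")
print(len(rec))               # 36 bytes of data plus 4 length bytes
print(format(rec[10], "08b"))  # the length indicator of Street_Addr
```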
Yet another approach is to enclose each field's value between a pair of named tags, as in XML:
<record>
   <Name>
      <FirstName>John</FirstName>
      <LastName>Ames</LastName>
   </Name>
   <Address>
      <StreetAddress>234 Elm St.</StreetAddress>
      <City>Stillwater</City>
      <State>OK</State>
      <ZipCode>74051</ZipCode>
   </Address>
</record>
This approach limits us to encoding everything as text and it also
takes a lot of space due to all the tags.
So far we've explored how to encode fields so as to make it possible to find the boundaries between them, which is a prerequisite for being able to interpret the contents of a record. Now we turn to the question of how to encode records so that the boundaries between them can be found. Not surprisingly, the answers are quite similar to those that apply to fields.
Fixing the length of records does not, however, imply that all fields within them must be of fixed length. Consider Linda Ames-Rumpelstiltskin, John's sister, who chose to combine her maiden name with her husband's name, and who lives next door to John:
Name:
   First_Name: Linda
   Last_Name: Ames-Rumpelstiltskin
Address:
   Street_Addr: 236 Elm St.
   City: Stillwater
   State: OK
   Zip_Code: 74051
Using the fixed-length field approach from above, and truncating her last name to 15 bytes, corresponding to the length of the Last_Name field, we get
          1         2         3         4         5         6
 123456789012345678901234567890123456789012345678901234567890123
+---------------------------------------------------------------+
|Linda       Ames-Rumpelstil236 Elm St.     Stillwater   OK74051|
+---------------------------------------------------------------+
Recall that the lengths of the first four fields are 12, 15, 16, and 13
bytes, respectively, for a total of 56 bytes. Suppose that, instead
of fixing their lengths individually, we fixed their total length to
be 56 (including delimiters or length indicators). In that case, we
could store her record, without truncating any field values, as follows:
          1         2         3         4         5         6
 123456789012345678901234567890123456789012345678901234567890123
+---------------------------------------------------------------+
|Linda$Ames-Rumpelstiltskin$236 Elm St.$Stillwater       OK74051|
+---------------------------------------------------------------+
Suppose, for example, that the first four records in the file had lengths 45, 82, 63, and 39. If we wished to access the 4th record (counting starting at zero), we could add the lengths of records 0 through 3, giving 229, which would tell us that record 4 begins at byte 229 of the file. Of course, more generally, to calculate the location at which the k-th record begins, one would add the lengths of records 0 through k-1.
To avoid all that adding, we could store the cumulative record lengths rather than the individual record lengths. In our example, these would be 45, 127, 190, and 229. Counting from zero, the k-th such value tells us the location at which the k+1-st record begins. (For example, the 3rd value in our list is 229, which gives the starting location of the 4th record.)
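The difference between the two schemes can be sketched in Python as follows. The lengths are those of the example; the function names start_by_summing and start_by_lookup are hypothetical.

```python
from itertools import accumulate

lengths = [45, 82, 63, 39]              # lengths of records 0..3
cumulative = list(accumulate(lengths))  # running totals of those lengths

def start_by_summing(k):
    """Offset of record k, found by adding the lengths of records 0..k-1."""
    return sum(lengths[:k])

def start_by_lookup(k):
    """Offset of record k, read directly from the cumulative lengths."""
    return 0 if k == 0 else cumulative[k - 1]

print(cumulative)          # [45, 127, 190, 229]
print(start_by_lookup(4))  # 229
```

The lookup takes the same time for every k, whereas the summation grows with k. The price is that changing the length of one record requires updating every cumulative value that follows it.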
Because record lengths could very well be in the hundreds or thousands of bytes, length indicators for records are likely to be chosen to be at least two bytes long. Cumulative record lengths could very well require four bytes, which allows for values up to 2^32 - 1 (slightly beyond four billion).
Some files, once created, remain static. However, the more interesting case is the one in which a file's contents change during its lifetime. The standard way of describing changes to a (record-oriented) file is via the operations Add, Change, and Delete. (Alternative names are, respectively, Insert, Modify, and Remove.) As suggested by their names, Add causes a new record to be placed into the file, Change causes an existing record in the file to be modified, and Delete causes a record in the file to be removed therefrom. One can express a Change operation as a Delete followed by an Add, so it is not absolutely necessary to consider Change at all.
Files come in two main varieties with respect to the ordering of their records: ordered and unordered. By the former is meant a file whose records are ordered according to some well-defined criterion relating to the values in their fields. (For example, a file having records describing persons (including their names) could be ordered so as to be in alphabetical order by name; a file whose records describe events of some sort (including the time at which each one occurred) could be ordered according to when they occurred.) Certain benefits accrue from ordering the records in a file with respect to a certain field, one of which is that searches based upon that field can be carried out more quickly.
A file whose records are not ordered by any particular criterion is sometimes referred to as a pile or a heap. (We will refrain from using the latter term because it has a widely-used alternative meaning in computer science.) In a sense, such a file can be viewed logically as a set of records rather than as a sequence of records.
So we can distinguish files across two dimensions with respect to their records ---fixed- vs. variable-length and ordered vs. unordered (pile)--- giving us four different varieties of files. Let's explore, with respect to each of the four varieties of files, how we could carry out each of the insert, delete, and change operations.
Pile file with fixed-length records: There are two standard approaches, one in which the file is maintained in a perpetually compacted state, the other involving the use of tombstones (i.e., "empty slots" of storage space formerly occupied by records) and periodic compaction. (A file is in a compacted state if there are no empty slots between adjacent records.)
An add operation is carried out by writing the new record at the end of the file (i.e., immediately after the last record).
Over time, as more and more operations are performed upon the file, some of them deletions, the number of tombstones scattered around the file will keep increasing. At some point, we will want to reclaim that "wasted" space, which we can do via file compaction, which basically means making a new copy of the file, but with all the tombstones removed from it.
As for the Change operation, under either approach we would probably carry it out "in place", meaning that the record would be modified and then written back into the same place.
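The two approaches can be sketched in Python, with a bytearray standing in for the file. The sketch assumes that, under the perpetually compacted approach, a delete moves the file's last record into the vacated slot, and that, under the tombstone approach, a delete merely marks the slot as empty. The record length, the tombstone marker, and all function names are hypothetical.

```python
REC_LEN = 8                    # assumed fixed record length, in bytes
TOMBSTONE = b"\x00" * REC_LEN  # assumed marker for an empty slot

def add(f, rec):
    """Add: write the new record at the end of the file."""
    if len(rec) != REC_LEN:
        raise ValueError("record has the wrong length")
    f.extend(rec)

def delete_compacted(f, i):
    """Delete record i by moving the last record into its slot."""
    last = len(f) - REC_LEN
    f[i * REC_LEN:(i + 1) * REC_LEN] = f[last:]
    del f[last:]  # the file shrinks by one record

def delete_tombstone(f, i):
    """Delete record i by marking its slot as empty."""
    f[i * REC_LEN:(i + 1) * REC_LEN] = TOMBSTONE

def compact(f):
    """Return a new copy of the file with all tombstones removed."""
    slots = [bytes(f[p:p + REC_LEN]) for p in range(0, len(f), REC_LEN)]
    return bytearray(b"".join(s for s in slots if s != TOMBSTONE))

f = bytearray()
for rec in (b"rec0....", b"rec1....", b"rec2....", b"rec3...."):
    add(f, rec)
delete_compacted(f, 1)
print(bytes(f))  # b'rec0....rec3....rec2....'
```

Moving the last record changes the physical order of the records, which is harmless in a pile but would not be in an ordered file.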
Pile file with variable-length records: Things get more interesting with variable-length records. To perform a delete, for example, one cannot simply move the file's last record into the space occupied by the record to be deleted, for the simple reason that it might not fit! Or, even if the last record does indeed fit into the space occupied by the record being deleted, what do we do with any "extra" space that is left over?
These observations tell us that the strategy of keeping the file "perpetually compacted", as described above for the case of fixed-length records, is probably not practical for a file of variable-length records. On the other hand, employing tombstones and periodic compaction would work here just as in pile files with fixed-length records.
When applied to a large file, periodic compaction is not only an expensive operation, but it may also require the file to be taken "offline" while it is being compacted, a situation that may not be acceptable. Also, the quantity of space occupied by tombstones may tend to become unacceptably large in the time leading up to a compaction.
In order to slow (eliminate?) the growth of space wasted by tombstones, we could carry out an add by writing the new record in place of one of the tombstones, rather than at the end of the file. The main difficulty in this approach is to find a suitable tombstone to be replaced by the new record! The minimum requirement would be that the tombstone's length is at least as large as the record's. Various more specific criteria have been suggested, however, including Best-fit and Worst-fit, the former of which says that the tombstone to be replaced should be the smallest one large enough to hold the new record. The latter says that the chosen tombstone should be the largest one! (Another strategy is First-fit, which says to use any tombstone that is large enough, which is to say the "first" suitable one encountered while searching for one.)
Your intuition may tell you that Best-fit makes the most sense, but before you draw that conclusion, consider what happens to the "extra" storage space inside a tombstone that is larger than the new record written in its place. Does that space itself become a new tombstone? If so, Best-fit will, over time, have a tendency to cause many small tombstones to exist. On the other hand, Worst-fit will tend to result in fewer and larger tombstones. The fewer tombstones there are, the easier it is to manage them. Also, a small tombstone tends to be more difficult to utilize, simply because it is not large enough to accommodate most records.
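To make the three criteria concrete, here is a Python sketch that selects a tombstone from a list of (location, length) pairs and then records the residual tombstone, if any. The list representation, the sample values, and the function names are assumptions of the sketch.

```python
def first_fit(tombstones, need):
    """The first tombstone encountered that is large enough."""
    for loc, length in tombstones:
        if length >= need:
            return (loc, length)
    return None

def best_fit(tombstones, need):
    """The smallest tombstone that is large enough."""
    fits = [t for t in tombstones if t[1] >= need]
    return min(fits, key=lambda t: t[1]) if fits else None

def worst_fit(tombstones, need):
    """The largest tombstone (provided that it is large enough)."""
    fits = [t for t in tombstones if t[1] >= need]
    return max(fits, key=lambda t: t[1]) if fits else None

def place(tombstones, chosen, need):
    """Use the chosen tombstone; any leftover becomes a smaller tombstone."""
    loc, length = chosen
    rest = [t for t in tombstones if t != chosen]
    if length > need:
        rest.append((loc + need, length - need))
    return rest

tombstones = [(72, 42), (503, 9), (408, 44)]  # hypothetical (location, length)
print(place(tombstones, best_fit(tombstones, 40), 40))
print(place(tombstones, worst_fit(tombstones, 40), 40))
```

In this run, Best-fit leaves a residual tombstone of 2 bytes and Worst-fit one of 4 bytes, both too small to be of much use, which illustrates why neither criterion is a clear winner.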
The next natural question is to ask how we go about managing the tombstones in a file so as to make efficient the task of finding a suitable tombstone when a record is added and the task of recording the existence of a new tombstone when a record is deleted (or when a record is written in place of a larger tombstone, leaving a smaller residual tombstone).
One possible approach is to maintain a linked list of tombstones, with the links stored within the tombstones themselves. Another approach would be to store the addresses (and lengths) of the tombstones in a separate table, thereby making it unnecessary to go hopping from place to place within the file (causing lots of costly disk accesses) simply to follow the links until a suitable tombstone is found. Of course, that table itself would have to be stored in the file (in what is sometimes known as a header record, which is a generic term often used to describe meta-data stored at the beginning of a file) or in a separate file.
Maintenance becomes much more difficult with ordered files, because we lose the freedom to put records wherever they "fit".
Ordered files with fixed-length records:
Here is perhaps the most straightforward approach:
Note: Keeping the file perpetually compacted would be inefficient, as compacting the file each time a record is deleted (meaning that all records following the deleted record must be shifted toward the beginning of the file) would necessitate that, on average, half the file be rewritten. Moreover, this approach would result in the Add operation being expensive, too, because, to make room for a new record, on average half of the file would have to be rewritten. End of note.
In the unusual case of a file in which deletes occur more often than adds (which would result in lots of tombstones lying around), one can use occasional file compaction to get rid of the tombstones.
An alternative approach uses an Overflow File:
Ordered file with variable-length records: The approaches described above apply even when records are not of fixed length. It is slightly more complicated, however, due to the fact that the existence of a tombstone near to where a record is to be added does not guarantee that "local" shifting will work, because the tombstone may be smaller than the new record. Hence, depending upon the length of the record to be added, it may be necessary to find two or more nearby tombstones before carrying out a shift.
The previous section began by suggesting that maintaining an ordered file is more difficult than maintaining a pile file because in the former, unlike the latter, there are strict constraints upon where records can be placed, relative to one another. Specifically, the i-th record must come before the (i+1)-st record. That is, if the i-th record occupies bytes k through m of the file, record i+1 must begin at byte m+1 (or later, in the case that there is unused space (e.g., one or more tombstones or a record separator character) in between the two records).
But this assumes that records in an ordered file must appear, physically, in an order consistent with their logical order. Must we really make such an assumption? No! The idea of divorcing physical ordering from logical ordering is one with which you already should be familiar, in the form of pointer/reference-based implementations of the list and tree concepts. In a (singly-linked) list, each "node" contains a data item together with a reference/pointer (often called next) to the (node containing the) item that logically follows it. The reason for having the next pointer is precisely because the physical location of a node's logical successor is totally independent of its own location.
We can apply the same idea to an ordered file by organizing it as a linked list. For example, the data associated with each node in the list could be a single record. The reference/pointer in each node could take the form of a byte offset, perhaps, or a physical record number (in the case that such a number can be mapped easily into a byte offset, such as when the records are all of the same length).
Rather than (or in addition to) storing a next pointer in each node, one could maintain a table (or "record index") that provides a logical-to-physical mapping from logical record positions to their corresponding physical locations (as byte offsets from the beginning of the file, perhaps). To illustrate, consider this picture:
     The file          The record index           The tombstone index
     --------          ----------------           -------------------
                         recLoc   recLen             tsLoc   tsLen
     +------+          +--------+--------+         +------+-------+
   0 |      |        0 |   347  |    61  |       0 |   72 |    42 |
     | Rec3 |          +--------+--------+         +------+-------+
     |      |        1 |   114  |    89  |       1 |  503 |     9 |
     +------+          +--------+--------+         +------+-------+
  72 |      |        2 |   452  |    51  |       2 |  408 |    44 |
     |      |          +--------+--------+         +------+-------+
     |      |        3 |     0  |    72  |
     +------+          +--------+--------+
 114 |      |        4 |   203  |   144  |
     | Rec1 |          +--------+--------+
     |      |
     +------+
 203 |      |
     | Rec4 |
     |      |
     +------+
 347 |      |
     | Rec0 |
     |      |
     +------+
 408 |      |
     |      |
     |      |
     +------+
 452 |      |
     | Rec2 |
     |      |
     +------+
 503 |      |
     |      |
     |      |
     +------+
 512
This is meant to depict a situation in which the file
(which has had 512 bytes allocated to it) has five records
which happen to be stored beginning at the bytes with offsets
347, 114, 452, 0, and 203, respectively.
(By offset we mean the location relative to the beginning of the
file. Recall that we are viewing a file simply as a sequence of bytes.)
The picture indicates that the record lengths are stored in the record index, too; we could, instead, store that information in a byte (or two, depending upon the maximum record length) prefixing each record.
Depending upon how insertions and deletions are carried out, we may also want to maintain a table (as shown) (or a linked list) indicating the locations and sizes of tombstones (i.e., chunks of space within the confines of the file that are logically empty and hence into which records can be placed/shifted during the insertion of a new record).
Following the scenario just described, the obvious algorithm for processing the records in the file, in logical order from first to last, is as follows:
f.open(input);                    // open the file for input
// now read each record and process it
for (int i = 0; i != recLoc.length; i++) {
   f.seek(recLoc[i]);             // seek to the byte where record #i begins
   record = f.read(recLen[i]);    // read the record into a variable
   process(record);               // process the record
}
f.close();                        // close the file
Exactly what we mean by "process a record" is not important; simply assume that it entails examining the contents of the record and possibly carrying out some computation that depends upon those contents. Also assume that the application requires that the records be examined in their logical order.
The seek operation is to be understood as positioning the file pointer at the byte with the specified offset. It does not directly translate into the occurrence of a disk seek, but, in combination with whatever I/O operation is carried out next (most likely, a read or write), it could very well lead to a disk seek. The read operation transfers the specified number of bytes, beginning at the byte where the file pointer is positioned, into an array of bytes.
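For readers who would like to run the algorithm, here is a Python rendering that uses an in-memory stream in place of a disk file. The 64-byte file, its three records, and the function name read_in_logical_order are all hypothetical; only the seek-then-read pattern comes from the algorithm above.

```python
import io

# A hypothetical 64-byte file holding three records out of logical order.
data = bytearray(b"." * 64)
rec_loc = [40, 0, 20]  # rec_loc[i]: byte offset of logical record i
rec_len = [5, 6, 4]    # rec_len[i]: length in bytes of logical record i
for i, payload in enumerate([b"alpha", b"bravo!", b"char"]):
    data[rec_loc[i]:rec_loc[i] + rec_len[i]] = payload

def read_in_logical_order(f, rec_loc, rec_len):
    """Yield the records in logical order: one seek and one read each."""
    for loc, length in zip(rec_loc, rec_len):
        f.seek(loc)           # position the file pointer at the record
        yield f.read(length)  # transfer exactly that record's bytes

f = io.BytesIO(bytes(data))
records = list(read_in_logical_order(f, rec_loc, rec_len))
f.close()
print(records)  # [b'alpha', b'bravo!', b'char']
```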
Assuming that the physical locations of the records within the file bear no particular relationship to their logical positions within the file (which, even if the file began with its logical and physical orderings being the same, is not an outlandish scenario, assuming that the file's current state is the result of thousands or millions of record insertions and deletions), the number of disk block reads occurring during execution of this algorithm will approach the number of records in the file.
As an example, suppose that records 35, 57, 145, 211, 214, and 298 all happen to reside in the same disk block. (These numbers refer to the records' logical positions within the file.) Then, on the 36-th iteration of the loop, that disk block will be read in. (Remember that we number the records starting at zero.) On the 58-th iteration, the same disk block will be read in again. And again on the 146-th iteration. And so on and so forth. Now, it may be that, on the 215-th iteration, the disk block will still be in an I/O buffer in RAM (having been put there during the 212-th iteration) and hence may not have to be read again during that iteration. But still, this disk block will be accessed from disk five separate times. And each time, the vast majority of data in the block will be ignored.
What we have done, then, by completely divorcing the logical positions of the records from their respective physical positions, is to make the insert and delete operations more efficient at the expense of making the read-file-sequentially operation much less efficient. Can we adjust our approach to arrive at a happy medium?
The short answer is yes. The solution is to coarsen the level of granularity of the nodes in the linked list of records. That is, rather than having each node in the list correspond to a single record, make it instead correspond to a collection of several logically consecutive records. We refer to such a node as a bucket.
To illustrate the idea of managing a file by treating it as a collection of buckets, we use an example. To simplify the management task, we choose a uniform bucket size of 512 bytes. Hence, the first 512 bytes (bytes 0 through 511) of storage space allocated to the file hold one bucket, the next 512 bytes hold another, and so on and so forth. Note that the physical order of the buckets does not necessarily correspond to their logical order. For example, we may have this situation:
0           512         1024        1536        2048
+-----------+-----------+-----------+-----------+-----------+
| bucket #2 | bucket #1 | bucket #4 | bucket #0 | bucket #3 |
+-----------+-----------+-----------+-----------+-----------+
Here, the physical location of the bucket containing the first several records in the file (bucket #0) is at offset 1536, which is preceded by three buckets that, logically, come later. The corresponding logical-to-physical bucket map, which we call log2PhysBuckMap, could be as follows:
                    0   1   2   3   4
                  +---+---+---+---+---+
 log2PhysBuckMap  | 3 | 1 | 0 | 4 | 2 |
                  +---+---+---+---+---+
The physical bucket numbers are stored as entries in the table rather
than the byte offsets. (To compute the byte offset from the physical
bucket number, simply multiply it by the length of a bucket, in this
case 512.)
With this approach, the algorithm for processing all the records in the file becomes
f.open(input);                          // open the file for input
// read each bucket and process the records it holds
for (int i = 0; i != log2PhysBuckMap.length; i++) {
   f.seek(512 * log2PhysBuckMap[i]);    // seek to beginning of bucket #i
   bucket = new Bucket(f.read(512));    // read contents of bucket
   for (int j = 0; j != bucket.numRecs(); j++) {
      record = bucket.getRec(j);        // read j-th record from bucket
      process(record);
   }
}
f.close();                              // close the file
This (pseudo-)code assumes that there is a class called Bucket having a constructor that, given the (raw) contents of a bucket, creates an object (representing a bucket) that has certain capabilities, among them to report how many records it contains and to allow those records to be retrieved. (The assumption here is that the records within a bucket are numbered (starting with zero, say) according to their logical order and that the method call getRec(j) returns the jth record in the bucket.) Of course, for these operations to be implementable requires that a bucket's raw contents are such that it is possible to find boundaries between records, etc.
Compare the number of disk seeks carried out in executing this algorithm and the algorithm given earlier. In this one, the number of seeks corresponds to, at worst, the number of buckets comprising the file (assuming, of course, that each bucket is stored in a contiguous region of the disk so that it can be read in full after a single disk seek). In executing the earlier algorithm, the number of seeks corresponded to the number of records. Assuming that buckets contain, on average, say fifty records, this reduces the number of disk seeks by a factor of about fifty. That is a significant improvement.
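The bucket-based algorithm can also be run in Python against an in-memory stream. The bucket size and the logical-to-physical map are those of the example; the bucket layout (a count byte, then one length byte per record, then the records), the two helper functions, and the record contents are assumptions made so that the sketch is self-contained.

```python
import io

BUCKET_SIZE = 512
log2phys = [3, 1, 0, 4, 2]  # logical-to-physical bucket map from the example

def make_bucket(records):
    """Assumed layout: count byte, one length byte per record, the records."""
    body = (bytes([len(records)]) + bytes(len(r) for r in records)
            + b"".join(records))
    return body.ljust(BUCKET_SIZE, b"\x00")  # the rest is free space

def records_in(bucket):
    """Yield the records held in a bucket, in logical order."""
    k = bucket[0]
    pos = 1 + k  # the records begin right after the k length bytes
    for length in bucket[1:1 + k]:
        yield bucket[pos:pos + length]
        pos += length

# Build a file in which logical bucket i holds records "i.0" and "i.1".
phys = [None] * len(log2phys)
for logical, physical in enumerate(log2phys):
    phys[physical] = make_bucket([f"{logical}.{j}".encode() for j in range(2)])
f = io.BytesIO(b"".join(phys))

seeks, seen = 0, []
for physical in log2phys:
    f.seek(BUCKET_SIZE * physical)  # one seek per bucket ...
    seeks += 1
    seen.extend(records_in(f.read(BUCKET_SIZE)))  # ... serves all its records
f.close()
print(seeks, "seeks for", len(seen), "records")
```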
The advantages of organizing a file as a collection of buckets carry over to the add, change, and delete operations. More stuff to be added here.
As noted above, in order for a program to be able to take the 512 (or whatever number of) bytes comprising a bucket and to extract individual records from it, there must be some means of determining where, within the bucket, each record is stored. In accord with the ideas presented earlier in this document, two possibilities are to use delimiters/separators or length indicators. Because we prefer not to impose any restrictions upon what data might appear in a record, and because the use of delimiters requires that there be some byte value (or sequence of byte values) that cannot "legally" occur within a record (and hence can be used as a delimiter), we here focus upon using length indicators.
Perhaps the most obvious way to lay out the records in a bucket is depicted below. Here we have assumed that a single byte is sufficient for storing the number of records in the bucket as well as each record's length. As a byte has 256 possible values (and hence naturally can represent any integer in the range 0 to 255), this means that we are restricting the number of records in a bucket to be at most 255 and, likewise, the length of any record. If either of these assumptions is too restrictive, we can use two bytes instead (or whatever number is sufficient for our purposes). Another possibility, which we shall not explore here, is to use a variable-length encoding scheme for these values.
+-+-+-+-+-----+-+---------+---------+---------+-----+-------------+------------+
| | | | | ... | | 0th rec | 1st rec | 2nd rec | ... | (k-1)st rec | free space |
+-+-+-+-+-----+-+---------+---------+---------+-----+-------------+------------+
 ^ ^ ^ ^       ^
 | | | |  ...  |
 | | | |  ...  length of (k-1)-st record in bucket (1 byte)
 . . . .  ...
 | | | length of 2nd record in bucket (1 byte)
 | | length of 1st record in bucket (1 byte)
 | length of 0th record in bucket (1 byte)
 number of records (k) in bucket (1 byte)
Under this scenario, the first byte of the bucket is used for storing the number, k, of records contained in the bucket, the following k bytes are used for storing the lengths of those records, and the remaining bytes are used for storing the records themselves, in order and with no space between them. In all likelihood, there will be some free space at the end of the bucket. In effect, each bucket has a single tombstone.
Suppose that a new record, of length r, is to be inserted in the j-th position in such a bucket. (That is, the new record is to be placed into the bucket so that it is the j-th record and what had been the j-th record now becomes the (j+1)-st, etc., etc.) Such an operation cannot be carried out unless there are at least r+1 bytes of free space in the bucket, r for holding the new record and one for holding its length.
Consider what must be done to carry out this operation. First, the bucket must be read into RAM from secondary storage (unless, of course, it is already there). Then, manipulating the contents of that chunk of RAM in order to effect the insertion, we can
Having modified the bucket in RAM, the next step would be to write it back into the file, replacing its old contents.
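One plausible way to carry out the in-RAM part of the insertion is sketched below in Python, with the bucket held in a bytearray. It rebuilds the occupied portion of the bucket rather than shifting bytes one region at a time, which has the same net effect. The function name insert and the error handling are assumptions of the sketch.

```python
def insert(b, j, rec):
    """Insert rec as the j-th record of bucket b (a bytearray), in place.

    Assumed layout: b[0] holds k, b[1..k] hold the record lengths, and
    the records follow immediately, in order, with free space at the end.
    """
    k, r = b[0], len(rec)
    if not 0 <= j <= k or r > 255 or k >= 255:
        raise ValueError("bad position, or value too large for one byte")
    used = 1 + k + sum(b[1:1 + k])  # header bytes plus record bytes
    if len(b) - used < r + 1:       # r bytes for the record, 1 for its length
        raise ValueError("not enough free space in bucket")
    records = bytes(b[1 + k:used])  # all k records, concatenated
    split = sum(b[1:1 + j])         # total length of records 0..j-1
    lengths = bytes(b[1:1 + j]) + bytes([r]) + bytes(b[1 + j:1 + k])
    b[0] = k + 1
    # Write back the k+1 lengths followed by the k+1 records.
    b[1:used + r + 1] = lengths + records[:split] + rec + records[split:]

bucket = bytearray(32)  # an empty 32-byte bucket (k = 0)
insert(bucket, 0, b"BBB")
insert(bucket, 0, b"AA")
insert(bucket, 2, b"CCCC")
print(bytes(bucket[:13]))  # the header (3, 2, 3, 4), then AABBBCCCC
```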
The reader should have no trouble figuring out how to carry out a delete operation.
Now consider how to retrieve the j-th record from a bucket. As with a record insertion, the first thing to do is to get the bucket into RAM. Suppose that we do that, and we view it as an array b[] of bytes. We need to calculate beg and end such that b[beg..end] is the segment of b[] that holds record j. Having done that, we can make a copy of those bytes and provide them to the client who requested the record. Here's how to calculate beg and end:
k := b[0];                        // k is # of records stored in bucket
m := b[1] + b[2] + ... + b[j];    // m is sum of lengths of records 0..j-1
beg := k + 1 + m;
end := beg + b[j+1] - 1;
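The calculation of beg and end translates directly into Python. The function name get_rec and the sample bucket are assumptions of the sketch.

```python
def get_rec(b, j):
    """Return a copy of the j-th record (counting from zero) of bucket b."""
    k = b[0]                  # k is the number of records in the bucket
    if not 0 <= j < k:
        raise IndexError("no such record in bucket")
    m = sum(b[1:j + 1])       # b[1] + ... + b[j]: lengths of records 0..j-1
    beg = k + 1 + m           # skip the k+1 header bytes and earlier records
    end = beg + b[j + 1] - 1  # b[j+1] is the length of record j
    return bytes(b[beg:end + 1])

# A 32-byte bucket holding the records AA, BBB, and CCCC.
bucket = bytes([3, 2, 3, 4]) + b"AABBBCCCC" + bytes(19)
print(get_rec(bucket, 2))  # b'CCCC'
```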
Another, perhaps less obvious, way to organize the bucket is depicted here:
+---------+---------+---------+-----+------------+------------+-+-----+-+-+-+-+
| 0th rec | 1st rec | 2nd rec | ... |(k-1)st rec | free space | | ... | | | | |
+---------+---------+---------+-----+------------+------------+-+-----+-+-+-+-+
                                                               ^       ^ ^ ^ ^
                                                               |  ...  | | | |
                    length of (k-1)-st record in bucket (1 byte)  ...  | | | |
                                                                  ...  . . . .
                                 length of 2nd record in bucket (1 byte) | | |
                                   length of 1st record in bucket (1 byte) | |
                                     length of 0th record in bucket (1 byte) |
                                      number of records (k) in bucket (1 byte)
Under this scenario, the records occupying the bucket are stored at the "left end" of the bucket, contiguously, in an order corresponding to their logical order. So as to be able to find the boundaries between adjacent records, their lengths are stored at the "right end" of the bucket, in reverse order. (The very last thing stored in the bucket is the number of records it holds.) The advantage of this scenario, in comparison to the one shown earlier, is that less data need to be shifted around to carry out insertions and deletions. Because this data shifting occurs in RAM (as opposed to much slower secondary storage), it is not all that important.
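A Python sketch of this second layout follows, again with the bucket held in a bytearray. The function names and the error handling are assumptions; the layout itself (records at the left end, the count in the last byte, and the lengths just before it in reverse order) is the one described above.

```python
def read_lengths(b):
    """Return (k, lengths): the last byte holds k, and the length of
    record i is stored i + 2 bytes from the right end of the bucket."""
    n = len(b)
    k = b[n - 1]
    return k, [b[n - 2 - i] for i in range(k)]

def get_rec(b, j):
    """Return a copy of the j-th record of bucket b."""
    k, lengths = read_lengths(b)
    if not 0 <= j < k:
        raise IndexError("no such record in bucket")
    beg = sum(lengths[:j])  # records start at byte 0; there is no header
    return bytes(b[beg:beg + lengths[j]])

def insert(b, j, rec):
    """Insert rec as the j-th record of bucket b, in place."""
    n, r = len(b), len(rec)
    k, lengths = read_lengths(b)
    used = sum(lengths)
    if not 0 <= j <= k or r > 255 or k >= 255:
        raise ValueError("bad position, or value too large for one byte")
    if n - 1 - k - used < r + 1:
        raise ValueError("not enough free space in bucket")
    beg = sum(lengths[:j])
    b[beg + r:used + r] = b[beg:used]  # shift records j..k-1 right by r
    b[beg:beg + r] = rec
    lengths.insert(j, r)
    b[n - 1] = k + 1
    for i, length in enumerate(lengths):  # rewrite the lengths, reversed
        b[n - 2 - i] = length

bucket = bytearray(32)  # an empty 32-byte bucket (k = 0)
insert(bucket, 0, b"BBB")
insert(bucket, 0, b"AA")
insert(bucket, 2, b"CCCC")
print(bytes(bucket[:9]), list(bucket[-4:]))
```

In this layout, records that precede the insertion point are never moved, whereas in the earlier layout every record shifts by at least one byte to make room for the new length byte.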
To be continued ....