Canonical Huffman Coding

In standard Huffman coding, the compressor builds a Huffman Tree based upon the counts/frequencies of the symbols occurring in the file-to-be-compressed and then assigns to each symbol the codeword implied by the path from the root to the leaf node associated to that symbol. For example, if we adopt the convention that an edge from a node to its left (respectively, right) child is labeled 0 (resp., 1), then if the path from the root to a particular leaf is left, left, right, left, right, left, then the codeword assigned to the associated symbol will be 001010.

Canonical Huffman Coding recognizes that the essential information provided by a Huffman Tree is the mapping from symbols to their codeword lengths; the particular bit patterns of the codewords are secondary and can be computed independently of the tree. Indeed, in Canonical Huffman Coding the set of codewords that is employed depends solely upon the distribution of codeword lengths. This set is chosen so as to satisfy not only the familiar prefix-freeness property (i.e., no codeword is the prefix of any other) but also this property:

Longer-is-Lesser property:

If x and y are codewords, with |x| > |y|, then x' ≺ y, where x' is the prefix of x of length |y|.

Using standard notation, |z| denotes the length of z and ≺ denotes the "lexicographically less than" relation. Lexicographic ordering is essentially the same as alphabetic ordering.

With respect to bit strings u and v, to say that u ≺ v is to say that either u is a proper prefix of v or else the leftmost bit in which they differ is a 0 in u and a 1 in v. For example, 100101 ≺ 10100 because of the bits in the 3rd position (counting from one at the left). (For essentially the same reason, the word "carwash" precedes "cattle" in the dictionary.)
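To make the definition concrete in code: if bit strings are represented as Python strings of '0' and '1' characters (an assumption of this sketch, not something prescribed above), then Python's built-in string comparison coincides exactly with ≺:

def lex_less(u, v):
    # For strings over the alphabet {'0','1'}, u < v holds exactly when
    # u is a proper prefix of v or the leftmost differing bit is a 0 in
    # u and a 1 in v, precisely the relation described above.
    return u < v

print(lex_less("100101", "10100"))   # True: they differ in the 3rd bit
print(lex_less("10", "10010"))       # True: "10" is a proper prefix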

Now, if A and B are leaves in a Huffman Tree (in which edges to left (respectively, right) children are labeled 0 (resp., 1)) with corresponding codewords x and y (i.e., the labels on the edges along the path from the root to A (respectively, B) spell out x (resp., y)) then x ≺ y is equivalent to A being to the left of B in the tree.

Thus, in order for the set of codewords induced by a Huffman Tree to satisfy the Longer-is-Lesser property, the tree must have this property:

Lefter-is-Deeper property:

If A and B are leaves and A is to the left of B, then depthOf(A) ≥ depthOf(B). (The depth of a node is its distance from the root.)

But we can take any Huffman Tree and, by a judicious sequence of swaps of subtrees rooted at nodes of the same depth, arrive at another Huffman Tree having the Lefter-is-Deeper property and having a set of codewords whose length distribution is the same as that in the original tree.

Even though such a Huffman Tree transformation process is possible, it's not necessary to do it that way. A better approach is to take the codeword length distribution of the original tree and to build a Lefter-is-Deeper tree directly therefrom. Indeed, for any given distribution of lengths, there is only one possible Lefter-is-Deeper tree structure. (A code sketch following the example below illustrates this direct construction.)

For example, suppose that the symbol frequencies led us to build one of the many Huffman Trees in which the codeword length distribution was as on the left below. Then the corresponding (unique) Lefter-is-Deeper tree (where each leaf's depth is explicitly indicated) is in the middle, and the resulting set of codewords (listed in lexicographically increasing, and thus length-descending, order) is to the right:

Codeword Length Distribution
Length   Number
------   ------
   6        2
   5        1
   4        3
   3        4
   2        1
Lefter-is-Deeper Huffman Tree

                            *
                          /   \ 
                         /     \ 
                        /       \ 
                       /         \ 
                      /           \ 
                     /             \ 
                    /               \
                   /                 \
                  /                   \
                 /                     \
                *                       *
              /   \                   /   \
             /     \                 /     \
            /       \               /       \
           /         \             /         \
          *           *           *           *
         / \         / \         / \          2
        /   \       /   \       /   \
       /     \     /     \     /     \
      *       *   *       *   *       *
     / \     / \  3       3   3       3
    *   *   *   *
   / \  4   4   4
  *   *
 / \  5
*   *   
6   6
Codewords
000000
000001
00001
0001
0010
0011
010
011
100
101
11
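
Here is the direct construction just promised, as a minimal sketch in Python (the helper name canonical_codewords is invented for illustration): processing the lengths from longest to shortest produces the codewords in lexicographically increasing order, with no tree in sight:

def canonical_codewords(count):
    # count maps each codeword length to the number of codewords of
    # that length, e.g., {6: 2, 5: 1, 4: 3, 3: 4, 2: 1} for the example.
    codes = []
    value, cur_len = 0, max(count)
    for length in sorted(count, reverse=True):
        for _ in range(count[length]):
            value >>= cur_len - length   # shorter codeword: drop low bits
            cur_len = length
            codes.append(format(value, "0{}b".format(length)))
            value += 1                   # next codeword of this length
    return codes

print(canonical_codewords({6: 2, 5: 1, 4: 3, 3: 4, 2: 1}))
# ['000000', '000001', '00001', '0001', '0010', '0011',
#  '010', '011', '100', '101', '11']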

Significantly, the Longer-is-Lesser set of codewords that arises from a Lefter-is-Deeper tree has some interesting properties when you interpret each codeword as a natural number (in accord with the binary numeral system).

Longer's-Prefix-is-Lesser property:

Let x and y be codewords, with |x| > |y|, and let x' be the prefix of x of length |y|. Then #(x') < #(y), where # is the function that maps bit strings into their numerical equivalents according to the standard binary numeral system. (E.g., #(1001) = 9, #(00110) = 6.)

Consecutive-Values property:

For any particular length, the codewords of that length represent a consecutive range of natural numbers.

In our example, the codewords of length four represent the range 1..3 and those of length three represent 2..5.

Half-of-Successor property:

For all k less than the maximum length among codewords, the smallest codeword of length k has value ⌈(m+1)/2⌉, where m is the value of the largest codeword of length k+1. (In our example, take k = 3: the largest codeword of length 4 is 0011, so m = 3, and the smallest codeword of length 3, namely 010, indeed has value ⌈(3+1)/2⌉ = 2.)
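
Together with the fact that the smallest codeword of maximum length is all zeros (the leftmost, deepest leaf is reached by an all-left path), the Half-of-Successor property yields a recurrence by which the smallest codeword value of every length can be computed from the length distribution alone. A small sketch of this (my own, in Python, using the example's distribution):

def smallest_codewords(count, min_len, max_len):
    # count maps each codeword length to how many codewords have that
    # length; the smallest codeword of maximal length has value 0.
    minCW = {max_len: 0}
    for k in range(max_len - 1, min_len - 1, -1):
        m = minCW[k + 1] + count.get(k + 1, 0) - 1   # largest value at k+1
        minCW[k] = (m + 2) // 2                      # i.e., ceil((m+1)/2)
    return minCW

print(smallest_codewords({6: 2, 5: 1, 4: 3, 3: 4, 2: 1}, 2, 6))
# {6: 0, 5: 1, 4: 1, 3: 2, 2: 3}  (values of 000000, 00001, 0001, 010, 11)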

All this is quite interesting, of course, but is there any advantage in employing a set of codewords that arises from a Lefter-is-Deeper tree? Answer: Yes.

We describe them in the sections that follow.


Decompression

The biggest gains that come from using Canonical Huffman Coding are in performing decompression, so we look at those first.

Because of the constrained nature of the codeword set (in particular, the Longer's-Prefix-is-Lesser and Consecutive-Values properties), it turns out that, in place of storing an explicit representation of the Huffman Tree, all that the decompressor needs are two arrays, minCW[] and CW2Symbol[][]. For each relevant value of i, minCW[i] contains the (numeric) value of the lexicographically smallest codeword of length i. For each pair of relevant values of i and j, CW2Symbol[i][j] is the native code of the j-th symbol (counting from zero) having a codeword of length i.

For the example tree above, these arrays would look like this, where each element of minCW[] shows not only the numeric value but also (in parentheses) the corresponding codeword.

       minCW                 CW2Symbol
       -----                 ---------
   +-----------+            +---+
 2 | 3 (11)    |          2 |'e'|
   +-----------+            +---+---+---+---+
 3 | 2 (010)   |          3 |'c'|'i'|'a'|'f'|
   +-----------+            +---+---+---+---+
 4 | 1 (0001)  |          4 |'l'|'o'|'j'|
   +-----------+            +---+---+---+
 5 | 1 (00001) |          5 |'b'|
   +-----------+            +---+
 6 | 0 (000000)|          6 |'k'|'h'|
   +-----------+            +---+---+
                              0   1   2   3

Of course, the decompressor must make use of the metadata at the beginning of the compressed file to construct these arrays. (How that is accomplished is addressed later.) Having done that, its job is basically that described by this high-level algorithm:

while (hasMoreBits()) {
   BitString x := nextBit();
   while (!isCodeword(x)) {
      x := x · nextBit();   // append next bit onto rear of x
   }
   emit nativeCodeOf(x);  // emit the native code of the 
}                         // symbol whose codeword is x

What is not obvious is how to implement isCodeword() and nativeCodeOf() making use of nothing but the data stored in arrays minCW[] and CW2Symbol[][].

The solutions to these two problems rely, respectively, upon the guarantees that the set of codewords possesses the Longer's-Prefix-is-Lesser and Consecutive-Values properties!

To illustrate how we can tell whether the value of x (in the algorithm above) is a codeword, suppose that z is a codeword and let z_k be the prefix of z of length k, for all k in the range 1..|z|. By the Longer's-Prefix-is-Lesser property of the codewords, we have that #(z_i) < minCW[i] for all i < |z|. Trivially, we also have that #(z) ≥ minCW[|z|]. That is, every proper prefix of z is lexicographically less than the smallest codeword of its length, but z itself is (obviously) lexicographically greater than or equal to the smallest codeword of its length. Hence, as bits are appended onto the rear of x, the condition #(x) ≥ minCW[|x|] first becomes true precisely when x is a codeword; that is, isCodeword(x) can be implemented as that very test.

As for nativeCodeOf(x): by the Consecutive-Values property, the codewords of length |x| represent a consecutive range of values beginning at minCW[|x|], so #(x) − minCW[|x|] is the rank (counting from zero) of x among the codewords of its length. Hence nativeCodeOf(x) is simply CW2Symbol[|x|][#(x) − minCW[|x|]].
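
Putting these two observations together, the high-level algorithm above can be refined into the following sketch (my own; representing the bit source as a Python iterable and the two arrays as dictionaries shaped like the figure above is an assumption of the sketch, not part of the scheme itself):

def decode(bits, minCW, CW2Symbol):
    # bits: the payload bits (each 0 or 1) of the compressed stream,
    # which is a concatenation of codewords.
    value, length = 0, 0
    for b in bits:
        value = (value << 1) | b             # x := x . nextBit()
        length += 1
        # isCodeword(x) is the test #(x) >= minCW[|x|]:
        if length in minCW and value >= minCW[length]:
            # nativeCodeOf(x): x's rank among codewords of its length
            yield CW2Symbol[length][value - minCW[length]]
            value, length = 0, 0             # start the next codeword

minCW = {2: 3, 3: 2, 4: 1, 5: 1, 6: 0}
CW2Symbol = {2: ['e'], 3: ['c', 'i', 'a', 'f'], 4: ['l', 'o', 'j'],
             5: ['b'], 6: ['k', 'h']}
# The bit sequence 11 010 0001 decodes to 'e', 'c', 'l':
print(list(decode([1, 1, 0, 1, 0, 0, 0, 0, 1], minCW, CW2Symbol)))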



We will assume, of course, that the file produced by the compressor begins with metadata describing a symbol-to-codeword mapping that is consistent with a Lefter-is-Deeper Huffman Tree. We consider two possible ways in which the metadata might describe that mapping, one in which the Huffman tree is described explicitly and the other in which it is described implicitly. (Both of these possibilities were mentioned earlier.)

Explicit Tree Representation

Here we assume that the metadata begins with a bit string that describes the structure of the (Lefter-is-Deeper) Huffman tree according to a preorder traversal in which each visit to an interior (respectively, leaf) node produces a 0 (respectively, 1). (This manner of encoding a Huffman tree should be familiar to the reader.)

Following that would be a list of the native codes of the symbols, going from the symbol with the lexicographically smallest codeword (corresponding to the tree's leftmost leaf) to the one with the largest (corresponding to the tree's rightmost leaf). Of course, this list of native codes would have to be parsable, meaning that the boundaries between the elements could be determined algorithmically. (If the native codes are of a known fixed length, that would not be a problem; otherwise one could precede each native code with a length indicator in Elias-Gamma form, for example.) Here we are not concerned with the details of how to encode the list of native codes, however.
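
As an illustration (my own sketch, in Python) of how the decompressor might consume such a preorder bit string, the following recovers the leaf depths, which are exactly the codeword lengths in lexicographically increasing order of the codewords:

def leaf_depths(tree_bits):
    # tree_bits: the preorder encoding, 0 = interior node, 1 = leaf.
    bits = iter(tree_bits)
    depths = []
    def walk(depth):
        if next(bits) == 1:       # a leaf: record its depth
            depths.append(depth)
        else:                     # an interior node: its two subtrees
            walk(depth + 1)       # follow, left then right
            walk(depth + 1)
    walk(0)
    return depths

# The preorder encoding of the example tree shown earlier:
print(leaf_depths([int(b) for b in "000000111101101100111"]))
# [6, 6, 5, 4, 4, 4, 3, 3, 3, 3, 2]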

Implicit Tree Representation

Here we assume that the metadata begins with an encoding of the sequence ⟨ minLen, maxLen, c_minLen, c_(minLen+1), ..., c_maxLen ⟩, where minLen and maxLen are, respectively, the minimum and maximum lengths of codewords and, for each i, c_i is the number of (symbols having) codewords of length i. As with the explicit tree representation, that would be followed by the list of the native codes of the symbols.
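
To sketch the decompressor's side of this (again in Python, with invented names, and leaving aside, as above, exactly how the integers and native codes are themselves encoded), the arrays minCW[] and CW2Symbol[][] can be built directly from the decoded metadata, with the Half-of-Successor recurrence supplying the minCW values:

def build_tables(minLen, maxLen, counts, symbols):
    # counts[i]: the number of codewords of length minLen + i.
    # symbols: the native codes, from the symbol with the
    # lexicographically smallest codeword to the one with the largest.
    count = {minLen + i: c for i, c in enumerate(counts)}
    minCW = {maxLen: 0}
    for k in range(maxLen - 1, minLen - 1, -1):
        # ceil((m+1)/2), where m is the largest value of length k+1
        minCW[k] = (minCW[k + 1] + count.get(k + 1, 0) + 1) // 2
    CW2Symbol, pos = {}, 0
    for k in range(maxLen, minLen - 1, -1):   # longest codewords first
        CW2Symbol[k] = symbols[pos:pos + count.get(k, 0)]
        pos += count.get(k, 0)
    return minCW, CW2Symbol

# Example metadata: <2, 6, 1, 4, 3, 1, 2> followed by the 11 symbols:
minCW, CW2Symbol = build_tables(2, 6, [1, 4, 3, 1, 2],
    ['k', 'h', 'b', 'l', 'o', 'j', 'c', 'i', 'a', 'f', 'e'])
print(minCW)       # {6: 0, 5: 1, 4: 1, 3: 2, 2: 3}
print(CW2Symbol)   # matches the CW2Symbol figure above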