Canonical Huffman Coding

In standard Huffman coding, the compressor builds a Huffman Tree based upon the counts/frequencies of the symbols occurring in the file-to-be-compressed and then assigns to each symbol the codeword implied by the path from the root to the leaf node associated with that symbol. For example, if we adopt the convention that an edge from a node to its left (respectively, right) child is labeled 0 (resp., 1), then a symbol whose leaf is reached by the path left, left, right, left, right, left is assigned the codeword 001010.

Canonical Huffman Coding recognizes that the essential information provided by a Huffman Tree is the mapping from symbols to their codeword lengths; the particular bit patterns of the codewords are secondary and can be computed independently of the tree. Indeed, in Canonical Huffman Coding the set of codewords that is employed depends solely upon the distribution of codeword lengths. This codeword set is chosen so as to satisfy not only the familiar prefix-freeness property (i.e., no codeword is the prefix of any other), which guarantees that the deciphering delay is zero, but also this property:

Longer-is-Lesser property:
If x and y are codewords, with |x| > |y|, then x' ≺ y, where x' is the prefix of x of length |y|.

Using standard notation, |z| denotes the length of z and ≺ denotes the "lexicographically less than" relation. Lexicographic ordering is essentially the same as alphabetic ordering.

With respect to bit strings u and v, to say that u ≺ v is to say that either u is a proper prefix of v or else the leftmost bit in which they differ is a 0 in u and a 1 in v. For example, 100101 ≺ 10100 because of the bits in the 3rd position (counting from one at the left). (For essentially the same reason, the word "carwash" precedes "cattle" in the dictionary.)

Now, if A and B are leaves in a Huffman Tree (in which edges to left (respectively, right) children are labeled 0 (resp., 1)) with corresponding codewords x and y (i.e., the labels on the edges along the path from the root to A (respectively, B) spell out x (resp., y)) then x ≺ y is equivalent to A being "to the left" of B in the tree. (Let C be the nearest common ancestor of A and B in the tree. For A to be to the left of B means that A is in the left subtree of C and B is in the right subtree of C.)

Thus, in order for the set of codewords induced by a Huffman Tree to satisfy the Longer-is-Lesser property, the tree must have this property:

Lefter-is-Deeper property:
If A and B are leaves and A is to the left of B, then depthOf(A) ≥ depthOf(B). (The depth of a node is its distance from the root.)

But we can take any Huffman Tree and, by a judicious sequence of swaps of subtrees rooted at nodes of the same depth, arrive at another Huffman Tree having the Lefter-is-Deeper property and having a set of codewords whose length distribution is the same as that in the original tree.

Even though such a Huffman Tree transformation process is possible, it's not necessary to do it that way. A better approach is to take the codeword length distribution of the original tree and to build a Lefter-is-Deeper tree directly therefrom. Indeed, for any given distribution of lengths, there is only one possible Lefter-is-Deeper tree structure.

For example, suppose that the symbol frequencies led us to build one of the many Huffman Trees in which the codeword length distribution was as on the left below. Then the corresponding (unique) Lefter-is-Deeper tree (where each leaf's depth is explicitly indicated) is in the middle, and the resulting set of codewords (listed in lexicographically increasing, and thus length non-increasing, order) is to the right:

Codeword Length Distribution

   Length   Number
      6        2
      5        1
      4        3
      3        4
      2        1

Lefter-is-Deeper Huffman Tree

                            *
                           / \
                          /   \ 
                         /     \ 
                        /       \ 
                       /         \ 
                      /           \ 
                     /             \ 
                    /               \
                   /                 \
                  /                   \
                 /                     \
                *                       *
               / \                     / \
              /   \                   /   \
             /     \                 /     \
            /       \               /       \
           /         \             /         \
          *           *           *           *
         / \         / \         / \          2
        /   \       /   \       /   \
       /     \     /     \     /     \
      *       *   *       *   *       *
     / \     / \  3       3   3       3
    *   *   *   *
   / \  4   4   4
  *   *
 / \  5
*   *
6   6
Codewords
000000
000001
00001
0001
0010
0011
010
011
100
101
11

Significantly, the Longer-is-Lesser set of codewords that arises from a Lefter-is-Deeper tree has some interesting properties when you interpret each codeword as a natural number (in accord with the binary numeral system).

In preparation for describing these properties, we offer a few definitions:

Definition 1: For a bit string z, let #(z) be the natural number represented by z in accord with the binary numeral system. This function can be described recursively like this:

#(λ) = 0    (λ denotes the empty string)
#('0') = 0
#('1') = 1
#(zb) = 2·#(z)  +  #(b)   (z a bit string and b='0' or '1')

Examples: #(1001) = 9, #(00110) = 6, #(00110110) = 54.
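
In code form, Definition 1 can be transcribed directly; here is a minimal Python sketch (the function name nat is ours, standing in for #):

def nat(z: str) -> int:
    """#(z): the natural number represented by bit string z, per Definition 1."""
    if z == "":                             # #(λ) = 0
        return 0
    return 2 * nat(z[:-1]) + int(z[-1])     # #(zb) = 2·#(z) + #(b)

assert nat("1001") == 9
assert nat("00110") == 6
assert nat("00110110") == 54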

Theorem 1: #(uv) = 2^|v|·#(u) + #(v)
Proof: The proof is by mathematical induction on |v|:
For the basis, suppose that |v| = 0 (i.e., v = λ), so that uv = u. Then we have
  
    #(uv) 

=     < v is the empty string >

    #(u) 

=     < 1, 0 are the identities of ·, +, respectively >

    1·#(u) + 0

=     < 2^0 = 1 >

    2^0·#(u) + 0

=     < |v| = 0 (i.e., v = λ) and #(λ) = 0 >

    2^|v|·#(u) + #(v)
For the induction step, let n ≥ 0 be arbitrary and assume as an induction hypothesis (IH) that the theorem holds whenever |v| = n. We show that it holds when |v| = n+1. Toward this end, suppose that v = wb, where w is a bit string of length n and b is either '0' or '1'. Then we have
    #(uv) 

=     < v = wb >

    #(uwb)

=     < Definition 1 >

    2·#(uw) + #(b)

=     < IH applied to uw (note that |w| = n) >

    2·(2^|w|·#(u) + #(w)) + #(b)

=     < · distributes over + >

    2·2^|w|·#(u) + 2·#(w) + #(b)

=     < Definition 1 >

    2·2^|w|·#(u) + #(wb)

=     < 2·2^k = 2^(k+1) >

    2^(|w|+1)·#(u) + #(wb)

=     < |wb| = |w| + 1 >

    2^|wb|·#(u) + #(wb)

=     < v = wb >

    2^|v|·#(u) + #(v)
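
For example, taking u = 10 and v = 01, the theorem gives #(1001) = 2^2·#(10) + #(01) = 4·2 + 1 = 9, in agreement with the earlier example.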

Theorem 2: For any r≥0 and any bit string v,
(a) #(0v) = #(v)
(b) #(0^r) = 0
(c) #(10^r) = 2^r
(d) #(1^r) = 2^r - 1

Proofs (sketches):
(2a) Apply Theorem 1, choosing u = "0".
(2b) can be proved by induction on r using (2a).
(2c) follows from Definition 1 and (2b).
(2d) can be proved by induction on r using Theorem 1.
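
For example, #(00110) = #(110) = 6 by (a) (applied twice), #(1000) = 2^3 = 8 by (c), and #(1111) = 2^4 - 1 = 15 by (d).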

Definition 2: For a node X, let P(X) be the bit string spelled out by the edge labels (0 for a "left" edge and 1 for a "right" edge) along the path from the root to node X.

Longer's-Prefix-is-Lesser property:
Let x and y be codewords, with |x| > |y|, and let x' be the prefix of x of length |y|. Then #(x') < #(y).

Proof: Because |x| > |y|, the leaf A satisfying P(A) = x must be to the left of the leaf B satisfying P(B) = y (by the Lefter-is-Deeper property). Let C be the nearest common ancestor of A and B, and let P(C) = u. Then, for some bit strings v and w of the same length r ≥ 0, x' = u0v and y = u1w. We maximize #(x') by choosing v = 1^r and minimize #(y) by choosing w = 0^r. Thus, it suffices to show that #(u01^r) < #(u10^r), or, equivalently, #(u10^r) - #(u01^r) > 0.

    #(u10^r) - #(u01^r)

=      < Theorem 1 >

    (2^(r+1)·#(u) + #(10^r)) - (2^(r+1)·#(u) + #(01^r))

=      < algebra >

    #(10^r) - #(01^r)

=      < Theorem 2a, 2c, 2d >

    2^r - (2^r - 1)

=      < algebra >

    1

>      < number theory! >

    0
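
To illustrate with the codewords from the figure above: taking x = 00001 and y = 11, we get x' = 00, and indeed #(00) = 0 < 3 = #(11).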

Consecutive-Values property:
For any particular length, the codewords of that length represent a consecutive range of natural numbers.

Proof: It suffices to show that, in a left-to-right traversal of the fringe of the tree, any two "consecutive" leaves A and B of the same depth are such that #(P(A)) + 1 = #(P(B)).

Let node C be the nearest common ancestor of consecutive leaves A and B. Then the path from C to A (respectively, B) is composed of an edge labeled 0 (resp., 1) followed by r edges labeled 1 (resp., 0), for some r≥0. That is, letting P(C) = x, we have P(A) = x01^r and P(B) = x10^r for some r≥0. Now we show that #(P(A)) + 1 = #(P(B)):

   #(P(A)) + 1

=    < P(A) = x01^r >

   #(x01^r) + 1

=    < Theorem 1 >

   2^(r+1)·#(x) + #(01^r) + 1

=    < Theorem 2a, 2d >

   2^(r+1)·#(x) + 2^r - 1 + 1

=    < -1 and 1 cancel; Theorem 2c >

   2^(r+1)·#(x) + #(10^r)

=    < Theorem 1 >

   #(x10^r)

=    < P(B) = x10^r >

   #(P(B))

In our example, the three codewords of length four represent the numbers in the range 1..3 and the four codewords of length three represent the numbers in the range 2..5.

Half-of-Successor property:
For any k such that there are codewords of both lengths k and k+1, the smallest codeword of length k has value (m+1)/2, where m is the value of the largest codeword of length k+1.

Proof: Let A be the leaf corresponding to the largest codeword of length k+1 and B the leaf corresponding to the smallest codeword of length k. First we observe that, since the left-to-right order of leaves of a given depth corresponds to the numeric order of their codewords, A must be the rightmost leaf of depth k+1. As such, A must be the right child of its parent. For suppose instead that it were the left child of its parent. Then either A has no sibling to the right, which contradicts the tree being full; or that sibling is a leaf, which contradicts A being the rightmost leaf of depth k+1; or that sibling is not a leaf, which contradicts the Lefter-is-Deeper property of the tree.

Let C be the nearest common ancestor of A and B, and suppose that P(C) = u. Node A must be the rightmost leaf in the left subtree of C and B must be the leftmost leaf in the right subtree of C. Thus, the path from C to A must follow an edge labeled 0, followed by some number r ≥ 0 of edges labeled 1, followed by one more edge labeled 1. (Recall that A is a right child.) Meanwhile, the path from C to B must follow an edge labeled 1 followed by that same number r of edges labeled 0. In other words, there exists some r ≥ 0 such that P(A) = u01^(r+1) and P(B) = u10^r. (Note that k = |u| + r + 1.)

To complete the proof, we must show that #(P(B)) = (#(P(A)) + 1) / 2

    (#(P(A)) + 1) / 2

=      < P(A) = u01^(r+1) >

    (#(u01^(r+1)) + 1) / 2

=      < Theorem 1 >

    (2^(r+2)·#(u) + #(01^(r+1)) + 1) / 2

=      < Theorem 2a, 2d >

    (2^(r+2)·#(u) + 2^(r+1) - 1 + 1) / 2

=      < -1 and 1 cancel >

    (2^(r+2)·#(u) + 2^(r+1)) / 2

=      < / distributes over +; 2^(m+1)/2 = 2^m >

    2^(r+1)·#(u) + 2^r

=      < Theorem 2c >

    2^(r+1)·#(u) + #(10^r)

=      < Theorem 1 >

    #(u10^r)

=      < P(B) = u10^r >

    #(P(B))
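
In our example, the largest codeword of length 4 is 0011 (value 3) and the smallest codeword of length 3 is 010 (value 2), and indeed (3+1)/2 = 2.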

Because it sometimes happens that there exist k and d>1 such that there are codewords of length k and of length k+d, but none of lengths strictly between k and k+d, we find reason to introduce this generalization of the above theorem:

Generalized Half-of-Successor property:
If k and d>0 are such that there are codewords of length k and of length k+d, but none of any length strictly in between, then the smallest codeword of length k has value (m+1)/2^d, where m is the value of the largest codeword of length k+d.

Proof: Let A be the leaf corresponding to the largest codeword of length k+d and B the leaf corresponding to the smallest codeword of length k.

Let C be the nearest common ancestor of A and B, and suppose that P(C) = u. Following reasoning similar to that in the proof of the basic version of this theorem, A must be the rightmost leaf in the left subtree of C and B must be the leftmost leaf in the right subtree of C. The depth of A is greater by d than the depth of B, and so the path from C to A must follow an edge labeled 0, followed by r+d edges labeled 1, for some r≥0, while the path from C to B must follow an edge labeled 1 followed by r edges labeled 0. In other words, there exists some r ≥ 0 such that P(A) = u01^(r+d) and P(B) = u10^r. (Note that k = |u| + r + 1.)

To complete the proof, we must show that #(P(B)) = (#(P(A)) + 1) / 2^d

    (#(P(A)) + 1) / 2^d

=      < P(A) = u01^(r+d) >

    (#(u01^(r+d)) + 1) / 2^d

=      < Theorem 1 >

    (2^(r+d+1)·#(u) + #(01^(r+d)) + 1) / 2^d

=      < Theorem 2a, 2d >

    (2^(r+d+1)·#(u) + 2^(r+d) - 1 + 1) / 2^d

=      < -1 and 1 cancel >

    (2^(r+d+1)·#(u) + 2^(r+d)) / 2^d

=      < / distributes over +; 2^(m+d)/2^d = 2^m >

    2^(r+1)·#(u) + 2^r

=      < Theorem 2c >

    2^(r+1)·#(u) + #(10^r)

=      < Theorem 1 >

    #(u10^r)

=      < P(B) = u10^r >

    #(P(B))
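
As a (hypothetical) illustration, consider a code with twelve codewords of length 4 and one of length 2 (and none of length 3): the largest codeword of length 4 is 1011 (value 11), the only codeword of length 2 is 11 (value 3), and indeed (11+1)/2^2 = 3.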


All this is quite interesting, of course, but is there any advantage in employing a set of codewords that arises from a Lefter-is-Deeper tree? Answer: Yes, as the following discussion describes.


Decompression

The biggest gains to come from using Canonical Huffman Coding are in performing decompression, so we look at those first.

Because of the constrained nature of the codeword set (in particular, the Longer's-Prefix-is-Lesser and Consecutive-Values properties), it turns out that, in place of storing an explicit representation of the Huffman Tree, all that the decompressor needs are two arrays, minCW[] and CW2Symbol[][]. For each relevant value of i, minCW[i] contains the (numeric) value of the smallest codeword of length i. For each pair of relevant values of i and j, CW2Symbol[i][j] contains the native representation of the symbol to which the j-th smallest codeword of length i (counting from j = 0) has been assigned. In a real-life application, a native representation would likely be a byte or a short sequence of bytes (e.g., representing an English word).

For the sake of making our example (as seen in the figure above) concrete, we use the lower case letters a through k to refer to the (native representations of the) eleven symbols in the source alphabet and we assign a codeword to each one. This assignment, as well as the corresponding values of the arrays minCW[] and CW2Symbol[], can be seen in the figure below. Note that each element of minCW[] actually contains the numeric value of a codeword (i.e., #(x) for codeword x), but we also show in parentheses the (binary) codeword itself.

Codeword  Symbol

000000      j
000001      g
00001       f
0001        h
0010        c
0011        b
010         d
011         a
100         k
101         i
11          e
       minCW

   +-----------+
 2 | 3 (11)    |
   +-----------+
 3 | 2 (010)   |
   +-----------+
 4 | 1 (0001)  |
   +-----------+
 5 | 1 (00001) |
   +-----------+
 6 | 0 (000000)|
   +-----------+
     CW2Symbol

   +---+
 2 |'e'|
   +---+---+---+---+
 3 |'d'|'a'|'k'|'i'|
   +---+---+---+---+
 4 |'h'|'c'|'b'|
   +---+---+---+
 5 |'f'|
   +---+---+
 6 |'j'|'g'|
   +---+---+

Of course, the decompressor must make use of the metadata at the beginning of the compressed file to construct these arrays. (How that is accomplished is addressed later.) Having done that, its job is to carry out this high-level algorithm:

while (hasMoreBits()) {
   BitString x := emptyString;
   do {
      x.append(nextBit());   // append next bit onto rear of x 
   }
   while (!isCodeword(x));
   emit nativeRepOf(x);  // emit the native representation of 
}                        // the symbol whose codeword is x

What is not obvious is how to implement isCodeword() and nativeRepOf() making use of nothing but the data stored in arrays minCW[] and CW2Symbol[][].

The solutions to these two problems rely, respectively, upon the guarantees that the set of codewords possesses the Longer's-Prefix-is-Lesser and Consecutive-Values properties!

To illustrate how we can tell, as bits are appended to x, when it has finally become equal to some codeword, suppose that z is a codeword and let z_k be the prefix of z of length k, for each k in the range 1..|z|. By the Longer's-Prefix-is-Lesser property of the codewords, we have that #(z_i) < minCW[i] for all i < |z|. Trivially, we also have that #(z) ≥ minCW[|z|]. That is, every proper prefix z' of z is numerically less than the smallest codeword of length |z'|, but z itself is (obviously) numerically greater than or equal to the smallest codeword of length |z|. Thus, as bits are appended to x during execution of the algorithm, we know that its value has become a codeword upon the condition #(x) < minCW[|x|] becoming false. Moreover, because of the Consecutive-Values property, we know at that point that the symbol whose codeword is x is the one in CW2Symbol[|x|][j], where j = #(x) - minCW[|x|].

Here, then, is a more concrete version of the decompression algorithm. Rather than using variable x to store the bit string consumed so far (which is necessarily the prefix of some codeword), variables v and len are used, where v = #(x) and len = |x|.

while (hasMoreBits()) {
   v := 0;  len := 0;
   // loop invariant: 
   //    Let x be the bit string consumed so far (during current iteration of outer loop)
   //    Then #(x) = v ∧ len = |x| ∧ no prefix of x is a codeword
   do {
      v := 2*v + nextBit();
      len := len+1;
   }
   while (v < minCW[len]);

   // v is a codeword of length len
   j := v - minCW[len];
   emit(CW2Symbol[len][j]);
}
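
To make this concrete, here is a runnable Python sketch of the decoder, loaded with the minCW and CW2Symbol contents from the running example. (The names, and the handling of absent lengths via a membership test rather than via sentinel values in minCW[], are ours.)

minCW = {2: 3, 3: 2, 4: 1, 5: 1, 6: 0}   # smallest codeword value, per length

CW2Symbol = {                             # symbols, per length, in codeword order
    2: ['e'],
    3: ['d', 'a', 'k', 'i'],
    4: ['h', 'c', 'b'],
    5: ['f'],
    6: ['j', 'g'],
}

def decode(bits: str) -> str:
    """Decode a string of '0'/'1' characters into the original symbols."""
    out = []
    i = 0
    while i < len(bits):
        v = 0        # numeric value of the bits consumed so far
        length = 0   # how many bits have been consumed so far
        while True:
            v = 2 * v + int(bits[i])
            i += 1
            length += 1
            if length in minCW and v >= minCW[length]:
                break                     # v is now a codeword of this length
        out.append(CW2Symbol[length][v - minCW[length]])
    return ''.join(out)

assert decode("110010000001") == "ecg"    # 11 -> e, 0010 -> c, 000001 -> g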

One issue that must be addressed is what value to place into minCW[k] in case there are no codewords of length k. Of course, that value must be greater than the value of the length-k prefix of any codeword having length greater than k. One solution is to set minCW[k] to 2^k, as every bit string of length k has a smaller value. Later we will see that, in order to avoid having to treat any lengths (except for the maximum codeword length) as special cases, we can fill the values of minCW[] like this, where r[i] is the number of codewords of length i:

minCW[maxLen] := 0;
k := maxLen-1;
while (k != 0) {
   // (m+1)/2, where m = minCW[k+1] + r[k+1] - 1 is the value of the
   // largest codeword of length k+1; the division is always exact
   minCW[k] := (minCW[k+1] + r[k+1]) / 2;
   k := k-1
}

Note that this calculation of minCW[k] is consistent with the Consecutive-Values and Half-of-Successor properties. In particular, the m mentioned in the Half-of-Successor property, which refers to the largest codeword of length k+1, corresponds to minCW[k+1] + r[k+1] - 1 because of the Consecutive-Values property, so that m+1 corresponds to minCW[k+1] + r[k+1].

Showing that these calculations work out even when there are lengths for which there are no codewords requires a little bit of work.
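
As a sanity check, here is a small Python sketch (names ours) that performs this fill for the running example's length distribution and confirms that it reproduces the minCW[] contents shown earlier:

maxLen = 6
r = {2: 1, 3: 4, 4: 3, 5: 1, 6: 2}    # r[i] = number of codewords of length i

minCW = {maxLen: 0}
for k in range(maxLen - 1, 0, -1):
    # (m+1)/2, with m the largest codeword value of length k+1;
    # the sum is always even, so the division is exact
    minCW[k] = (minCW[k + 1] + r.get(k + 1, 0)) // 2

assert minCW == {6: 0, 5: 1, 4: 1, 3: 2, 2: 3, 1: 2}

Note that the fill yields minCW[1] = 2, a value that no 1-bit string can reach, so the decoder never mistakenly stops after consuming a single bit.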




We will assume, of course, that the file produced by the compressor begins with metadata describing a symbol-to-codeword mapping that is consistent with a Lefter-is-Deeper Huffman Tree. We consider two possible ways in which the metadata might describe that mapping, one in which the Huffman tree is described explicitly and the other in which it is described implicitly. (Both of these possibilities were mentioned earlier.)

Explicit Tree Representation

Here we assume that the metadata begins with a bit string that describes the structure of the (Lefter-is-Deeper) Huffman tree according to a preorder traversal in which each visit to an interior (respectively, leaf) node produces a 0 (respectively, 1). (This manner of encoding a Huffman tree should be familiar to the reader.)

Following that would be a list of the native codes of the symbols, going from the symbol with the lexicographically smallest codeword (corresponding to the tree's leftmost leaf) to the one with the largest (corresponding to the tree's rightmost leaf). Of course, this list of native codes would have to be parsable, meaning that the boundaries between the elements could be determined algorithmically. (If the native codes are of a known fixed length, that would not be a problem; otherwise one could precede each native code with a length indicator in Elias-Gamma form, for example.) Here we are not concerned with the details of how to encode the list of native codes, however.
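
As an illustration of the explicit representation, here is a Python sketch (ours) that recovers the leaves' codeword lengths, in left-to-right order, from such a preorder bit string; applied to the encoding of the tree in the figure, it yields the length distribution we started with:

def leaf_depths(tree_bits: str) -> list[int]:
    """Given the preorder encoding of a full binary tree
    (0 = interior node, 1 = leaf), return the leaves' depths
    in left-to-right order (i.e., their codeword lengths)."""
    depths = []
    pos = 0
    def visit(depth: int) -> None:
        nonlocal pos
        bit = tree_bits[pos]
        pos += 1
        if bit == '1':
            depths.append(depth)
        else:                 # interior node: left subtree, then right
            visit(depth + 1)
            visit(depth + 1)
    visit(0)
    return depths

# Preorder encoding of the Lefter-is-Deeper tree in the figure:
assert leaf_depths("000000111101101100111") == [6, 6, 5, 4, 4, 4, 3, 3, 3, 3, 2]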

Implicit Tree Representation

Here we assume that the metadata begins with an encoding of the sequence ⟨minLen, maxLen, c_minLen, c_(minLen+1), ..., c_maxLen⟩, where minLen and maxLen are, respectively, the minimum and maximum codeword lengths and, for each i, c_i is the number of (symbols having) codewords of length i. As with the explicit tree representation, that would be followed by the list of the native codes of the symbols.