Huffman coding

[[Image:Huffman tree 2.svg|thumb|Huffman tree generated from the exact frequencies of the text "this is an example of a huffman tree". Encoding the sentence with this code requires 135 (or 147) bits, as opposed to 288 (or 180) bits if 36 characters of 8 (or 5) bits were used (This assumes that the code tree structure is known to the decoder and thus does not need to be counted as part of the transmitted information). The frequencies and codes of each character are shown in the accompanying table.

{| class="wikitable sortable"
|-
!Char!!Freq!!Code
|-
|space||7||111
|-
|a ||4||010
|-
|e ||4||000
|-
|f ||3||1101
|-
|h ||2||1010
|-
|i ||2||1000
|-
|m ||2||0111
|-
|n ||2||0010
|-
|s ||2||1011
|-
|t ||2||0110
|-
|l ||1||11001
|-
|o ||1||00110
|-
|p ||1||10011
|-
|r ||1||11000
|-
|u ||1||00111
|-
|x ||1||10010
|}
]]

In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (weight) for each possible value of the source symbol. As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols. Huffman's method can be efficiently implemented, finding a code in time linear in the number of input weights if these weights are sorted. However, although optimal among methods that encode symbols separately, Huffman coding is not always optimal among all compression methods; arithmetic coding is used instead when a better compression ratio is required.
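As an illustration of using such a code table, the following minimal sketch encodes and decodes the sentence from the figure with the codewords listed in its table. The names `CODE`, `encode` and `decode` are hypothetical helpers chosen for this example, and the decoder assumes the table is a prefix code.

```python
# Code table copied from the figure's table (character -> codeword).
CODE = {
    " ": "111", "a": "010", "e": "000", "f": "1101", "h": "1010",
    "i": "1000", "m": "0111", "n": "0010", "s": "1011", "t": "0110",
    "l": "11001", "o": "00110", "p": "10011", "r": "11000",
    "u": "00111", "x": "10010",
}

def encode(text, code):
    """Concatenate the codeword of every character in the text."""
    return "".join(code[ch] for ch in text)

def decode(bits, code):
    """Read bits left to right, emitting a symbol whenever a codeword matches."""
    inverse = {word: ch for ch, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:   # unambiguous because the code is prefix-free
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

message = "this is an example of a huffman tree"
bits = encode(message, CODE)
print(len(bits))                 # number of bits in the encoded sentence
print(decode(bits, CODE) == message)
```

The encoded length agrees with the 135 bits quoted in the figure caption, compared with 288 bits for 36 characters of 8 bits each.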

History

In 1951, David A. Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient.

In doing so, Huffman outdid Fano, who had worked with Claude Shannon to develop a similar code. Building the tree from the bottom up guaranteed optimality, unlike the top-down approach of Shannon–Fano coding.

Terminology

Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code (sometimes called a "prefix-free code"; that is, the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol). Huffman coding is such a widespread method for creating prefix codes that the term "Huffman code" is widely used as a synonym for "prefix code" even when such a code is not produced by Huffman's algorithm.
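The prefix property can be checked mechanically. The sketch below is one possible test, with `is_prefix_free` a hypothetical helper name; it relies on the fact that, after lexicographic sorting, a codeword that is a prefix of any other codeword is also a prefix of its immediate successor.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    words = sorted(codewords)
    # In sorted order, every string lying between a word and one of its
    # extensions also starts with that word, so checking neighbours suffices.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_free(["0", "10", "110", "111"]))  # a prefix code
print(is_prefix_free(["0", "01", "11"]))          # "0" is a prefix of "01"
```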

Problem definition

[Figure: Constructing a Huffman tree]

Informal description

;Given: A set of symbols <math>S</math> and for each symbol <math>x \in S</math>, the frequency <math>f_x</math> representing the fraction of symbols in the text that are equal to <math>x</math>.

;Find: A prefix-free binary code (a set of codewords) with minimum expected codeword length (equivalently, a tree with minimum weighted path length from the root).

Formalized description

Input.

Alphabet <math>A = (a_{1},a_{2},\dots,a_{n})</math>, which is the symbol alphabet of size <math>n</math>.

Tuple <math>W = (w_{1},w_{2},\dots,w_{n})</math>, which is the tuple of the (positive) symbol weights (usually proportional to probabilities), i.e. <math>w_{i} = \operatorname{weight}\left(a_{i}\right),\, i \in \{1, 2, \dots, n\}</math>.

Output.

Code <math>C\left(W\right) = (c_{1},c_{2},\dots,c_{n})</math>, which is the tuple of (binary) codewords, where <math>c_{i}</math> is the codeword for <math>a_{i},\, i \in \{1, 2, \dots, n\}</math>.

Goal.

Let <math display="inline">L(C(W)) = \sum_{i=1}^n w_i\operatorname{length}(c_i)</math> be the weighted path length of code <math>C</math>. Condition: <math>L(C(W)) \leq L(T(W))</math> for any code <math>T(W)</math>.

Example

We give an example of the result of Huffman coding for a code with five characters and given weights. We will not verify that it minimizes L over all codes, but we will compute L and compare it to the Shannon entropy H of the given set of weights; the result is nearly optimal.

{|class="wikitable"
|-
!rowspan="2" style="background:#efefef"| Input (A, W)
!style="background:#efefef;font-weight:normal"| Symbol (<math>a_i</math>)
|align="center" style="background:#efefef"| a
|align="center" style="background:#efefef"| b
|align="center" style="background:#efefef"| c
|align="center" style="background:#efefef"| d
|align="center" style="background:#efefef"| e
!style="background:#efefef"| Sum
|-
!style="background:#efefef;font-weight:normal"| Weights (<math>w_i</math>)
|align="center"| 0.10
|align="center"| 0.15
|align="center"| 0.30
|align="center"| 0.16
|align="center"| 0.29
|align="center"| = 1
|-
!rowspan="3" style="background:#efefef"| Output C
!style="background:#efefef;font-weight:normal"| Codewords (<math>c_i</math>)
|align="center"| <code>010</code>
|align="center"| <code>011</code>
|align="center"| <code>11</code>
|align="center"| <code>00</code>
|align="center"| <code>10</code>
|rowspan="2"| 
|-
!style="background:#efefef;font-weight:normal"| Codeword length (in bits) (<math>\operatorname{length}(c_i)</math>)
|align="center"| 3
|align="center"| 3
|align="center"| 2
|align="center"| 2
|align="center"| 2
|-
!style="background:#efefef;font-weight:normal"| Contribution to weighted path length (<math>w_i \operatorname{length}(c_i)</math>)
|align="center"| 0.30
|align="center"| 0.45
|align="center"| 0.60
|align="center"| 0.32
|align="center"| 0.58
|align="center"| L(C) = 2.25
|-
!rowspan="3" style="background:#efefef"| Optimality
!style="background:#efefef;font-weight:normal"| Probability budget (<math>2^{-\operatorname{length}(c_i)}</math>)
| align="center" | 1/8
| align="center" | 1/8
| align="center" | 1/4
| align="center" | 1/4
| align="center" | 1/4
| align="center" | = 1.00
|-
! style="background: #efefef; font-weight: normal;" | Information content (in bits) (<math>-\log_2 w_i</math>) ≈
|align="center"| 3.32
|align="center"| 2.74
|align="center"| 1.74
|align="center"| 2.64
|align="center"| 1.79
|align="center"|  
|-
! style="background: #efefef; font-weight: normal;" | Contribution to entropy (<math>-w_i \log_2 w_i</math>)
|align="center"| 0.332
|align="center"| 0.411
|align="center"| 0.521
|align="center"| 0.423
|align="center"| 0.518
|align="center"| H(A) = 2.205
|}

For any code that is biunique, meaning that the code is uniquely decodable, the sum of the probability budgets across all symbols (where the budget of a symbol with a codeword of <math>k</math> bits is <math>2^{-k}</math>) is always less than or equal to one. In this example, the sum is exactly equal to one; as a result, the code is termed a complete code. If this is not the case, one can always derive an equivalent code by adding extra symbols (with associated null probabilities), to make the code complete while keeping it biunique.
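The probability-budget sum is straightforward to compute exactly. The sketch below uses a hypothetical helper `kraft_sum` with rational arithmetic, applied to the five codewords of the example and to a deliberately incomplete code.

```python
from fractions import Fraction

def kraft_sum(codewords):
    """Sum of 2**(-length) over all codewords, computed exactly."""
    return sum(Fraction(1, 2 ** len(c)) for c in codewords)

example = ["010", "011", "11", "00", "10"]   # codewords from the table
incomplete = ["0", "10"]                      # "11" is left unused

print(kraft_sum(example))      # equals 1 for a complete code
print(kraft_sum(incomplete))   # strictly below 1
```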

As defined by Shannon (1948), the information content <math>h</math> (in bits) of each symbol <math>a_i</math> with non-null probability is

:<math>h(a_i) = \log_2{1 \over w_i}. </math>

The entropy <math>H</math> (in bits) is the weighted sum, across all symbols <math>a_i</math> with non-zero probability <math>w_i</math>, of the information content of each symbol:

:<math display="block"> H(A) = \sum_{w_i > 0} w_i h(a_i) = \sum_{w_i > 0} w_i \log_2 {1 \over w_i} = - \sum_{w_i > 0} w_i \log_2 w_i. </math>

(Note: A symbol with zero probability has zero contribution to the entropy, since <math>\lim_{w \to 0^+} w \log_2 w = 0</math>. So for simplicity, symbols with zero probability can be left out of the formula above.)

As a consequence of Shannon's source coding theorem, the entropy is a lower bound on the average codeword length that is theoretically possible for the given alphabet with associated weights. In this example, the weighted average codeword length is 2.25 bits per symbol, only slightly larger than the calculated entropy of 2.205 bits per symbol. So not only is this code optimal in the sense that no other feasible code performs better, but it is very close to the theoretical limit established by Shannon.
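The two quantities compared above can be recomputed from the weights and codewords of the example. This is a small sketch; the function names `weighted_path_length` and `entropy` are chosen here for illustration.

```python
from math import log2

weights = {"a": 0.10, "b": 0.15, "c": 0.30, "d": 0.16, "e": 0.29}
code = {"a": "010", "b": "011", "c": "11", "d": "00", "e": "10"}

def weighted_path_length(weights, code):
    """L(C) = sum of w_i * length(c_i)."""
    return sum(w * len(code[sym]) for sym, w in weights.items())

def entropy(weights):
    """H(A) = -sum of w_i * log2(w_i) over symbols with w_i > 0."""
    return -sum(w * log2(w) for w in weights.values() if w > 0)

L = weighted_path_length(weights, code)
H = entropy(weights)
print(round(L, 3), round(H, 3))   # average length and entropy in bits per symbol
```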

In general, a Huffman code need not be unique. Thus the set of Huffman codes for a given probability distribution is a non-empty subset of the codes minimizing <math>L(C)</math> for that probability distribution. (However, for each minimizing codeword length assignment, there exists at least one Huffman code with those lengths.)

Basic technique

Compression

[Figure: Visualisation of the use of Huffman coding to encode the message "A_DEAD_DAD_CEDED_A_BAD_BABE_A_BEADED_ABACA_BED". In steps 2 to 6, the letters are sorted by increasing frequency, and the least frequent two at each step are combined and reinserted into the list, and a partial tree is constructed. The final tree in step 6 is traversed to generate the dictionary in step 7. Step 8 uses it to encode the message.]

[[Image:Huffman coding example.svg|thumb|A source generates 4 different symbols <math>\{a_1 , a_2 , a_3 , a_4 \}</math> with probability <math>\{0.4 ; 0.35 ; 0.2 ; 0.05 \}</math>. A binary tree is generated from left to right taking the two least probable symbols and putting them together to form another equivalent symbol having a probability that equals the sum of the two symbols. The process is repeated until there is just one symbol. The tree can then be read backwards, from right to left, assigning different bits to different branches. The final Huffman code is:

{|class="wikitable"
|-
! Symbol !! Code
|-
|a1 || 0
|-
|a2 || 10
|-
|a3 || 110
|-
|a4 || 111
|}

The standard way to represent a signal made of 4 symbols is by using 2 bits/symbol, but the entropy of the source is 1.74 bits/symbol. If this Huffman code is used to represent the signal, then the average length is lowered to 1.85 bits/symbol; it is still far from the theoretical limit because the probabilities of the symbols are different from negative powers of two.]]

The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols, <math>n</math>. A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and, optionally, a link to a parent node, which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain a weight, links to two child nodes and an optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has up to <math>n</math> leaf nodes and <math>n-1</math> internal nodes. A Huffman tree that omits unused symbols produces optimal code lengths.

The process begins with the leaf nodes containing the probabilities of the symbols they represent. The process then takes the two nodes with the smallest probability and creates a new internal node having these two nodes as children. The weight of the new node is set to the sum of the weights of the children. The process is then applied again to the new internal node and the remaining nodes (i.e., excluding the two nodes just merged), and it is repeated until only one node remains, which is the root of the Huffman tree.

The simplest construction algorithm uses a priority queue where the node with lowest probability is given highest priority:

# Create a leaf node for each symbol and add it to the priority queue.
# While there is more than one node in the queue:
## Remove the two nodes of highest priority (lowest probability) from the queue.
## Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
## Add the new node to the queue.
# The remaining node is the root node and the tree is complete.

Since efficient priority queue data structures require O(log n) time per insertion, and a tree with n leaves has 2n−1 nodes, this algorithm operates in O(n log n) time, where n is the number of symbols.
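The priority-queue construction described above can be sketched with Python's standard `heapq` module. This is a minimal illustration rather than a reference implementation: the function name `huffman_code`, the tuple layout of the nodes and the tie-breaking counter are assumptions made for this example, and the input is assumed to be a non-empty mapping of symbols to positive weights.

```python
import heapq
from itertools import count

def huffman_code(weights):
    """Return {symbol: codeword} for a non-empty mapping of positive weights."""
    if len(weights) == 1:                  # a lone symbol still needs one bit
        return {next(iter(weights)): "0"}
    order = count()                        # tie-breaker so nodes are never compared
    heap = [(w, next(order), ("leaf", sym)) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Remove the two lowest-weight nodes and merge them.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(order), ("node", left, right)))
    # Walk the tree: '0' for the left child, '1' for the right child.
    codes, stack = {}, [(heap[0][2], "")]
    while stack:
        node, prefix = stack.pop()
        if node[0] == "leaf":
            codes[node[1]] = prefix
        else:
            stack.append((node[1], prefix + "0"))
            stack.append((node[2], prefix + "1"))
    return codes

example = {"a": 0.10, "b": 0.15, "c": 0.30, "d": 0.16, "e": 0.29}
print(huffman_code(example))
```

The exact bit patterns depend on how ties and left/right children are chosen, but the codeword lengths for the five-symbol example are 3, 3, 2, 2 and 2, as in the table above.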

If the symbols are sorted by probability, there is a linear-time (O(n)) method to create a Huffman tree using two queues: the first contains the initial weights (along with pointers to the associated leaves), and combined weights (along with pointers to the trees) are put at the back of the second queue. This ensures that the lowest weight is always kept at the front of one of the two queues.

However, blocking arbitrarily large groups of symbols (that is, coding several source symbols together as a single unit) is impractical, as the complexity of a Huffman code is linear in the number of possibilities to be encoded, a number that is exponential in the size of a block. This limits the amount of blocking that is done in practice.

A practical alternative, in widespread use, is run-length encoding. This technique adds one step in advance of entropy coding, specifically counting runs of repeated symbols, which are then encoded. For the simple case of Bernoulli processes, Golomb coding is optimal among prefix codes for coding run length, a fact proved via the techniques of Huffman coding. A similar approach is taken by fax machines using modified Huffman coding. However, run-length coding is not as adaptable to as many input types as other compression technologies.
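The run-counting step can be illustrated with a short sketch. The helper names `run_length_encode` and `run_length_decode` are hypothetical, and the (symbol, count) pair output is only one of several possible representations; in a real codec the resulting pairs would then be passed to the entropy coder.

```python
from itertools import groupby

def run_length_encode(data):
    """Collapse each run of equal symbols into a (symbol, run length) pair."""
    return [(sym, len(list(run))) for sym, run in groupby(data)]

def run_length_decode(pairs):
    """Expand (symbol, run length) pairs back into the original string."""
    return "".join(sym * n for sym, n in pairs)

pairs = run_length_encode("aaabccdddd")
print(pairs)
print(run_length_decode(pairs))
```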

Variations

Many variations of Huffman coding exist, some of which use a Huffman-like algorithm, and others of which find optimal prefix codes (while, for example, putting different restrictions on the output). Note that, in the latter case, the method need not be Huffman-like, and, indeed, need not even be polynomial time.

n-ary Huffman coding

The n-ary Huffman algorithm uses an alphabet of size n, typically {0, 1, ..., n-1}, to encode messages and build an n-ary tree. This approach was considered by Huffman in his original paper. The same algorithm applies as for binary (<math alt="n equals 2">n = 2</math>) codes, but instead of combining the two least likely symbols, the n least likely symbols are grouped together.

Note that for n > 2, not all sets of source words can properly form a complete n-ary tree for Huffman coding. In these cases, additional placeholder symbols with 0 probability may need to be added. This is because the tree is built by repeatedly joining n branches into one, also known as an "n to 1" combination, and each such step reduces the number of nodes by n − 1. For binary coding, this is a "2 to 1" combination, which works with any number of symbols. For n-ary coding, a complete tree is only possible when the total number of symbols (real + placeholders) leaves a remainder of 1 when divided by (n − 1).

A related problem, Huffman coding with unequal letter costs, was solved by Richard M. Karp, whose solution has been refined for the case of integer costs by Mordecai J. Golin.
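The placeholder rule for n-ary trees stated above amounts to a one-line modular computation. The sketch below uses a hypothetical helper `placeholders_needed`; it returns how many zero-probability symbols must be added so that the total count minus one is divisible by n − 1.

```python
def placeholders_needed(num_symbols, n):
    """Zero-weight placeholders required for a complete n-ary Huffman tree."""
    if n < 2 or num_symbols < 1:
        raise ValueError("need n >= 2 and at least one symbol")
    # Each n-to-1 merge removes n - 1 nodes, and exactly one node (the root)
    # must remain, so (num_symbols + padding - 1) must be a multiple of n - 1.
    return (1 - num_symbols) % (n - 1)

print(placeholders_needed(6, 2))   # binary: never needs padding
print(placeholders_needed(6, 3))   # ternary with 6 symbols
print(placeholders_needed(5, 4))   # quaternary with 5 symbols
```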

Optimal alphabetic binary trees (Hu–Tucker coding)

In the standard Huffman coding problem, it is assumed that any codeword can correspond to any input symbol. In the alphabetic version, the alphabetic order of inputs and outputs must be identical. Thus, for example, <math>A = \left\{a,b,c\right\}</math> could not be assigned code <math>H\left(A,C\right) = \left\{00,1,01\right\}</math>, but instead should be assigned either <math>H\left(A,C\right) =\left\{00,01,1\right\}</math> or <math>H\left(A,C\right) = \left\{0,10,11\right\}</math>. This is also known as the Hu–Tucker problem, after T. C. Hu and Alan Tucker, the authors of the paper presenting the first <math>O(n\log n)</math>-time solution to this optimal binary alphabetic problem. Their solution has some similarities to Huffman's algorithm, but is not a variation of it. A later method, the Garsia–Wachs algorithm of Adriano Garsia and Michelle L. Wachs (1977), uses simpler logic to perform the same comparisons in the same total time bound. These optimal alphabetic binary trees are often used as binary search trees.

The canonical Huffman code

If weights corresponding to the alphabetically ordered inputs are in numerical order, the Huffman code has the same lengths as the optimal alphabetic code, which can be found from calculating these lengths, rendering Hu–Tucker coding unnecessary. The code resulting from numerically (re-)ordered input is sometimes called the canonical Huffman code and is often the code used in practice, due to ease of encoding/decoding. The technique for finding this code is sometimes called Huffman–Shannon–Fano coding, since it is optimal like Huffman coding, but alphabetic in weight probability, like Shannon–Fano coding. The Huffman–Shannon–Fano code corresponding to the example is <math>\{000,001,01,10,11\}</math>, which, having the same codeword lengths as the original solution, is also optimal. But in canonical Huffman code, the result is <math>\{110,111,00,01,10\}</math>.
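A canonical code can be rebuilt from the codeword lengths alone, which is what makes it convenient in practice. The sketch below follows the common convention of ordering symbols by length and then alphabetically and assigning consecutive binary values; the function name `canonical_code` is an assumption for this example.

```python
def canonical_code(lengths):
    """Assign canonical codewords from a {symbol: codeword length} mapping."""
    code = {}
    value, prev_len = 0, 0
    # Shorter codewords come first; ties are broken by symbol order.
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        value <<= (length - prev_len)      # append zeros when the length grows
        code[sym] = format(value, "0{}b".format(length))
        value += 1
        prev_len = length
    return code

lengths = {"a": 3, "b": 3, "c": 2, "d": 2, "e": 2}   # lengths from the example
print(canonical_code(lengths))
```

For the lengths of the five-symbol example, this assigns 110, 111, 00, 01 and 10 to a, b, c, d and e respectively, matching the canonical result quoted above.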

Applications

Arithmetic coding and Huffman coding produce equivalent results, achieving entropy, when every symbol has a probability of the form <math>1/2^k</math>. In other circumstances, arithmetic coding can offer better compression than Huffman coding because, intuitively, its "code words" can have effectively non-integer bit lengths, whereas code words in prefix codes such as Huffman codes can only have an integer number of bits. Therefore, a code word of length k only optimally matches a symbol of probability <math>1/2^k</math> and other probabilities are not represented optimally, whereas the code word length in arithmetic coding can be made to exactly match the true probability of the symbol. This difference is especially striking for small alphabet sizes.
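The gap for small alphabets can be seen numerically. The sketch below compares entropy with the average length of a prefix code for two illustrative distributions chosen for this example: a dyadic one, where the codeword lengths match the probabilities exactly, and a skewed two-symbol source, where any prefix code must still spend one whole bit per symbol.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * log2(p) for p in probs if p > 0)

def average_length(probs, lengths):
    """Expected codeword length in bits per symbol."""
    return sum(p * l for p, l in zip(probs, lengths))

dyadic = [0.5, 0.25, 0.25]     # probabilities of the form 1/2**k
skewed = [0.9, 0.1]            # two symbols, far from powers of two

# Prefix-code lengths: 1, 2, 2 for the dyadic source; 1, 1 for the skewed one.
print(entropy(dyadic), average_length(dyadic, [1, 2, 2]))
print(entropy(skewed), average_length(skewed, [1, 1]))
```

For the skewed source the entropy is a little under half a bit per symbol, so a prefix code uses roughly twice the theoretical minimum.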

Prefix codes nevertheless remain in wide use because of their simplicity, high speed, and lack of patent coverage. They are often used as a "back-end" to other compression methods. Deflate (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and quantization followed by the use of prefix codes; these are often called "Huffman codes" even though most applications use pre-defined variable-length codes rather than codes designed using Huffman's algorithm.

