Arithmetic coding

Arithmetic coding (AC) is a form of entropy coding used in lossless data compression. Normally, a string of characters is represented using a fixed number of bits per character, as in the ASCII code. When a string is converted to arithmetic encoding, frequently used characters are stored with fewer bits and less frequently occurring characters are stored with more bits, resulting in fewer bits used in total. Arithmetic coding differs from other forms of entropy encoding, such as Huffman coding, in that rather than separating the input into component symbols and replacing each with a code, arithmetic coding encodes the entire message into a single number, an arbitrary-precision fraction q, where <math>0 \le q < 1</math>.

Figure: An arithmetic coding example assuming a fixed probability distribution of three symbols "A", "B", and "C". Probability of "A" is 50%, probability of "B" is 33% and probability of "C" is 17%. Furthermore, we assume that the recursion depth is known in each step. In step one we code "B" which is inside the interval [0.5, 0.83): The binary number "0.10x" is the shortest code that represents an interval that is entirely inside [0.5, 0.83). "x" means an arbitrary bit sequence. There are two extreme cases: the smallest x stands for zero which represents the left side of the represented interval. Then the left side of the interval is dec(0.10) = 0.5. At the other extreme, x stands for a finite sequence of ones which has the upper limit dec(0.11) = 0.75. Therefore, "0.10x" represents the interval [0.5, 0.75) which is inside [0.5, 0.83). Now we can leave out the "0." part since all intervals begin with "0." and we can ignore the "x" part because no matter what bit-sequence it represents, we will stay inside [0.5, 0.75).

Relationship to entropy

Arithmetic coding achieves compression by subdividing the interval [0, 1) into sub-intervals proportional to symbol probabilities. When symbol probabilities are unequal, more probable symbols receive larger sub-intervals, which require fewer bits to specify a point within. The theoretical limit on this compression is given by the entropy of the source, which Shannon's source coding theorem establishes as the minimum average number of bits per symbol that any lossless method can achieve. Arithmetic coding approaches this limit closely, especially for long messages.
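The interval subdivision described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production coder: it assumes the three-symbol model from the figure caption (A = 50%, B = 33%, C = 17%), uses exact fractions instead of the fixed-precision integer arithmetic real implementations rely on, and the names `cumulative_ranges` and `encode_interval` are hypothetical.

```python
from fractions import Fraction

# Assumed model, taken from the figure caption: A = 50%, B = 33%, C = 17%.
MODEL = {"A": Fraction(50, 100), "B": Fraction(33, 100), "C": Fraction(17, 100)}

def cumulative_ranges(model):
    """Map each symbol to its [low, high) slice of [0, 1)."""
    ranges, low = {}, Fraction(0)
    for symbol, p in model.items():
        ranges[symbol] = (low, low + p)
        low += p
    return ranges

def encode_interval(message, model):
    """Narrow [0, 1) once per symbol and return the final [low, high)."""
    ranges = cumulative_ranges(model)
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        width = high - low
        sym_low, sym_high = ranges[symbol]
        # Both new bounds are computed from the old `low` before assignment.
        low, high = low + width * sym_low, low + width * sym_high
    return low, high

low, high = encode_interval("B", MODEL)
print(float(low), float(high))  # 0.5 0.83
```

Any number inside the final interval identifies the message (given its length or a terminator symbol), so a more probable message, which leaves a wider interval, can be pinned down with fewer bits.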

When all symbols are equally likely, each sub-interval has the same size, and no symbol can be represented with fewer bits than any other. In this case the entropy reaches its maximum of <math>\log_2 n</math> bits per symbol (where <math>n</math> is the alphabet size), and no compression is possible. For example, a stream of independent fair coin flips has entropy of exactly 1 bit per symbol — the full cost of storage — so arithmetic coding provides no benefit. Similarly, independent ternary symbols with equal probabilities have entropy of about 1.585 bits per symbol, the maximum for a three-symbol alphabet, and are likewise incompressible.
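The entropy figures quoted above can be checked numerically. The sketch below assumes independent symbols with known probabilities; `entropy_bits` is a hypothetical helper name, and the skewed binary source with probabilities 0.95 and 0.05 is included only for contrast with the uniform cases.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits per symbol; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_coin = entropy_bits([0.5, 0.5])
ternary = entropy_bits([1 / 3, 1 / 3, 1 / 3])
skewed = entropy_bits([0.95, 0.05])  # contrast case: far below 1 bit per symbol

print(fair_coin)          # 1.0
print(round(ternary, 3))  # 1.585
print(round(skewed, 3))   # 0.286
```

For the two uniform sources the entropy equals the cost of a fixed-length code, so there is nothing for an entropy coder to remove. The skewed source is the kind of input where arithmetic coding can help.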

To decode the value, knowing the original string had length 6, one can simply convert back to base 3, round to 6 digits, and recover the string.

Defining a model

In general, arithmetic coders can produce near-optimal output for any given set of symbols and probabilities. (The optimal value is <math>-\log_2 P</math> bits for each symbol of probability <math>P</math>; see Source coding theorem.)
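As a rough illustration of the optimal per-symbol cost, the following sketch sums <math>-\log_2 P</math> over a message. The model is an assumption borrowed from the figure caption (A = 50%, B = 33%, C = 17%), and `ideal_bits` is a hypothetical helper name.

```python
import math

# Assumed model for illustration: the probabilities from the figure caption.
MODEL = {"A": 0.50, "B": 0.33, "C": 0.17}

def ideal_bits(message, model):
    """Sum of -log2(P) over the symbols of the message."""
    return sum(-math.log2(model[symbol]) for symbol in message)

# Per-symbol ideal cost: the more probable the symbol, the fewer bits.
for symbol, p in MODEL.items():
    print(symbol, round(-math.log2(p), 2))

# The message total equals -log2 of the product of its symbol probabilities,
# which is the width of the final interval an arithmetic coder arrives at.
total = ideal_bits("BAC", MODEL)
width = 0.33 * 0.50 * 0.17
print(round(total, 3), round(-math.log2(width), 3))
```

Because the cost is attached to the whole message rather than to each symbol separately, an arithmetic coder is not forced to round each symbol's cost up to a whole number of bits.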

Encoding and decoding: overview

In general, each step of the encoding process, except for the last, is the same; the encoder has basically just three pieces of data to consider:

Encoding and decoding: example

Figure: A diagram showing decoding of 0.538 (the round dot) in the example model. The region is divided into subregions proportional to symbol frequencies, then the subregion containing the point is successively subdivided in the same way.

Consider the process for decoding a message encoded with the given four-symbol model. The message is encoded in the fraction 0.538 (using decimal for clarity, instead of binary; also assuming that there are only as many digits as needed to decode the message.)
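The decoding procedure can be sketched as follows. The four-symbol model is not reproduced in this section, so the code assumes a hypothetical one (NEUTRAL 60%, POSITIVE 20%, NEGATIVE 10%, END-OF-DATA 10%) with END-OF-DATA acting as the terminator. The symbol names, the probabilities, and the function `decode` are all assumptions made for illustration, and exact fractions stand in for the fixed-precision arithmetic a real decoder would use.

```python
from fractions import Fraction

# ASSUMED model (not given in this section): symbol -> probability.
MODEL = {
    "NEUTRAL": Fraction(6, 10),
    "POSITIVE": Fraction(2, 10),
    "NEGATIVE": Fraction(1, 10),
    "END-OF-DATA": Fraction(1, 10),
}

def decode(value, model, terminator="END-OF-DATA", max_symbols=1000):
    """Find the sub-interval containing `value`, emit its symbol, and recurse into it."""
    if not 0 <= value < 1:
        raise ValueError("encoded value must lie in [0, 1)")
    decoded = []
    low, high = Fraction(0), Fraction(1)
    while len(decoded) < max_symbols:
        width = high - low
        cursor = low
        for symbol, p in model.items():
            upper = cursor + width * p
            if cursor <= value < upper:
                decoded.append(symbol)
                low, high = cursor, upper  # narrow to the matching sub-interval
                break
            cursor = upper
        if decoded[-1] == terminator:
            break
    return decoded

print(decode(Fraction("0.538"), MODEL))
# ['NEUTRAL', 'NEGATIVE', 'END-OF-DATA']
```

Under this assumed model, 0.538 first falls in [0, 0.6), then in [0.48, 0.54), then in [0.534, 0.54), at which point the terminator is reached and decoding stops.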

When naively Huffman coding binary strings, no compression is possible, even if entropy is low (e.g. the symbols {0, 1} have probabilities {0.95, 0.05}). Huffman encoding assigns 1 bit to each value, resulting in a code of the same length as the input. By contrast, arithmetic coding compresses bits well, approaching the optimal compression ratio of <math>1 - [-0.95 \log_2(0.95) - 0.05 \log_2(0.05)] \approx 71.4\%</math>.

Pasco cites a pre-publication draft of Rissanen's article and comments on the relationship between their works.

JPEG's arithmetic coding patents have expired due to the age of the JPEG standard (the design of which was approximately completed by 1990). JPEG XL, as well as archivers like PackJPG, Brunsli and Lepton, which can losslessly convert Huffman-encoded files to ones using arithmetic coding (or asymmetric numeral systems in the case of JPEG XL), show size savings of up to 25%.

The JPEG image compression format's arithmetic coding algorithm is based on the following cited patents (since expired).

– (IBM) Filed 4 February 1986, granted 24 March 1987 – Kottappuram M. A. Mohiuddin, Jorma Johannes Rissanen – Multiplication-free multi-alphabet arithmetic code
– (IBM) Filed 18 November 1988, granted 27 February 1990 – Glen George Langdon, Joan L. Mitchell, William B. Pennebaker, Jorma Johannes Rissanen – Arithmetic coding encoder and decoder system
– (IBM) Filed 20 July 1988, granted 19 June 1990 – William B. Pennebaker, Joan L. Mitchell – Probability adaptation for arithmetic coders
JP Patent 1021672 – (Mitsubishi) Filed 21 January 1989, granted 10 August 1990 – Toshihiro Kimura, Shigenori Kino, Fumitaka Ono, Masayuki Yoshida – Coding system
JP Patent 2-46275 – (Mitsubishi) Filed 26 February 1990, granted 5 November 1991 – Fumitaka Ono, Tomohiro Kimura, Masayuki Yoshida, Shigenori Kino – Coding apparatus and coding method

Other patents (mostly also expired) related to arithmetic coding include the following.

– (IBM) Filed 4 March 1977, granted 24 October 1978 – Glen George Langdon, Jorma Johannes Rissanen – Method and means for arithmetic string coding
– (IBM) Filed 28 November 1979, granted 25 August 1981 – Glen George Langdon, Jorma Johannes Rissanen – Method and means for arithmetic coding utilizing a reduced number of operations
– (IBM) Filed 30 March 1981, granted 21 August 1984 – Glen George Langdon, Jorma Johannes Rissanen – High-speed arithmetic compression coding using concurrent value updating
– (IBM) Filed 15 September 1986, granted 2 January 1990 – Joan L. Mitchell, William B. Pennebaker – Arithmetic coding data compression/de-compression by selectively employed, diverse arithmetic coding encoders and decoders
JP Patent 11782787 – (NEC) Filed 13 May 1987, granted 18 November 1988 – Michio Shimada – Data compressing arithmetic encoding device
JP Patent 15015487 – (KDDI) Filed 18 June 1987, granted 22 December 1988 – Shuichi Matsumoto, Masahiro Saito – System for preventing carrying propagation in arithmetic coding
– (IBM) Filed 3 May 1988, granted 12 June 1990 – William B. Pennebaker, Joan L. Mitchell – Probability adaptation for arithmetic coders
– (IBM) Filed 19 June 1989, granted 29 January 1991 – Dan S. Chevion, Ehud D. Karnin, Eugeniusz Walach – Data string compression using arithmetic encoding with simplified probability subinterval estimation
– (IBM) Filed 5 January 1990, granted 24 March 1992 – William B. Pennebaker, Joan L. Mitchell – Probability adaptation for arithmetic coders
– (Ricoh) Filed 17 August 1992, granted 21 December 1993 – James D. Allen – Method and apparatus for entropy coding

Note: This list is not exhaustive. The Dirac codec uses arithmetic coding and is not patent pending.

Patents on arithmetic coding may exist in other jurisdictions; see software patents for a discussion of the patentability of software around the world.

Benchmarks and other technical characteristics

Every programmatic implementation of arithmetic encoding has a different compression ratio and performance. While compression ratios vary only a little (usually under 1%),