Information theory is the mathematical study of the quantification, storage, and communication of a particular type of mathematically defined information. The field was established and formalized by Claude Shannon in the 1940s, though early contributions were made in the 1920s through the works of Harry Nyquist and Ralph Hartley.
Information theory was initially formed in the context of telecommunication but soon found a wide range of other applications. It is now at the intersection of mathematics, statistics and computer science, and has applications in diverse fields ranging from electrical engineering and physics to neurobiology.
As a simple example of the concept, if one flips a fair coin and does not yet know the outcome (heads or tails), then they lack a certain amount of information. After looking at the coin, they gain information about the outcome. For a fair coin, the probability of either heads or tails is 1/2 and the amount of information is expressed as <math>-\log_2(1/2)</math> = 1 bit of information.
A key concept in information theory is entropy. In Shannon's formulation, entropy measures the lack of information about an event. In the coin flip example above, the entropy while the outcome is unknown is 1 bit. Once the coin has landed and the outcome is known, the entropy is zero because one bit of information has been gained.
Information theory has been used in a wide range of applications, such as source coding/data compression (e.g. for ZIP files), and channel coding/error detection and correction (e.g. for DSL). Its impact has been crucial to the success of the Voyager missions to deep space, the invention of the compact disc, the feasibility of mobile phones and the development of the Internet and artificial intelligence. The theory has also found applications in other areas, including cryptography, neurobiology, perception, signal processing, the function of molecular codes (bioinformatics), thermal physics, molecular dynamics, black holes, quantum computing, information retrieval, intelligence gathering, plagiarism detection, pattern recognition, anomaly detection, the analysis of music, art creation, imaging system design, the study of outer space, the dimensionality of space, and epistemology.
Overview
Information theory, as conceived by Claude Shannon, studies the processing and utilization of information within a probabilistic context. Abstractly, in this approach information can be thought of as the resolution of uncertainty. In the case of communication of information over a noisy channel, this abstract concept was formalized in 1948 by Claude Shannon in a paper entitled A Mathematical Theory of Communication, in which information is thought of as a set of possible messages, and the goal is to send these messages over a noisy channel, and to have the receiver reconstruct the message with low probability of error, in spite of the channel noise. Shannon's main result, the noisy-channel coding theorem, showed that, in the limit of many channel uses, the rate of information that is asymptotically achievable is equal to the channel capacity, a quantity dependent merely on the statistics of the channel over which the messages are sent.
Alongside source codes and channel codes, a third class of information-theoretic codes is cryptographic algorithms (both codes and ciphers). Concepts, methods and results from coding theory and information theory are widely used in cryptography and cryptanalysis; one example is the ban, a unit of information.
Historical background
The landmark event establishing the discipline of information theory and bringing it to immediate worldwide attention was the publication of Claude Shannon's classic paper "A Mathematical Theory of Communication" in the Bell System Technical Journal in July and October 1948. Historian James Gleick rated the paper as the most important development of 1948, noting that the paper was "even more profound and more fundamental" than the transistor. Shannon came to be known as the "father of information theory". He had outlined some of his initial ideas of information theory as early as 1939 in a letter to Vannevar Bush.
In Shannon's revolutionary and groundbreaking paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with the assertion:
:"The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point."
With it came the ideas of:
- The information entropy and redundancy of a source, and its relevance through the source coding theorem;
- The mutual information, and the channel capacity of a noisy channel, including the promise of perfect loss-free communication given by the noisy-channel coding theorem;
- The practical result of the Shannon–Hartley law for the channel capacity of a Gaussian channel; as well as
- The bit—a new way of seeing the most fundamental unit of information.
Quantities of information
Information theory is based on probability theory and statistics, where quantified information is usually described in terms of bits. Information theory often concerns itself with measures of information of the distributions associated with random variables. One of the most important measures is called entropy, which forms the building block of many other measures. Entropy quantifies the information in a single random variable.
Another useful concept is mutual information, defined on two random variables, which quantifies the dependence between those variables by comparing their conditional and unconditional distributions. Entropy is a property of the probability distribution of a single random variable and gives a limit on the rate at which data generated by independent samples with the given distribution can be reliably compressed. Mutual information is a property of the joint distribution of two random variables and is the maximum rate of reliable communication across a noisy channel in the limit of long block lengths, when the channel statistics are determined by the joint distribution.
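As a minimal sketch, mutual information can be computed from a joint probability table using the standard definition <math>I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}</math>. The function name and the two toy distributions below are illustrative assumptions, not part of any library.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint pmf given as a nested list joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]    # marginal distribution of Y
    return sum(
        pxy * math.log2(pxy / (px[x] * py[y]))
        for x, row in enumerate(joint)
        for y, pxy in enumerate(row)
        if pxy > 0                            # 0 log 0 is taken to be 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]  # Y reveals nothing about X
identical = [[0.5, 0.0], [0.0, 0.5]]        # Y always equals X

print(mutual_information(independent))  # 0.0
print(mutual_information(identical))    # 1.0
```

Independent variables share no information, while a perfect copy of a fair bit shares exactly one bit.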
The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. A common unit of information is the bit or shannon, based on the binary logarithm. Other units include the nat, which is based on the natural logarithm, and the decimal digit, which is based on the common logarithm.
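Because the units differ only in the base of the logarithm, converting between them is a change of base. The following sketch illustrates this; the helper name `convert_information` and the unit labels are hypothetical choices made for this example.

```python
import math

# Logarithm base associated with each unit of information.
_BASES = {"bit": 2.0, "nat": math.e, "hartley": 10.0}

def convert_information(value, from_unit, to_unit):
    """Convert an information quantity between bits, nats and hartleys."""
    nats = value * math.log(_BASES[from_unit])   # express the value in nats
    return nats / math.log(_BASES[to_unit])      # rescale to the target base

print(convert_information(1.0, "bit", "nat"))      # ln 2, about 0.693
print(convert_information(1.0, "hartley", "bit"))  # log2(10), about 3.32
```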
In what follows, an expression of the form <math>p \log p</math> is considered by convention to be equal to zero whenever <math>p = 0</math>. This is justified because <math>\lim_{p \rightarrow 0^{+}} p \log p = 0</math> for any logarithmic base.
Entropy of an information source
Based on the probability mass function of a source, the Shannon entropy H, in units of bits per symbol, is defined as the expected value of the information content of the symbols.
The amount of information conveyed by an individual source symbol <math>x_{i}</math> with probability <math>p_{i}</math> is known as its self-information or surprisal, <math>I(p_{i})</math>. This quantity is defined as:
:<math>I(p_i) = -\log_2(p_i)</math>
A less probable symbol has a larger surprisal, meaning its occurrence provides more information. The entropy is the expected surprisal, with each symbol weighted by its probability:
:<math>H(X) \ = \ \mathbb{E}_{X}[I(x)] \ = \ \sum_{i} p_i I(p_i) \ = \ -\sum_{i} p_i \log_2(p_i)</math>
Intuitively, the entropy <math>H(X)</math> of a discrete random variable is a measure of the amount of uncertainty associated with the value of <math>X</math> when only its distribution is known.
For example, if one transmits 1000 bits (0s and 1s), and the value of each of these bits is known to the receiver (has a specific value with certainty) ahead of transmission, no information is transmitted. If, however, each bit is independently and equally likely to be 0 or 1, 1000 shannons of information (more often called bits) have been transmitted.
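The definitions above and the 1000-bit example can be sketched in a few lines. The function names are illustrative, and the code assumes the input is a valid probability mass function.

```python
import math

def surprisal(p):
    """Self-information -log2(p) of an outcome with probability p, in bits."""
    return -math.log2(p)

def entropy(pmf):
    """Shannon entropy in bits; terms with p == 0 contribute zero by convention."""
    return sum(p * surprisal(p) for p in pmf if p > 0)

fair_bit = [0.5, 0.5]    # each value equally likely
known_bit = [1.0, 0.0]   # value known to the receiver in advance

print(1000 * entropy(fair_bit))   # 1000.0 bits for 1000 independent fair bits
print(1000 * entropy(known_bit))  # 0.0 bits when every bit is known beforehand
```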
Figure: The entropy of a Bernoulli trial as a function of success probability, often called the binary entropy function, <math>H_\mathrm{b}(p)</math>. The entropy is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss.
Properties
A key property of entropy is that it is maximized when all the messages in the message space are equiprobable. For a source with <math>n</math> possible symbols, where <math display="inline">p_{i} = \frac{1}{n}</math> for all <math>i</math>, the entropy is given by:
:<math>H(X) = \log_2(n)</math>
This maximum value represents the most unpredictable state.
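The value follows directly from substituting the uniform probabilities into the definition of entropy. As a worked example, the case <math>n = 8</math> is an arbitrary illustration:

```latex
H(X) = -\sum_{i=1}^{n} \frac{1}{n} \log_2 \frac{1}{n}
     = -\log_2 \frac{1}{n}
     = \log_2 n ,
\qquad \text{e.g. } n = 8 \;\Rightarrow\; H(X) = \log_2 8 = 3 \text{ bits.}
```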
Units
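The choice of the logarithmic base in the entropy formula determines the unit of entropy used:
- A base-2 logarithm measures entropy in bits, or shannons, per symbol.
- A natural logarithm measures entropy in nats per symbol.
- Other bases are also possible. A base-10 logarithm measures entropy in decimal digits, or hartleys, per symbol.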
Binary entropy function
The special case of information entropy for a random variable with two outcomes (a Bernoulli trial) is the binary entropy function. This is typically calculated using a base-2 logarithm, and its unit is the shannon. If one outcome has probability <math>p</math>, the other has probability <math>1-p</math>. The entropy is given by:
:<math>H_{\mathrm{b}}(p) = -p \log_2 p - (1-p)\log_2 (1-p)</math>
This function is depicted in the plot shown above, reaching its maximum of 1 bit when <math>p = 0.5</math>, corresponding to the highest uncertainty.
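A minimal sketch of the binary entropy function follows. The function name is illustrative, and the endpoints are handled with the <math>0 \log 0 = 0</math> convention.

```python
import math

def binary_entropy(p):
    """Binary entropy H_b(p) in bits (shannons); H_b(0) = H_b(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Tabulate a few values; the function is symmetric about p = 0.5.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 4))
```

Evaluating the function at <math>p = 0.5</math> returns exactly 1 bit, and values at <math>p</math> and <math>1-p</math> coincide.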
Joint entropy
The joint entropy of two discrete random variables <math>X</math> and <math>Y</math> is merely the entropy of their pairing: <math>(X, Y)</math>. This implies that if <math>X</math> and <math>Y</math> are independent, then their joint entropy is the sum of their individual entropies.
For example, if <math>(X, Y)</math> represents the position of a chess piece (<math>X</math> the row and <math>Y</math> the column), then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece.
:<math>H(X, Y) = \mathbb{E}_{X,Y} [-\log p(x,y)] = - \sum_{x, y} p(x, y) \log p(x, y) \,</math>
Despite similar notation, joint entropy should not be confused with cross-entropy.
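The additivity for independent variables can be checked numerically. The sketch below assumes, purely for illustration, that the row and column of the chess piece are independent and uniform over eight values.

```python
import math
from itertools import product

def entropy(probabilities):
    """Shannon entropy in bits of a list of probabilities."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Assumed for illustration: row and column independent, each uniform over 8 values.
row = [1 / 8] * 8
col = [1 / 8] * 8
joint = [pr * pc for pr, pc in product(row, col)]   # p(x, y) = p(x) p(y)

print(entropy(row), entropy(col), entropy(joint))  # 3.0 3.0 6.0
```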
The joint entropy of <math>n</math> discrete random variables <math>X^n \triangleq (X_1, X_2, \ldots, X_n) </math> is
:<math> H(X^n) = H(X_1, X_2, \ldots, X_n) = \mathbb{E} \left[-\log P_{X_1,\ldots, X_n} (X_1,\ldots,X_n)\right]</math>
This can also be represented as a summation of their joint probability mass function:
:<math>
H(X^n) = -\sum_{x_1} \cdots \sum_{x_n} P_{X_1,\ldots,X_n}(x_1,\ldots,x_n) \log P_{X_1,\ldots,X_n}(x_1,\ldots,x_n) </math>.
Thus, joint entropy is just a subcase of entropy where the random variable is a vector giving values in the product space.
Directed information has applications in problems where causality plays an important role, such as the capacity of discrete memoryless networks with feedback, gambling with causal side information, compression with causal side information, real-time control communication settings, and statistical physics.
Other quantities
Other important information theoretic quantities include the Rényi entropy and the Tsallis entropy (generalizations of the concept of entropy), differential entropy (a generalization of quantities of information to continuous distributions), and the conditional mutual information. Also, pragmatic information has been proposed as a measure of how much information has been used in making a decision.
Coding theory
Figure: Scratches on the readable surface of a CD-R. Music and data CDs are coded using error-correcting codes and thus can still be read even if they have minor scratches.
Coding theory is one of the most important and direct applications of information theory. It can be subdivided into source coding theory and channel coding theory. Using a statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source.
- Data compression (source coding): There are two formulations for the compression problem:
  - Lossless data compression: the data must be reconstructed exactly;
  - Lossy data compression: allocates bits needed to reconstruct the data, within a specified fidelity level measured by a distortion function. This subset of information theory is called rate–distortion theory.
- Error-correcting codes (channel coding): While data compression removes as much redundancy as possible, an error-correcting code adds just the right kind of redundancy (i.e., error correction) needed to transmit the data efficiently and faithfully across a noisy channel.
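The link between lossless compression and entropy can be illustrated with a Huffman code, a standard prefix code that is not described in this article and is used here only as an example. The sketch below computes codeword lengths for an assumed four-symbol source and compares the average length with the source entropy.

```python
import heapq
import math

def huffman_code_lengths(pmf):
    """Return the codeword length of each symbol in a binary Huffman code."""
    # Heap entries: (probability, tie_breaker, symbols contained in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(pmf)]
    heapq.heapify(heap)
    lengths = [0] * len(pmf)
    counter = len(pmf)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # two least probable subtrees
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1               # each merge adds one bit to these symbols
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

pmf = [0.5, 0.25, 0.125, 0.125]           # illustrative source distribution
lengths = huffman_code_lengths(pmf)
average = sum(p * l for p, l in zip(pmf, lengths))
entropy = -sum(p * math.log2(p) for p in pmf)
print(lengths, average, entropy)  # [1, 2, 3, 3] 1.75 1.75
```

For this source the probabilities are powers of 1/2, so the average codeword length meets the entropy exactly; for other sources the average length of a symbol-by-symbol code is somewhat larger than the entropy.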
This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems that justify the use of bits as the universal currency for information in many contexts. However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the broadcast channel) or intermediary "helpers" (the relay channel), or more general networks, compression followed by transmission may no longer be optimal. For general sources and channels that are not necessarily stationary or ergodic, information-spectrum methods characterize coding limits using asymptotic distributions of information density rather than only single-letter entropies or mutual information. A related problem, channel resolvability, asks what rate is required for channel inputs to approximate a target output distribution; Han and Sergio Verdú connected this approximation problem to coding theorems for general channels.
Hayashi later derived general nonasymptotic and asymptotic formulas connecting channel resolvability and identification capacity, and applied these formulas to secrecy analysis for the wiretap channel.
Source theory
Any process that generates successive messages can be considered a source of information. A memoryless source is one in which each message is an independent identically distributed random variable, whereas the properties of ergodicity and stationarity impose less restrictive constraints. All such sources are stochastic. These terms are well studied in their own right outside information theory.
Rate
Information rate is the average entropy per symbol. For memoryless sources, this is merely the entropy of each symbol, while, in the case of a stationary stochastic process, it is:
:<math>r = \lim_{n \to \infty} H(X_n|X_{n-1},X_{n-2},X_{n-3}, \ldots);</math>
that is, the conditional entropy of a symbol given all the previous symbols generated. For the more general case of a process that is not necessarily stationary, the average rate is:
:<math>r = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \dots X_n);</math>
that is, the limit of the joint entropy per symbol. For stationary sources, these two expressions give the same result.
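The agreement of the two expressions can be observed numerically for a simple stationary source. The sketch below assumes a two-state first-order Markov chain with illustrative transition probabilities; for such a chain the conditional entropy given the whole past reduces to the conditional entropy given the previous symbol.

```python
import math
from itertools import product

def h(probabilities):
    """Shannon entropy in bits of a list of probabilities."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical two-state stationary Markov source (values chosen for illustration).
P = [[0.9, 0.1],
     [0.4, 0.6]]
mu = [0.8, 0.2]   # stationary distribution, satisfies mu = mu P

# Conditional-entropy form: H(X_n | X_{n-1}) for a first-order Markov chain.
rate_conditional = sum(mu[i] * h(P[i]) for i in range(2))

def joint_entropy_per_symbol(n):
    """(1/n) H(X_1, ..., X_n), computed by enumerating all 2**n sequences."""
    probs = []
    for seq in product(range(2), repeat=n):
        p = mu[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= P[a][b]
        probs.append(p)
    return h(probs) / n

for n in (1, 2, 4, 8, 12):
    print(n, round(joint_entropy_per_symbol(n), 4))
print("limit", round(rate_conditional, 4))
```

The per-symbol joint entropy decreases with <math>n</math> and approaches the conditional-entropy value.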
For a pair of processes, such as the input and output of a channel, the mutual information rate is defined as:
:<math>r = \lim_{n \to \infty} \frac{1}{n} I(X_1, X_2, \dots X_n;Y_1,Y_2, \dots Y_n);</math>
It is common in information theory to speak of the "rate" or "entropy" of a language. This is appropriate, for example, when the source of information is English prose. The rate of a source of information is related to its redundancy and how well it can be compressed, the subject of source coding.
Channel capacity
Communication over a channel is the primary motivation of information theory. However, channels often fail to produce exact reconstruction of a signal; noise, periods of silence, and other forms of signal corruption often degrade quality.
Consider the communications process over a discrete channel. A simple model of the process is shown below:
:<math title="Channel model">
\xrightarrow[\text{Message}]{W}
\begin{array}{ |c| }\hline \text{Encoder} \\ f_n \\ \hline\end{array} \xrightarrow[\mathrm{Encoded \atop sequence}]{X^n} \begin{array}{ |c| }\hline \text{Channel} \\ p(y|x) \\ \hline\end{array} \xrightarrow[\mathrm{Received \atop sequence}]{Y^n} \begin{array}{ |c| }\hline \text{Decoder} \\ g_n \\ \hline\end{array} \xrightarrow[\mathrm{Estimated \atop message}]{\hat W}</math>
Here <math>X</math> represents the space of messages transmitted, and <math display="inline">Y</math> the space of messages received during a unit time over our channel. Let <math>p(y|x)</math> be the conditional probability distribution function of <math display="inline">Y</math> given <math>X</math>. We will consider <math>p(y|x)</math> to be an inherent fixed property of our communications channel (representing the nature of the noise of our channel). Then the joint distribution of <math>X</math> and <math display="inline">Y</math> is completely determined by our channel and by our choice of <math>f(x)</math>, the marginal distribution of messages we choose to send over the channel. Under these constraints, we would like to maximize the rate of information, or the signal, we can communicate over the channel. The appropriate measure for this is the mutual information, and this maximum mutual information is called the channel capacity and is given by:
:<math> C = \max_{f} I(X;Y).\! </math>
This capacity has the following property related to communicating at information rate R (where R is usually bits per symbol). For any information rate R < C and coding error ε > 0, for large enough N, there exists a code of length N and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε; that is, it is always possible to transmit with arbitrarily small block error. In addition, for any rate R > C, it is impossible to transmit with arbitrarily small block error.
Channel coding is concerned with finding such nearly optimal codes that can be used to transmit data over a noisy channel with a small coding error at a rate near the channel capacity.
Capacity of particular channel models
- A continuous-time analog communications channel subject to Gaussian noise—see Shannon–Hartley theorem.
- A binary symmetric channel (BSC) with crossover probability p is a binary input, binary output channel that flips the input bit with probability p. The BSC has a capacity of <math>1 - H_\mathrm{b}(p)</math> bits per channel use, where <math>H_\mathrm{b}</math> is the binary entropy function to the base-2 logarithm.
- A binary erasure channel (BEC) with erasure probability p is a binary input, ternary output channel. The possible channel outputs are 0, 1, and a third symbol 'e' called an erasure. The erasure represents complete loss of information about an input bit. The capacity of the BEC is <math>1 - p</math> bits per channel use.
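The definition <math>C = \max_{f} I(X;Y)</math> can be checked against the closed-form capacities <math>1 - H_\mathrm{b}(p)</math> for the BSC and <math>1 - p</math> for the BEC. The sketch below uses a simple grid search over binary input distributions; the function names, the grid resolution and the channel parameters are assumptions made for illustration, and a grid search is not an efficient general method.

```python
import math

def mutual_information(input_pmf, channel):
    """I(X;Y) in bits for input pmf f(x) and channel[x][y] = p(y|x)."""
    n_out = len(channel[0])
    py = [sum(input_pmf[x] * channel[x][y] for x in range(len(channel)))
          for y in range(n_out)]
    total = 0.0
    for x, fx in enumerate(input_pmf):
        for y in range(n_out):
            pyx = channel[x][y]
            if fx > 0 and pyx > 0:
                total += fx * pyx * math.log2(pyx / py[y])
    return total

def capacity_binary_input(channel, steps=1000):
    """Approximate C = max_f I(X;Y) by a grid search over f = (q, 1 - q).

    Returns the best mutual information found and the q that attains it.
    """
    return max(
        (mutual_information([k / steps, 1 - k / steps], channel), k / steps)
        for k in range(steps + 1)
    )

bsc = [[0.9, 0.1], [0.1, 0.9]]                 # crossover probability 0.1
bec = [[0.75, 0.25, 0.0], [0.0, 0.25, 0.75]]   # outputs 0, e, 1; erasure probability 0.25

print(capacity_binary_input(bsc))  # close to 1 - H_b(0.1), about 0.531, at q = 0.5
print(capacity_binary_input(bec))  # close to 1 - 0.25 = 0.75, at q = 0.5
```

In both cases the maximum is attained by the uniform input distribution.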
Channels with memory and directed information
In practice many channels have memory. Namely, at time <math> i </math> the channel is given by the conditional probability <math> P(y_i|x_i,x_{i-1},x_{i-2},\ldots,x_1,y_{i-1},y_{i-2},\ldots,y_1) </math>.
It is often more convenient to use the notation <math> x^i=(x_i,x_{i-1},x_{i-2},\ldots,x_1) </math>, with which the channel becomes <math> P(y_i|x^i,y^{i-1}) </math>.
In such a case the capacity is given by the mutual information rate when no feedback is available, and by the directed information rate whether or not feedback is available (if there is no feedback, the directed information equals the mutual information).
Fungible information
Fungible information is information for which the means of encoding is not important. Classical information theorists and computer scientists are mainly concerned with information of this sort. It is sometimes referred to as speakable information.
Applications to other fields
Network physiology
Information theory concepts, methods and approaches have broad applications in network physiology, a field which provides a quantitative framework, based on adaptive networks of dynamical systems, to investigate how physiological systems exchange, process, and integrate information as a network to (i) coordinate their functions across levels and scales (from sub-cellular to organs and organism level) and (ii) generate distinct physiological states in health and disease. Through measures such as mutual information, transfer entropy, and co-information, information theory enables the detection of coupling strength, directionality, synergy/redundancy and higher-order interactions among physiological systems and sub-systems, revealing how network cross-communication and regulation occur within the organism. Applications of information-theoretic approaches include the analysis of information transfer between brain and body networks during various states, cardio-respiratory interactions, cardio-muscular interactions, cortico-muscular interactions, brain wave interactions and brain functional networks, and network physiology in extreme environments.
Intelligence uses and secrecy applications
Information theoretic concepts apply to cryptography and cryptanalysis. Turing's information unit, the ban, was used in the Ultra project, breaking the German Enigma machine code and hastening the end of World War II in Europe. Shannon himself defined an important concept now called the unicity distance. Based on the redundancy of the plaintext, it attempts to give a minimum amount of ciphertext necessary to ensure unique decipherability.
Information theory leads us to believe it is much more difficult to keep secrets than it might first appear. A brute force attack can break systems based on asymmetric key algorithms or on most commonly used methods of symmetric key algorithms (sometimes called secret key algorithms), such as block ciphers. The security of all such methods comes from the assumption that no known attack can break them in a practical amount of time.
Information theoretic security refers to methods such as the one-time pad that are not vulnerable to such brute force attacks. In such cases, the positive conditional mutual information between the plaintext and ciphertext (conditioned on the key) can ensure proper transmission, while the unconditional mutual information between the plaintext and ciphertext remains zero, resulting in absolutely secure communications. In other words, an eavesdropper would not be able to improve his or her guess of the plaintext by gaining knowledge of the ciphertext but not of the key. However, as in any other cryptographic system, care must be used to correctly apply even information-theoretically secure methods; the Venona project was able to crack the one-time pads of the Soviet Union due to their improper reuse of key material.
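The zero mutual information between plaintext and ciphertext can be verified for a single bit. The sketch below implements a byte-wise XOR one-time pad and then computes the joint distribution of a biased plaintext bit and its ciphertext bit under a uniformly random key bit; the message, the bias of 0.9 and the function name are illustrative assumptions.

```python
import math
import secrets

def otp_xor(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the key byte; the same call encrypts and decrypts."""
    if len(key) != len(data):
        raise ValueError("a one-time pad key must be as long as the message")
    return bytes(d ^ k for d, k in zip(data, key))

message = b"ATTACK AT DAWN"
key = secrets.token_bytes(len(message))   # fresh uniformly random key, used once
ciphertext = otp_xor(message, key)
assert otp_xor(ciphertext, key) == message

# Single-bit check: plaintext bit M with P(M=1) = 0.9, key bit K uniform, C = M xor K.
pm = [0.1, 0.9]
joint = [[0.0, 0.0], [0.0, 0.0]]          # joint[m][c] = P(M=m, C=c)
for m in (0, 1):
    for k in (0, 1):
        joint[m][m ^ k] += pm[m] * 0.5
pc = [joint[0][c] + joint[1][c] for c in (0, 1)]
mi = sum(joint[m][c] * math.log2(joint[m][c] / (pm[m] * pc[c]))
         for m in (0, 1) for c in (0, 1))
print(mi)   # 0.0: the ciphertext bit carries no information about the plaintext bit
```

Even though the plaintext bit is heavily biased, the ciphertext bit is uniform and independent of it, which is why reusing the key, rather than the cipher itself, was the weakness exploited in practice.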
Pseudorandom number generation
Pseudorandom number generators are widely available in computer language libraries and application programs. They are, almost universally, unsuited to cryptographic use as they do not evade the deterministic nature of modern computer equipment and software. A class of improved random number generators is termed cryptographically secure pseudorandom number generators, but even they require random seeds external to the software to work as intended. These can be obtained via extractors, if done carefully. The measure of sufficient randomness in extractors is min-entropy, a value related to Shannon entropy through Rényi entropy; Rényi entropy is also used in evaluating randomness in cryptographic systems. Although related, the distinctions among these measures mean that a random variable with high Shannon entropy is not necessarily satisfactory for use in an extractor and therefore for cryptographic use.
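The gap between the two measures can be made concrete with a small example. The sketch below uses the standard definition of min-entropy, <math>-\log_2 \max_i p_i</math>, and a hypothetical source in which one value occurs half the time.

```python
import math

def shannon_entropy(pmf):
    """Shannon entropy in bits."""
    return sum(-p * math.log2(p) for p in pmf if p > 0)

def min_entropy(pmf):
    """Min-entropy -log2(max p): determined by the single most likely outcome."""
    return -math.log2(max(pmf))

# Hypothetical source: one value occurs half the time, the remaining probability
# is spread uniformly over 2**10 other values.
n = 2 ** 10
pmf = [0.5] + [0.5 / n] * n

print(shannon_entropy(pmf))  # 6.0 bits
print(min_entropy(pmf))      # 1.0 bit
```

An attacker who always guesses the most likely value succeeds half the time, which the min-entropy of 1 bit reflects and the Shannon entropy of 6 bits does not.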
Seismic exploration
One early commercial application of information theory was in the field of seismic oil exploration. Work in this field made it possible to strip off and separate the unwanted noise from the desired seismic signal. Information theory and digital signal processing offer a major improvement of resolution and image clarity over previous analog methods.
Semiotics
Semioticians Doede Nauta and Winfried Nöth both considered Charles Sanders Peirce as having created a theory of information in his works on semiotics. Nauta defined semiotic information theory as the study of "the internal processes of coding, filtering, and information processing."
Integrated process organization of neural information
Quantitative information theoretic methods have been applied in cognitive science to analyze the integrated process organization of neural information in the context of the binding problem in cognitive neuroscience. In this context, either an information-theoretical measure, such as functional clusters (Gerald Edelman and Giulio Tononi's functional clustering model and dynamic core hypothesis (DCH)) or effective information (Tononi's integrated information theory (IIT) of consciousness), is defined (on the basis of a reentrant process organization, i.e. the synchronization of neurophysiological activity between groups of neuronal populations), or the measure of the minimization of free energy on the basis of statistical methods (Karl J. Friston's free energy principle (FEP), an information-theoretical measure which states that every adaptive change in a self-organized system leads to a minimization of free energy, and the Bayesian brain hypothesis).
Miscellaneous applications
Information theory also has applications in the search for extraterrestrial intelligence, black holes, bioinformatics, and gambling.
See also
- Algorithmic probability
- Bayesian inference
- Communication theory
- Constructor theory – a generalization of information theory that includes quantum information
- Formal science
- Inductive probability
- Info-metrics
- Minimum message length
- Minimum description length
- Philosophy of information
Applications
- Active networking
- Cryptanalysis
- Cryptography
- Cybernetics
- Entropy in thermodynamics and information theory
- Gambling
- Intelligence (information gathering)
- Seismic exploration
History
- Hartley, R.V.L.
- History of information theory
- Shannon, C.E.
- Timeline of information theory
- Yockey, H.P.
- Andrey Kolmogorov
Theory
- Coding theory
- Detection theory
- Estimation theory
- Fisher information
- Information algebra
- Information asymmetry
- Information field theory
- Information geometry
- Information theory and measure theory
- Kolmogorov complexity
- List of unsolved problems in information theory
- Logic of information
- Network coding
- Philosophy of information
- Quantum information science
- Source coding
Concepts
- Ban (unit)
- Channel capacity
- Communication channel
- Communication source
- Conditional entropy
- Covert channel
- Data compression
- Decoder
- Differential entropy
- Fungible information
- Information fluctuation complexity
- Information entropy
- Joint entropy
- Kullback–Leibler divergence
- Mutual information
- Pointwise mutual information (PMI)
- Receiver (information theory)
- Redundancy
- Rényi entropy
- Self-information
- Unicity distance
- Variety
- Hamming distance
- Perplexity
Further reading
The classic work
- Shannon, C.E. (1948), "A Mathematical Theory of Communication", Bell System Technical Journal, 27, pp. 379–423 & 623–656, July & October 1948.
- R.V.L. Hartley, "Transmission of Information", Bell System Technical Journal, July 1928.
- Andrey Kolmogorov (1968), "Three approaches to the quantitative definition of information" in International Journal of Computer Mathematics, 2, pp. 157–168.
Other journal articles
- J. L. Kelly Jr., "A New Interpretation of Information Rate", Bell System Technical Journal, Vol. 35, July 1956, pp. 917–26.
- R. Landauer, "Information is Physical", Proc. Workshop on Physics and Computation PhysComp'92 (IEEE Comp. Sci. Press, Los Alamitos, 1993), pp. 1–4.
Textbooks on information theory
- Alajaji, F. and Chen, P.N. An Introduction to Single-User Information Theory. Singapore: Springer, 2018.
- Arndt, C. Information Measures, Information and its Description in Science and Engineering (Springer Series: Signals and Communication Technology), 2004.
- Gallager, R. Information Theory and Reliable Communication. New York: John Wiley and Sons, 1968.
- Goldman, S. Information Theory. New York: Prentice Hall, 1953. New York: Dover, 1968, 2005.
- Csiszar, I, Korner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems Akademiai Kiado: 2nd edition, 1997.
- MacKay, David J. C. Information Theory, Inference, and Learning Algorithms Cambridge: Cambridge University Press, 2003.
- Mansuripur, M. Introduction to Information Theory. New York: Prentice Hall, 1987.
- McEliece, R. The Theory of Information and Coding. Cambridge, 2002.
- Pierce, JR. "An introduction to information theory: symbols, signals and noise". Dover (2nd Edition). 1961 (reprinted by Dover 1980).
- Stone, JV. Chapter 1 of book "Information Theory: A Tutorial Introduction", University of Sheffield, England, 2014.
- Yeung, RW. A First Course in Information Theory. Kluwer Academic/Plenum Publishers, 2002.
- Yeung, RW. Information Theory and Network Coding Springer 2008, 2002.
Other books
- Leon Brillouin, Science and Information Theory, Mineola, N.Y.: Dover, [1956, 1962] 2004.
- A. I. Khinchin, Mathematical Foundations of Information Theory, New York: Dover, 1957.
- H. S. Leff and A. F. Rex, Editors, Maxwell's Demon: Entropy, Information, Computing, Princeton University Press, Princeton, New Jersey (1990).
- Robert K. Logan. What is Information? - Propagating Organization in the Biosphere, the Symbolosphere, the Technosphere and the Econosphere, Toronto: DEMO Publishing.
- Tom Siegfried, The Bit and the Pendulum, Wiley, 2000.
- Charles Seife, Decoding the Universe, Viking, 2006.
- Jeremy Campbell, Grammatical Man, Touchstone/Simon & Schuster, 1982.
- Henri Theil, Economics and Information Theory, Rand McNally & Company - Chicago, 1967.
- Escolano, Suau, Bonev, Information Theory in Computer Vision and Pattern Recognition, Springer, 2009.
- Vlatko Vedral, Decoding Reality: The Universe as Quantum Information, Oxford University Press 2010.
External links
- Lambert F. L. (1999), "Shuffled Cards, Messy Desks, and Disorderly Dorm Rooms - Examples of Entropy Increase? Nonsense!", Journal of Chemical Education
- IEEE Information Theory Society and ITSOC Monographs, Surveys, and Reviews
