In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.
The Shannon information can be interpreted as quantifying the level of "surprise" of a particular outcome. As it is such a basic quantity, it also appears in several other settings, such as the length of a message needed to transmit the event given an optimal source coding of the random variable.
The Shannon information is closely related to entropy, which is the expected value of the self-information of a random variable, quantifying how surprising the random variable is "on average". This is the average amount of self-information an observer would expect to gain about a random variable when measuring it.
The information content can be expressed in various units of information, of which the most common is the "bit" (more formally called the shannon), as explained below.
The term 'perplexity' has been used in language modelling to quantify the uncertainty inherent in a set of prospective events.
Definition
Claude Shannon's definition of self-information was chosen to meet several axioms:
- An event with probability 100% is perfectly unsurprising and yields no information.
- The less probable an event is, the more surprising it is and the more information it yields.
- If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.
The detailed derivation is below, but it can be shown that there is a unique function of probability that meets these three axioms, up to a multiplicative scaling factor. Broadly, given a real number <math>b>1</math> and an event <math>x</math> with probability <math>P</math>, the information content is defined as the negative log probability:<math display="block">\mathrm{I}(x) := - \log_b{\left[\Pr{\left(x\right)}\right]} = -\log_b{\left(P\right)}. </math>The base <math>b</math> corresponds to the scaling factor above. Different choices of <math>b</math> correspond to different units of information: when <math>b=2</math>, the unit is the shannon (symbol Sh), often called a 'bit'; when <math>b = e</math>, the unit is the natural unit of information (symbol nat); and when <math>b = 10</math>, the unit is the hartley (symbol Hart).
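The definition and its units can be illustrated with a minimal Python sketch. The function name `information_content` and the example probability of 1/8 are assumptions made for illustration and do not come from any particular library.

```python
import math

def information_content(p: float, base: float = 2.0) -> float:
    """Return -log_base(p) for a probability p in [0, 1] (hypothetical helper)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if base <= 1.0:
        raise ValueError("base must be greater than 1")
    if p == 0.0:
        return math.inf  # convention: an impossible event is infinitely surprising
    if p == 1.0:
        return 0.0       # a certain event carries no information
    return -math.log(p) / math.log(base)

p = 0.125  # assumed example: an event with probability 1/8
in_shannons = information_content(p, base=2)       # about 3 Sh
in_nats = information_content(p, base=math.e)      # about 2.08 nat
in_hartleys = information_content(p, base=10)      # about 0.90 Hart
print(in_shannons, in_nats, in_hartleys)
```

Changing the base only rescales the result: the value in nats is the value in shannons multiplied by the natural logarithm of 2.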
Formally, given a discrete random variable <math>X</math> with probability mass function <math>p_{X}{\left(x\right)}</math>, the self-information of measuring <math>X</math> as outcome <math>x</math> is defined as:<math display="block">\operatorname{I}_{X}(x) := - \log{\left[p_{X}{\left(x\right)}\right]} = \log{\left(\frac{1}{p_{X}{\left(x\right)}}\right)}. </math>The use of the notation <math>I_X(x)</math> for self-information above is not universal. Since the notation <math>I(X;Y)</math> is also often used for the related quantity of mutual information, many authors use a lowercase <math>h_X(x)</math> for self-entropy instead, mirroring the use of the capital <math>H(X)</math> for the entropy.
Properties
Monotonically decreasing function of probability
For a given probability space, measurements of rarer events are intuitively more "surprising" and yield more information content than measurements of more common events. Thus, self-information is a strictly decreasing monotonic function of the probability, sometimes called an "antitonic" function.
While standard probabilities are represented by real numbers in the interval <math>[0, 1]</math>, self-information values are non-negative extended real numbers in the interval <math>[0, \infty]</math>. Specifically:
- An event with probability <math>\Pr(x) = 1</math> (a certain event) has an information content of <math>\mathrm{I}(x) = -\log_b(1) = 0</math>. Its occurrence is perfectly unsurprising and reveals no new information.
- An event with probability <math>\Pr(x) = 0</math> (an impossible event) has an information content of <math>\mathrm{I}(x) = -\log_b(0)</math>, which is undefined but is taken to be <math>\infty</math> by convention. This reflects that observing an event believed to be impossible would be infinitely surprising.
This monotonic relationship is fundamental to the use of information content as a measure of uncertainty. For example, learning that a one-in-a-million lottery ticket won provides far more information than learning that it lost (see also Lottery mathematics). This also establishes an intuitive connection to concepts like statistical dispersion: events that are far from the mean or typical outcome (and thus have low probability in many common distributions) have high self-information.
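The boundary cases and the lottery example can be checked numerically. The following is a rough sketch in which the helper name `surprisal_bits` and the list of sample probabilities are assumptions chosen for illustration.

```python
import math

def surprisal_bits(p: float) -> float:
    """Self-information in shannons of an event with probability p (hypothetical helper)."""
    if p == 0.0:
        return math.inf  # impossible event: infinite surprise by convention
    if p == 1.0:
        return 0.0       # certain event: no information
    return -math.log2(p)

p_win = 1e-6                       # one-in-a-million ticket
win = surprisal_bits(p_win)        # roughly 19.93 Sh
lose = surprisal_bits(1 - p_win)   # roughly 1.44e-6 Sh

# Increasing probabilities should give non-increasing surprisals.
probabilities = [0.0, 1e-6, 0.01, 0.5, 0.99, 1.0]
surprisals = [surprisal_bits(p) for p in probabilities]
print(win, lose, surprisals)
```

Learning that the ticket won is worth about 20 shannons, while learning that it lost is worth only about a millionth of a shannon.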
Relationship to log-odds
The Shannon information is closely related to the log-odds. The log-odds of an event <math>x</math>, with probability <math>p(x)</math>, is defined as the logarithm of the odds, <math>\frac{p(x)}{1-p(x)}</math>. This can be expressed as a difference of two information content values:<math display="block">{\displaystyle \begin{align} \text{log-odds}(x)
&= \ \log_b\left(\frac{p(x)}{1-p(x)}\right) \\
&= \ \log_b(p(x)) - \log_b(1-p(x)) \\
&= \ \ \mathrm{I}(\lnot x) \ - \ \mathrm{I}(x), \end{align} }</math>where <math>\lnot x</math> denotes the event not <math>x</math>.
This expression can be interpreted as the amount of information gained (or surprise) from learning the event did not occur, minus the information gained from learning it did occur. This connection is particularly relevant in statistical modeling where log-odds are the core of the logit function and logistic regression.
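The identity can be verified numerically for any probability strictly between 0 and 1. This is a minimal sketch; the function names `surprisal` and `log_odds` and the value 0.8 are assumptions for illustration.

```python
import math

def surprisal(p: float, base: float = 2.0) -> float:
    """Self-information -log_base(p) of an event with probability p in (0, 1]."""
    return -math.log(p) / math.log(base)

def log_odds(p: float, base: float = 2.0) -> float:
    """Logarithm of the odds p / (1 - p), for p strictly between 0 and 1."""
    return math.log(p / (1.0 - p)) / math.log(base)

p = 0.8  # assumed example probability of the event x
direct = log_odds(p)
from_information = surprisal(1.0 - p) - surprisal(p)  # I(not x) - I(x)
print(direct, from_information)
```

With odds of 4 to 1, both expressions come out at about 2 when the base is 2, up to floating-point rounding.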
Additivity of independent events
The information content of two independent events is the sum of each event's information content. This property is known as additivity in mathematics. Consider two independent random variables <math>X</math> and <math>Y</math> with probability mass functions <math>p_X(x)</math> and <math>p_Y(y)</math>. The joint probability of observing the outcome <math>(x, y)</math> is given by the product of the individual probabilities due to independence:<math display="block"> p_{X, Y}(x, y) = \Pr(X=x, Y=y) = p_X(x) \ p_Y(y)</math>The information content of this joint event is:<math display="block"> {\displaystyle \begin{align} \operatorname{I}_{X,Y}(x, y)
&= \ -\log_b \left[p_{X,Y}(x, y)\right] \\
&= \ -\log_b \left[p_X(x) \ p_Y(y)\right] \\
&= \ -\log_b \left[p_X(x)\right] \ - \ \log_b \left[p_Y(y)\right] \ \\
&= \ \ \operatorname{I}_X(x) \ + \ \operatorname{I}_Y(y), \end{align} }
</math>This additivity makes information content a more mathematically convenient measure than probability in many applications, such as in coding theory where the amount of information needed to describe a sequence of independent symbols is the sum of the information needed for each symbol.
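Additivity can be checked with a small numerical sketch. The pairing of a fair coin with a fair die, and the helper name `surprisal`, are assumptions chosen for illustration.

```python
import math

def surprisal(p: float, base: float = 2.0) -> float:
    """Self-information -log_base(p) of an outcome with probability p in (0, 1]."""
    return -math.log(p) / math.log(base)

p_heads = 1 / 2  # fair coin lands heads
p_six = 1 / 6    # fair die shows a six

# Independence: the joint probability is the product of the marginals.
p_joint = p_heads * p_six

joint_information = surprisal(p_joint)
summed_information = surprisal(p_heads) + surprisal(p_six)
print(joint_information, summed_information)
```

Both values are about 3.585 Sh, the base-2 logarithm of 12, because the logarithm turns the product of probabilities into a sum.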
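Relationship to entropy
The entropy of a discrete random variable <math>X</math> is the expected value of its self-information:<math display="block">\mathrm{H}(X) = \operatorname{E}\left[\operatorname{I}_X(X)\right] = -\sum_{x} p_X(x) \log_b p_X(x),</math>where the expectation is taken over the discrete values in the support of <math>X</math>.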
Sometimes, the entropy itself is called the "self-information" of the random variable, possibly because the entropy satisfies <math>\mathrm{H}(X) = \operatorname{I}(X; X)</math>, where <math>\operatorname{I}(X;X)</math> is the mutual information of <math>X</math> with itself.
For continuous random variables the corresponding concept is differential entropy.
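For the discrete case, the description of entropy as the expected self-information can be sketched in Python. The function name `entropy` and the two example distributions are assumptions for illustration.

```python
import math

def entropy(pmf: dict, base: float = 2.0) -> float:
    """Expected self-information of a discrete distribution given as {outcome: probability}."""
    total = 0.0
    for p in pmf.values():
        if p > 0.0:  # outcomes outside the support contribute nothing
            total += p * (-math.log(p) / math.log(base))
    return total

fair_coin = {"H": 0.5, "T": 0.5}
biased_coin = {"H": 0.9, "T": 0.1}  # assumed example of a predictable source

h_fair = entropy(fair_coin)      # about 1 Sh
h_biased = entropy(biased_coin)  # about 0.469 Sh
print(h_fair, h_biased)
```

The biased coin has lower entropy: its likely outcome is barely surprising, and its surprising outcome is rare, so the average surprise is small.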
Notes
This measure has also been called surprisal, as it represents the "surprise" of seeing the outcome (a highly improbable outcome is very surprising). The term (as a log-probability measure) was introduced by Edward W. Samson in his 1951 report "Fundamental natural concepts of information theory". An early appearance in the physics literature is in Myron Tribus' 1961 book Thermostatics and Thermodynamics.
When the event is a random realization of a variable, the self-information of the variable is defined as the expected value of the self-information of the realization.
Examples
Fair coin toss
Consider the Bernoulli trial of tossing a fair coin <math>X</math>. The probabilities of the events of the coin landing as heads <math>\text{H}</math> and tails <math>\text{T}</math> (see fair coin and obverse and reverse) are one half each, <math display="inline">p_X{(\text{H})} = p_X{(\text{T})} = \tfrac{1}{2} = 0.5</math>. Upon measuring the variable as heads, the associated information gain is
<math display="block">\operatorname{I}_X(\text{H})
= -\log_2 {p_X{(\text{H})}}
= -\log_2\!{\tfrac{1}{2}} = 1,</math>so the information gain of a fair coin landing as heads is 1 shannon.
