Conditional entropy

Figure: Venn diagram showing additive and subtractive relationships among various information measures associated with correlated variables <math>X</math> and <math>Y</math>. The total area covered by the two circles is the joint entropy <math>\Eta(X,Y)</math>. The circle on the left (red and violet) is the individual entropy <math>\Eta(X)</math>, with the red being the conditional entropy <math>\Eta(X|Y)</math>. The circle on the right (blue and violet) is <math>\Eta(Y)</math>, with the blue being <math>\Eta(Y|X)</math>. The violet is the mutual information <math>\operatorname{I}(X;Y)</math>.

In information theory, the conditional entropy quantifies the amount of information needed to describe the outcome of a random variable <math>Y</math> given that the value of another random variable <math>X</math> is known. Here, information is measured in shannons, nats, or hartleys. The "entropy of <math>Y</math> conditioned on <math>X</math>" is denoted as <math>\Eta(Y|X)</math>.

Definition

The conditional entropy of <math>Y</math> given <math>X</math> is defined as

:<math>\Eta(Y|X)\ = -\sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log \frac {p(x,y)} {p(x)}</math>

where <math>\mathcal X</math> and <math>\mathcal Y</math> denote the support sets of <math>X</math> and <math>Y</math>.

Note: Here, the convention is that the expression <math>0 \log 0</math> should be treated as being equal to zero. This is because <math>\lim_{\theta\to0^+} \theta\, \log \theta = 0</math>.

Intuitively, by the definitions of expected value and conditional probability, <math>\Eta(Y|X)</math> can be written as <math>\Eta(Y|X) = \mathbb{E}[f(X,Y)]</math>, where <math>f</math> is defined as <math>f(x,y) := -\log\left(\frac{p(x, y)}{p(x)}\right) = -\log(p(y|x))</math>. One can think of <math>f</math> as associating each pair <math>(x, y)</math> with the information content of the event <math>(Y=y)</math> given <math>(X=x)</math>, that is, the amount of information needed to describe <math>(Y=y)</math> once <math>(X=x)</math> is known. Hence, by taking the expected value of <math>f</math> over all pairs of values <math>(x, y) \in \mathcal{X} \times \mathcal{Y}</math>, the conditional entropy <math>\Eta(Y|X)</math> measures how much information, on average, is still needed to describe <math>Y</math> once <math>X</math> is known.
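As an illustration of the definition, the following minimal sketch computes <math>\Eta(Y|X)</math> in shannons from a joint probability mass function. The function name `conditional_entropy`, the dictionary representation of the joint distribution, and the example distribution are assumptions made for this illustration only.

```python
from math import log2

def conditional_entropy(joint):
    """Return H(Y|X) in shannons (bits) for a joint pmf given as {(x, y): p(x, y)}."""
    # Marginal p(x), obtained by summing the joint pmf over y.
    p_x = {}
    for (x, _y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
    # -sum of p(x, y) * log2(p(x, y) / p(x)); terms with p(x, y) = 0 are
    # skipped, following the convention 0 log 0 = 0.
    return -sum(p * log2(p / p_x[x]) for (x, _y), p in joint.items() if p > 0)

# Hypothetical example: X is a fair bit and Y equals X with probability 3/4.
joint = {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}
h = conditional_entropy(joint)
print(round(h, 4))  # about 0.8113 shannons
```

In this example, knowing <math>X</math> leaves the uncertainty of a biased coin with probabilities 3/4 and 1/4, which is about 0.8113 shannons.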

Motivation

Let <math>\Eta(Y|X=x)</math> be the entropy of the discrete random variable <math>Y</math> conditioned on the discrete random variable <math>X</math> taking a certain value <math>x</math>. Denote the support sets of <math>X</math> and <math>Y</math> by <math>\mathcal X</math> and <math>\mathcal Y</math>. Let <math>Y</math> have probability mass function <math>p_Y{(y)}</math>. The unconditional entropy of <math>Y</math> is calculated as <math>\Eta(Y) := \mathbb{E}[\operatorname{I}(Y)]</math>, i.e.

:<math>\Eta(Y) = \sum_{y\in\mathcal Y} \Pr(Y=y)\,\operatorname{I}(y) = -\sum_{y\in\mathcal Y} p_Y(y) \log_2 p_Y(y),</math>

where <math>\operatorname{I}(y)</math> is the information content of the outcome of <math>Y</math> taking the value <math>y</math>. The entropy of <math>Y</math> conditioned on <math>X</math> taking the value <math>x</math> is defined by:

:<math>\Eta(Y|X=x) = -\sum_{y\in\mathcal Y} \Pr(Y = y|X=x) \log_2 \Pr(Y = y|X=x).</math>

Note that <math>\Eta(Y|X)</math> is the result of averaging <math>\Eta(Y|X=x)</math> over all possible values <math>x</math> that <math>X</math> may take. Also, if the above sum is taken over a sample <math>y_1, \dots, y_n</math>, the expected value <math>E_X[ \Eta(y_1, \dots, y_n \mid X = x)]</math> is known in some domains as equivocation.

Given discrete random variables <math>X</math> with image <math>\mathcal X</math> and <math>Y</math> with image <math>\mathcal Y</math>, the conditional entropy of <math>Y</math> given <math>X</math> is defined as the weighted sum of <math>\Eta(Y|X=x)</math> for each possible value of <math>x</math>, using <math>p(x)</math> as the weights:

:<math>
\begin{align}
\Eta(Y|X)\ &\equiv \sum_{x\in\mathcal X}\,p(x)\,\Eta(Y|X=x)\\
& =-\sum_{x\in\mathcal X} p(x)\sum_{y\in\mathcal Y}\,p(y|x)\,\log_2\, p(y|x)\\
& =-\sum_{x\in\mathcal X, y\in\mathcal Y}\,p(x)p(y|x)\,\log_2\,p(y|x)\\
& =-\sum_{x\in\mathcal X, y\in\mathcal Y}\,p(x)p(y|x)\,\log_2\,\left(p(y|x)\frac{p(x)}{p(x)}\right)\\
& =-\sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log_2 \frac {p(x,y)} {p(x)}.
\end{align}
</math>
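The equality between the first and last lines of this derivation can be checked numerically. The sketch below is only an illustration: the distributions `p_x` and `p_y_given_x` and the helper `entropy` are hypothetical choices, not part of the derivation.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a sequence of probabilities (0 log 0 = 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical distributions: p(x), and p(y|x) listed for y = 0, 1.
p_x = {"a": 0.5, "b": 0.5}
p_y_given_x = {"a": [0.5, 0.5], "b": [1.0, 0.0]}

# First line of the derivation: weighted sum of H(Y|X=x) with weights p(x).
weighted = sum(p_x[x] * entropy(p_y_given_x[x]) for x in p_x)

# Last line of the derivation: -sum of p(x, y) * log2(p(x, y) / p(x)),
# with p(x, y) = p(x) * p(y|x).
joint_form = -sum(
    p_x[x] * q * log2(p_x[x] * q / p_x[x])
    for x in p_x
    for q in p_y_given_x[x]
    if q > 0
)

print(weighted, joint_form)  # both equal 0.5 up to rounding
```

Here <math>\Eta(Y|X=a)</math> is 1 shannon and <math>\Eta(Y|X=b)</math> is 0, so both forms give 0.5 shannons.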

Properties

Conditional entropy equals zero

:<math>\Eta(Y|X)=0</math> if and only if the value of <math>Y</math> is completely determined by the value of <math>X</math>.

Conditional entropy of independent random variables

Conversely, <math>\Eta(Y|X) = \Eta(Y)</math> if and only if <math>Y</math> and <math>X</math> are independent random variables.
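The two properties above can be illustrated numerically in one direction each. In this minimal sketch, the helper functions and both example distributions are assumptions chosen for illustration. The first distribution makes <math>Y</math> a function of <math>X</math>, and the second makes <math>X</math> and <math>Y</math> independent.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities (0 log 0 = 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(Y|X) in bits for a joint pmf given as {(x, y): p(x, y)}."""
    p_x = {}
    for (x, _y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return -sum(p * log2(p / p_x[x]) for (x, _y), p in joint.items() if p > 0)

# Y is a function of X (here Y = X mod 2), so X determines Y completely.
deterministic = {(0, 0): 0.25, (1, 1): 0.25, (2, 0): 0.25, (3, 1): 0.25}
h_det = conditional_entropy(deterministic)

# X and Y independent: p(x, y) = p(x) * p(y).
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.25, 1: 0.75}
independent = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
h_ind = conditional_entropy(independent)
h_y = entropy(p_y.values())

print(h_det)       # zero: nothing is left to describe once X is known
print(h_ind, h_y)  # equal up to rounding: X tells nothing about Y
```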

Chain rule

Assume that the combined system determined by two random variables <math>X</math> and <math>Y</math> has joint entropy <math>\Eta(X,Y)</math>, that is, we need <math>\Eta(X,Y)</math> bits of information on average to describe its exact state. Now if we first learn the value of <math>X</math>, we have gained <math>\Eta(X)</math> bits of information. Once <math>X</math> is known, we only need <math>\Eta(X,Y)-\Eta(X)</math> bits to describe the state of the whole system. This quantity is exactly <math>\Eta(Y|X)</math>, which gives the chain rule of conditional entropy:

:<math>\Eta(Y|X)\, = \, \Eta(X,Y)- \Eta(X).</math>
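The chain rule can be checked on a small example by computing each side separately. The joint distribution below is a hypothetical choice, and the helper `entropy` is introduced only for this sketch.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities (0 log 0 = 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical joint pmf p(x, y), stored as {(x, y): probability}.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.25}

# Marginal p(x), obtained by summing the joint pmf over y.
p_x = {}
for (x, _y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

h_xy = entropy(joint.values())  # joint entropy H(X,Y)
h_x = entropy(p_x.values())     # marginal entropy H(X)

# Conditional entropy computed directly from its definition.
h_y_given_x = -sum(p * log2(p / p_x[x]) for (x, _y), p in joint.items() if p > 0)

print(h_y_given_x, h_xy - h_x)  # the two values agree up to rounding
```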