In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, since its algebraic properties allow expectations and covariances to be calculated by differentiation, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", as is the older term Koopman–Darmois family.
Sometimes loosely referred to as the exponential family, this class of distributions is distinctive because its members all possess a variety of desirable properties, most importantly the existence of a sufficient statistic.
The concept of exponential families is credited to E. J. G. Pitman, G. Darmois, and B. O. Koopman in 1935–1936. Exponential families of distributions provide a general framework for selecting a possible alternative parameterisation of a parametric family of distributions, in terms of natural parameters, and for defining useful sample statistics, called the natural sufficient statistics of the family.
Nomenclature difficulty
The terms "distribution" and "family" are often used loosely: Specifically, an exponential family is a set of distributions, where the specific distribution varies with the parameter; however, a parametric family of distributions is often referred to as "a distribution" (like "the normal distribution", meaning "the family of normal distributions"), and the set of all exponential families is sometimes loosely referred to as "the" exponential family.
Definition
Most of the commonly used distributions form an exponential family or subset of an exponential family, listed in the subsection below. The subsections following it are a sequence of increasingly more general mathematical definitions of an exponential family. A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of discrete or continuous probability distributions.
Examples of exponential family distributions
Exponential families include many of the most common distributions. Among many others, they include the following:
- normal
- exponential
- gamma
- chi-squared
- beta
- Dirichlet
- Bernoulli
- categorical
- Poisson
- Wishart
- inverse Wishart
- geometric
A number of common distributions are exponential families, but only when certain parameters are fixed and known. For example:
- binomial (with fixed number of trials)
- multinomial (with fixed number of trials)
- negative binomial (with fixed number of failures)
Note that in each case, the parameters which must be fixed are those that set a limit on the range of values that can possibly be observed.
Examples of common distributions that are not exponential families are Student's t, most mixture distributions, and even the family of uniform distributions when the bounds are not fixed. See the section below on examples for more discussion.
Scalar parameter
The value of <math> \theta </math> is called the parameter of the family.
A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, for the case of a discrete distribution) can be expressed in the form
<math display="block"> f_X{\left( x\, \big|\, \theta \right)} = h(x)\, \exp \left[ \eta(\theta) \cdot T(x) - A(\theta) \right] </math>
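where <math> T(x) </math>, <math> h(x) </math>, <math> \eta(\theta) </math>, and <math> A(\theta) </math> are known functions. The function <math> h(x) </math> must be non-negative.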
An alternative, equivalent form often given is
<math display="block"> f_X{\left( x\ \big|\ \theta \right)} = h(x) \, g(\theta) \, \exp \left[\eta(\theta) \cdot T(x)\right] </math>
or equivalently
<math display="block"> f_X{\left( x\ \big|\ \theta \right)} = \exp\left[ \eta(\theta) \cdot T(x) - A(\theta) + B(x) \right].</math>
In terms of log probability,
<math display="block">\log(f_X{\left( x\ \big|\ \theta \right)}) = \eta(\theta) \cdot T(x) - A(\theta) + B(x).</math>
Note that <math>g(\theta) = e^{-A(\theta)}</math> and <math>h(x) = e^{B(x)}</math>.
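As a minimal sketch of the scalar form, the Poisson distribution with rate <math>\lambda</math> can be written with <math>h(x) = 1/x!</math>, <math>\eta(\lambda) = \log\lambda</math>, <math>T(x) = x</math> and <math>A(\lambda) = \lambda</math>. The Python function names below are illustrative assumptions, not part of any library; the snippet only checks numerically that the factorized form agrees with the usual mass function.

```python
import math

def poisson_pmf_direct(x, lam):
    """Usual Poisson mass function: lam**x * exp(-lam) / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_pmf_expfam(x, lam):
    """Same mass function written as h(x) * exp(eta(theta) * T(x) - A(theta))."""
    h = 1.0 / math.factorial(x)   # base measure h(x)
    eta = math.log(lam)           # eta(theta) = log(lambda)
    T = x                         # sufficient statistic T(x) = x
    A = lam                       # log-partition A(theta) = lambda
    return h * math.exp(eta * T - A)

# Print both versions side by side for a few values of x.
for x in range(6):
    print(x, poisson_pmf_direct(x, 2.5), poisson_pmf_expfam(x, 2.5))
```

The two columns should agree up to floating-point rounding.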
Support must be independent of θ
Importantly, the support of <math> f_X{\left( x \big| \theta \right)} </math> (all the possible <math> x </math> values for which <math> f_X\!\left( x \big| \theta \right) </math> is greater than <math> 0 </math>) is required to not depend on <math> \theta ~.</math>
This requirement can be used to exclude a parametric family of distributions from being an exponential family.
For example: The Pareto distribution has a pdf which is defined for <math> x \geq x_{\mathsf m} </math> (the minimum value, <math> x_m\ ,</math> being the scale parameter) and its support, therefore, has a lower limit of <math> x_{\mathsf m} ~.</math> Since the support of <math> f_{\alpha, x_m}\!(x) </math> is dependent on the value of the parameter, the family of Pareto distributions does not form an exponential family of distributions (at least when <math> x_m </math> is unknown).
Another example: Bernoulli-type distributions – binomial, negative binomial, geometric distribution, and similar – can only be included in the exponential class if the number of Bernoulli trials, <math> n </math>, is treated as a fixed constant – excluded from the free parameter(s) <math> \theta </math> – since the allowed number of trials sets the limits for the number of "successes" or "failures" that can be observed in a set of trials.
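The Pareto case can be illustrated with a short Python sketch. The function name <code>pareto_pdf</code> and the parameter values are assumptions chosen only for illustration.

```python
def pareto_pdf(x, alpha, x_m):
    """Pareto density: alpha * x_m**alpha / x**(alpha + 1) for x >= x_m, else 0."""
    if x < x_m:
        return 0.0
    return alpha * x_m ** alpha / x ** (alpha + 1)

# The same point x = 1.5 lies inside the support when x_m = 1
# but outside it when x_m = 2, so the support moves with the parameter.
inside = pareto_pdf(1.5, alpha=2.0, x_m=1.0)
outside = pareto_pdf(1.5, alpha=2.0, x_m=2.0)
print(inside, outside)
```

With <math>x_m</math> held fixed, the set of points with positive density no longer changes, which is why the table later in the article lists the Pareto distribution only with known minimum value.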
Vector valued x and θ
Often <math> x </math> is a vector of measurements, in which case <math> T(x) </math> may be a function from the space of possible values of <math> x </math> to the real numbers.
More generally, <math> \eta(\theta) </math> and <math> T(x) </math> can each be vector-valued such that <math> \eta(\theta) \cdot T(x) </math> is real-valued. However, see the discussion below on vector parameters, regarding the exponential family.
Canonical formulation
If <math> \eta(\theta) = \theta \ ,</math> then the exponential family is said to be in canonical form. By defining a transformed parameter <math> \eta = \eta(\theta)\ ,</math> it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since <math> \eta(\theta) </math> can be multiplied by any nonzero constant, provided that <math> T(x) </math> is multiplied by that constant's reciprocal, or a constant <math> c </math> can be added to <math> \eta(\theta) </math> and <math> h(x) </math> multiplied by <math> \exp\left[{-c} \cdot T(x)\,\right] </math> to offset it. In the special case that <math> \eta(\theta) = \theta </math> and <math> T(x) = x \ ,</math> the family is called a natural exponential family.
Even when <math> x </math> is a scalar, and there is only a single parameter, the functions <math> \eta(\theta) </math> and <math> T(x) </math> can still be vectors, as described below.
The function <math> A(\theta)\ ,</math> or equivalently <math> g(\theta)\ ,</math> is automatically determined once the other functions have been chosen, since it must assume a form that causes the distribution to be normalized (sum or integrate to one over the entire domain). Furthermore, both of these functions can always be written as functions of <math> \eta\ ,</math> even when <math> \eta(\theta) </math> is not a one-to-one function, i.e. two or more different values of <math> \theta </math> map to the same value of <math> \eta(\theta)\ ,</math> and hence <math> \eta(\theta) </math> cannot be inverted. In such a case, all values of <math> \theta </math> mapping to the same <math> \eta(\theta) </math> will also have the same value for <math> A(\theta) </math> and <math> g(\theta) ~.</math>
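Because <math>A</math> is fixed by normalization, it can be recovered numerically for a discrete family in canonical form by summing <math>h(x)\exp[\eta\,T(x)]</math> over the support and taking the logarithm. The sketch below (the helper <code>log_partition</code> is a hypothetical name) does this for the Bernoulli and binomial families and compares the result with the closed forms <math>\log(1+e^\eta)</math> and <math>n\log(1+e^\eta)</math>.

```python
import math

def log_partition(eta, support, h, T):
    """A(eta) = log of the sum over the support of h(x) * exp(eta * T(x))."""
    return math.log(sum(h(x) * math.exp(eta * T(x)) for x in support))

eta = 0.7

# Bernoulli: h(x) = 1 and T(x) = x on the support {0, 1}.
a_bernoulli = log_partition(eta, [0, 1], lambda x: 1.0, lambda x: x)

# Binomial with n = 5 trials: h(x) = C(5, x) and T(x) = x on {0, ..., 5}.
a_binomial = log_partition(eta, range(6), lambda x: math.comb(5, x), lambda x: x)

print(a_bernoulli, math.log(1.0 + math.exp(eta)))
print(a_binomial, 5 * math.log(1.0 + math.exp(eta)))
```

Each printed pair should agree up to rounding, showing that no freedom remains in <math>A</math> once <math>h</math>, <math>T</math> and <math>\eta</math> are chosen.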
Factorization of the variables involved
What is important to note, and what characterizes all exponential family variants, is that the parameter(s) and the observation variable(s) must factorize (can be separated into products each of which involves only one type of variable), either directly or within either part (the base or exponent) of an exponentiation operation. Generally, this means that all of the factors constituting the density or mass function must be of one of the following forms:
<math display="block"> \begin{align}
f(x) , && c^{f(x)} , && {[f(x)]}^c , && {[f(x)]}^{g(\theta)} , && {[f(x)]}^{h(x)g(\theta)} , \\
g(\theta) , && c^{g(\theta)} , && {[g(\theta)]}^c , && {[g(\theta)]}^{f(x)} , && ~~\mathsf{ or }~~ {[g(\theta)]}^{h(x)j(\theta)} ,
\end{align} </math>
where <math>f</math> and <math>h</math> are arbitrary functions of <math>x</math>, the observed statistical variable; <math>g</math> and <math>j</math> are arbitrary functions of <math> \theta,</math> the fixed parameters defining the shape of the distribution; and <math>c</math> is any arbitrary constant expression (i.e. a number or an expression that does not change with either <math>x</math> or <math> \theta </math>).
There are further restrictions on how many such factors can occur. For example, the two expressions:
<math display="block">{[f(x) g(\theta)]}^{h(x)j(\theta)}, \qquad {[f(x)]}^{h(x)j(\theta)} {[g(\theta)]}^{h(x)j(\theta)},</math>
are the same, i.e. a product of two "allowed" factors. However, when rewritten into the factorized form,
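<math display="block">\begin{align}
{\left[f(x) g(\theta)\right]}^{h(x) j(\theta)} &= {\left[f(x)\right]}^{h(x) j(\theta)} {\left[g(\theta)\right]}^{h(x) j(\theta)} \\[4pt]
&= \exp\left\{ \left[h(x) \log f(x)\right] j(\theta) + h(x) \left[j(\theta) \log g(\theta)\right] \right\},
\end{align}</math>
the exponent is a sum of two terms, each a product of a function of <math>x</math> and a function of <math>\theta</math>, so the expression does not in general fit the scalar-parameter form above, which allows only the single product <math>\eta(\theta) \cdot T(x)</math>.
Normal distribution: unknown mean, known variance
As a first example, consider a normal distribution with unknown mean <math>\mu</math> and known variance <math>\sigma^2</math>. The probability density function is then
<math display="block">f_\sigma(x;\mu) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-(x-\mu)^2/2\sigma^2}.</math>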
This is a single-parameter exponential family, as can be seen by setting
<math display="block">\begin{align}
T_\sigma(x) &= \frac x \sigma, &
h_\sigma(x) &= \frac{1}{\sqrt{2\pi\sigma^2}} e^{-x^2/2\sigma^2}, \\[4pt]
A_\sigma(\mu) &= \frac{\mu^2}{2\sigma^2}, &
\eta_\sigma(\mu) &= \frac \mu \sigma.
\end{align}</math>
If <math>\sigma = 1</math> this is in canonical form, as then <math>\eta(\mu) = \mu</math>.
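The factorization can be checked numerically. The following Python sketch (function names are illustrative assumptions) evaluates the normal density directly and as <math>h_\sigma(x)\exp[\eta_\sigma(\mu)\,T_\sigma(x) - A_\sigma(\mu)]</math>, using the choices of <math>T_\sigma</math>, <math>h_\sigma</math>, <math>A_\sigma</math> and <math>\eta_\sigma</math> given above.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def normal_pdf_expfam(x, mu, sigma):
    """Same density as h(x) * exp(eta * T(x) - A), with sigma treated as known."""
    h = math.exp(-x ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
    T = x / sigma                    # sufficient statistic
    eta = mu / sigma                 # natural parameter
    A = mu ** 2 / (2 * sigma ** 2)   # log-partition
    return h * math.exp(eta * T - A)

for x in (-1.0, 0.0, 0.5, 2.0):
    print(x, normal_pdf(x, 1.2, 0.8), normal_pdf_expfam(x, 1.2, 0.8))
```

Expanding the square <math>(x-\mu)^2 = x^2 - 2\mu x + \mu^2</math> shows why this works: the <math>x^2</math> term goes into <math>h_\sigma</math>, the cross term into <math>\eta_\sigma T_\sigma</math>, and the <math>\mu^2</math> term into <math>A_\sigma</math>.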
Normal distribution: unknown mean and unknown variance
Next, consider the case of a normal distribution with unknown mean and unknown variance. The probability density function is then
<math display="block">f(y;\mu,\sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-(y-\mu)^2/2 \sigma^2}.</math>
This is an exponential family which can be written in canonical form by defining
<math display="block">\begin{align}
h(y) &= \frac{1}{\sqrt{2 \pi}}, &
\boldsymbol{\eta} &= \left[\frac{\mu}{\sigma^2}, ~-\frac{1}{2\sigma^2}\right], \\
T(y) &= \left( y, y^2 \right)^\mathsf{T}, &
A(\boldsymbol{\eta}) &= \frac{\mu^2}{2 \sigma^2} + \log |\sigma| = -\frac{\eta_1^2}{4\eta_2} + \frac{1}{2}\log\left|\frac{1}{2\eta_2} \right|
\end{align}</math>
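A small Python sketch of this parameterization follows. The function names are hypothetical; the formulas are the ones just given, with <math>\eta_1 = \mu/\sigma^2</math> and <math>\eta_2 = -1/(2\sigma^2)</math>. It converts between <math>(\mu, \sigma^2)</math> and <math>\boldsymbol\eta</math> and evaluates the density as <math>h(y)\exp[\boldsymbol\eta \cdot T(y) - A(\boldsymbol\eta)]</math>.

```python
import math

def to_natural(mu, sigma2):
    """Map (mu, sigma^2) to the natural parameters (eta1, eta2)."""
    return (mu / sigma2, -1.0 / (2.0 * sigma2))

def from_natural(eta1, eta2):
    """Inverse map: recover (mu, sigma^2) from (eta1, eta2); requires eta2 < 0."""
    return (-eta1 / (2.0 * eta2), -1.0 / (2.0 * eta2))

def log_partition(eta1, eta2):
    """A(eta) = -eta1^2 / (4 eta2) + (1/2) log|1 / (2 eta2)|."""
    return -eta1 ** 2 / (4.0 * eta2) + 0.5 * math.log(abs(1.0 / (2.0 * eta2)))

def normal_pdf_natural(y, eta1, eta2):
    """Density as h(y) * exp(eta . T(y) - A(eta)) with T(y) = (y, y^2)."""
    h = 1.0 / math.sqrt(2.0 * math.pi)
    return h * math.exp(eta1 * y + eta2 * y ** 2 - log_partition(eta1, eta2))

def normal_pdf(y, mu, sigma2):
    """Direct evaluation of the normal density, for comparison."""
    return math.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

eta1, eta2 = to_natural(1.0, 4.0)
print(from_natural(eta1, eta2))
print(normal_pdf_natural(0.5, eta1, eta2), normal_pdf(0.5, 1.0, 4.0))
```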
Binomial distribution
As an example of a discrete exponential family, consider the binomial distribution with known number of trials <math>n</math>. The probability mass function for this distribution is
<math display="block">f(x) = \binom{n}{x} p^x {\left(1 - p\right)}^{n-x} , \quad x \in \{0, 1, 2, \ldots, n\}.</math>
This can equivalently be written as
<math display="block">f(x) = \binom{n}{x} \exp\left[x \log\left(\frac{p}{1-p}\right) + n \log(1-p)\right],</math>
which shows that the binomial distribution is an exponential family, whose natural parameter is
<math display="block">\eta = \log\frac{p}{1-p}.</math>
This function of p is known as logit.
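The rewriting above can be verified with a short Python sketch (function names are illustrative assumptions): the mass function is computed once in its usual form and once as <math>\tbinom{n}{x}\exp[x\,\operatorname{logit}(p) + n\log(1-p)]</math>.

```python
import math

def logit(p):
    """Natural parameter of the binomial family: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def binomial_pmf(x, n, p):
    """Usual binomial mass function."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)

def binomial_pmf_expfam(x, n, p):
    """Same mass function with the dependence on p moved into the exponent."""
    return math.comb(n, x) * math.exp(x * logit(p) + n * math.log(1.0 - p))

for x in range(5):
    print(x, binomial_pmf(x, 4, 0.3), binomial_pmf_expfam(x, 4, 0.3))
```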
Table of distributions
The following table shows how to rewrite a number of common distributions as exponential-family distributions with natural parameters. Refer to the flashcards for main exponential families.
For a scalar variable and scalar parameter, the form is as follows:
<math display="block"> f_X(x \mid \theta) = h(x) \exp\left[\eta({\theta}) T(x) - A(\eta)\right] </math>
For a scalar variable and vector parameter:
<math display="block"> \begin{align}
f_X(x\mid\boldsymbol \theta) &= h(x) \,\exp\left[\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(x) - A({\boldsymbol \eta})\right] \\[4pt]
f_X(x\mid\boldsymbol \theta) &= h(x) \, g(\boldsymbol \theta) \, \exp\left[\boldsymbol\eta(\boldsymbol{\theta}) \cdot \mathbf{T}(x)\right]
\end{align}</math>
For a vector variable and vector parameter:
<math display="block"> f_X(\mathbf{x}\mid\boldsymbol \theta) = h(\mathbf{x}) \, \exp \left[\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(\mathbf{x}) - A({\boldsymbol \eta})\right]</math>
The above formulas choose the functional form of the exponential-family with a log-partition function <math>A({\boldsymbol \eta})</math>. The reason for this is so that the moments of the sufficient statistics can be calculated easily, simply by differentiating this function. Alternative forms involve either parameterizing this function in terms of the normal parameter <math>\boldsymbol\theta</math> instead of the natural parameter, and/or using a factor <math>g(\boldsymbol\eta)</math> outside of the exponential. The relation between the latter and the former is:
<math display="block">\begin{align}
A(\boldsymbol{\eta}) &= -\log g(\boldsymbol{\eta}), \\[2pt]
g(\boldsymbol{\eta}) &= e^{- A(\boldsymbol{\eta})}
\end{align}</math>
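The claim that moments follow from differentiating <math>A(\boldsymbol\eta)</math> can be checked numerically. The following sketch assumes the Bernoulli family in canonical form, with <math>A(\eta) = \log(1+e^\eta)</math>, and uses central finite differences (the helper names and step sizes are illustrative choices) to recover the mean <math>p</math> and the variance <math>p(1-p)</math> of the sufficient statistic <math>T(x) = x</math>.

```python
import math

def derivative(f, x, step=1e-5):
    """Central-difference approximation to the first derivative of f at x."""
    return (f(x + step) - f(x - step)) / (2.0 * step)

def second_derivative(f, x, step=1e-4):
    """Central-difference approximation to the second derivative of f at x."""
    return (f(x + step) - 2.0 * f(x) + f(x - step)) / step ** 2

def bernoulli_log_partition(eta):
    """A(eta) = log(1 + exp(eta)) for the Bernoulli family."""
    return math.log(1.0 + math.exp(eta))

p = 0.3
eta = math.log(p / (1.0 - p))                            # natural parameter (logit of p)
mean = derivative(bernoulli_log_partition, eta)          # approximates p
var = second_derivative(bernoulli_log_partition, eta)    # approximates p * (1 - p)
print(mean, var)
```

The finite differences are only approximations, so the printed values match <math>p</math> and <math>p(1-p)</math> up to a small numerical error.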
To convert between the representations involving the two types of parameter, use the formulas below for writing one type of parameter in terms of the other.
{|class="wikitable"
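! Distribution
! Parameter(s) <math>\boldsymbol\theta</math>
! Natural parameter(s) <math>\boldsymbol\eta</math>
! Inverse parameter mapping
! Base measure <math>h(x)</math>
! Sufficient statistic <math>T(x)</math>
! Log-partition <math>A(\boldsymbol\eta)</math>
! Log-partition <math>A(\boldsymbol\theta)</math>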
|-
| Bernoulli distribution || <math>p</math>
| <math>\log\frac{p}{1-p}</math> <br/> This is the logit function.
| <math>\frac{1}{1+e^{-\eta}} = \frac{e^\eta}{1+e^{\eta}}</math> <br/> This is the logistic function.
| <math> 1 </math>
| <math> x </math>
| <math> \log (1+e^{\eta})</math>
| <math> -\log (1-p)</math>
|-
| binomial distribution<br/>with known number of trials <math>n</math> || <math>p</math>
| <math>\log\frac{p}{1-p}</math>
| <math>\frac{1}{1+e^{-\eta}} = \frac{e^\eta}{1+e^{\eta}}</math>
| <math> \binom{n}{x} </math>
| <math> x </math>
| <math> n \log (1+e^{\eta})</math>
| <math> -n \log (1-p)</math>
|-
| Poisson distribution || <math>\lambda</math>
| <math>\log\lambda</math>
| <math>e^\eta</math>
| <math> \frac{1}{x!} </math>
| <math> x </math>
| <math> e^{\eta}</math>
| <math> \lambda</math>
|-
| negative binomial distribution<br/>with known number of failures <math>r</math> || <math>p</math>
| <math>\log(1-p)</math>
| <math>1-e^\eta</math>
| <math> \binom{x {+} r {-} 1}{x} </math>
| <math> x </math>
| <math> -r \log (1-e^{\eta})</math>
| <math> -r \log (p)</math>
|-
| exponential distribution || <math>\lambda</math>
| <math>-\lambda </math>
| <math>-\eta </math>
| <math> 1 </math>
| <math> x </math>
| <math> -\log(-\eta)</math>
| <math> -\log\lambda</math>
|-
| Pareto distribution<br/>with known minimum value <math>x_m</math> || <math>\alpha</math>
| <math>-\alpha-1</math>
| <math>-1-\eta</math>
| <math> 1 </math>
| <math> \log x </math>
| <math>\begin{align} & - \log (-1-\eta) \\ & + (1+\eta) \log x_{\mathrm m}\end{align}</math>
| <math> - \log \left(\alpha x_{\mathrm m}^\alpha\right)</math>
|-
| Weibull distribution<br/>with known shape || <math>\lambda</math>
| <math>-\frac{1}{\lambda^k}</math>
| <math>(-\eta)^{-1/k}</math>
| <math> x^{k-1} </math>
| <math> x^k </math>
| <math> \log \left(- \frac{1}{\eta k}\right)</math>
| <math> \log \frac{\lambda^k}{k}</math>
|-
| Laplace distribution<br/>with known mean <math>\mu</math> || <math>b</math>
| <math>-\frac{1}{b}</math>
| <math>-\frac{1}{\eta}</math>
| <math> 1 </math>
| <math> |x-\mu| </math>
| <math> \log\left(-\frac{2}{\eta}\right)</math>
| <math> \log 2b</math>
|-
| chi-squared distribution || <math>\nu</math>
| <math>\frac{\nu}{2}-1 </math>
| <math>2(\eta+1) </math>
| <math> e^{-x/2} </math>
| <math> \log x </math>
| <math> \begin{align} & \log \Gamma(\eta+1) \\ & + (\eta+1)\log 2 \end{align} </math>
| <math> \begin{align} & \log \Gamma{\left(\tfrac{\nu}{2}\right)} \\ &+ \tfrac{\nu}{2} \log 2 \end{align} </math>
|-
| normal distribution<br/>known variance || <math>\mu</math>
| <math>\frac{\mu}{\sigma} </math>
| <math>\sigma\eta </math>
| <math> \frac{e^{-x^2/(2\sigma^2)}}{\sqrt{2\pi}\sigma} </math>
| <math> \frac{x}{\sigma} </math>
| <math> \frac{\eta^2}{2}</math>
| <math> \frac{\mu^2}{2\sigma^2}</math>
|-
| continuous Bernoulli distribution || <math>\lambda</math>
| <math>\log\frac{\lambda}{1-\lambda}</math>
| <math>\frac{e^\eta}{1+e^\eta}</math>
| <math> 1 </math>
| <math> x </math>
| <math> \log\frac{e^\eta - 1}{\eta}</math>
| <math> \begin{align} &\log\left(\tfrac{1 - 2\lambda}{1 - \lambda}\right) \\[1ex] {}-{}& \log^2 \left(\tfrac{1}{\lambda} - 1\right) \end{align}</math>
<br/> where <math>\log^2</math> refers to the iterated logarithm
|-
| normal distribution || <math>\mu,\ \sigma^2</math>
| <math>\begin{bmatrix} \dfrac{\mu}{\sigma^2} \\[1ex] -\dfrac{1}{2\sigma^2} \end{bmatrix} </math>
| <math>\begin{bmatrix} -\dfrac{\eta_1}{2\eta_2} \\[1ex] -\dfrac{1}{2\eta_2} \end{bmatrix} </math>
| <math> \frac{1}{\sqrt{2\pi}} </math>
| <math> \begin{bmatrix} x \\ x^2 \end{bmatrix} </math>
| <math> -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)</math>
| <math> \frac{\mu^2}{2\sigma^2} + \log \sigma</math>
|-
| log-normal distribution || <math>\mu,\ \sigma^2</math>
| <math>\begin{bmatrix} \dfrac{\mu}{\sigma^2} \\[1ex] -\dfrac{1}{2\sigma^2} \end{bmatrix} </math>
| <math>\begin{bmatrix} -\dfrac{\eta_1}{2\eta_2} \\[1ex] -\dfrac{1}{2\eta_2} \end{bmatrix} </math>
| <math> \frac{1}{\sqrt{2\pi}x} </math>
| <math> \begin{bmatrix} \log x \\ (\log x)^2 \end{bmatrix} </math>
| <math> -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2} \log(-2\eta_2)</math>
| <math> \frac{\mu^2}{2\sigma^2} + \log \sigma</math>
|-
| inverse Gaussian distribution || <math>\mu,\ \lambda</math>
| <math>\begin{bmatrix} -\dfrac{\lambda}{2\mu^2} \\[15pt] -\dfrac{\lambda}{2} \end{bmatrix} </math>
| <math>\begin{bmatrix} \sqrt{\dfrac{\eta_2}{\eta_1}} \\[15pt] -2\eta_2 \end{bmatrix} </math>
| <math> \frac{1}{\sqrt{2\pi}\,x^{3/2}} </math>
| <math> \begin{bmatrix} x \\[5pt] \dfrac{1}{x} \end{bmatrix} </math>
| <math> -2\sqrt{\eta_1 \eta_2} -\tfrac{1}{2} \log(-2 \eta_2)</math>
| <math> - \tfrac{\lambda}{\mu} - \tfrac{1}{2} \log\lambda </math>
|-
| rowspan=2|gamma distribution || <math>\alpha,\ \beta</math>
| <math>\begin{bmatrix} \alpha-1 \\ -\beta \end{bmatrix} </math>
| <math>\begin{bmatrix} \eta_1+1 \\ -\eta_2 \end{bmatrix} </math>
| rowspan=2|<math> 1 </math>
| rowspan=2|<math> \begin{bmatrix} \log x \\ x \end{bmatrix} </math>
| rowspan=2|<math> \begin{align} &\log \Gamma(\eta_1+1) \\ {}-{}& (\eta_1+1)\log(-\eta_2) \end{align}</math>
| <math> \log \frac{\Gamma(\alpha)}{\beta^\alpha}</math>
|-
| <math>k,\ \theta</math>
| <math>\begin{bmatrix} k-1 \\[5pt] -\dfrac{1}{\theta} \end{bmatrix} </math>
| <math>\begin{bmatrix} \eta_1+1 \\[5pt] -\dfrac{1}{\eta_2} \end{bmatrix} </math>
| <math> \log \left(\theta^k\Gamma(k)\right)</math>
|-
| inverse gamma distribution || <math>\alpha,\ \beta</math>
| <math>\begin{bmatrix} -\alpha-1 \\ -\beta \end{bmatrix} </math>
| <math>\begin{bmatrix} -\eta_1-1 \\ -\eta_2 \end{bmatrix} </math>
| <math> 1 </math>
| <math> \begin{bmatrix} \log x \\ \frac{1}{x} \end{bmatrix} </math>
| <math> \begin{align} &\log \Gamma(-\eta_1-1) \\ + & \left(\eta_1 + 1\right) \log(-\eta_2) \end{align}</math>
| <math> \log \frac{\Gamma(\alpha)}{\beta^\alpha}</math>
|-
| generalized inverse Gaussian distribution || <math>p,\ a,\ b</math>
| <math>\begin{bmatrix} p-1 \\ -a/2 \\ -b/2 \end{bmatrix} </math>
| <math>\begin{bmatrix} \eta_1+1 \\ -2\eta_2\\ -2\eta_3 \end{bmatrix} </math>
| <math> 1 </math>
| <math> \begin{bmatrix} \log x \\ x \\ \frac{1}{x} \end{bmatrix} </math>
| <math> \begin{align}
& \log 2 K_{\eta_1+1}{\!\left(\sqrt{4\eta_2\eta_3}\right)} \\[2pt]
{}-{}&\frac{\eta_1+1}{2} \log\frac{\eta_2}{\eta_3}
\end{align}</math>
| <math> \begin{align} & \log 2 K_{p}(\sqrt{ab}) \\[2pt] &{}- \frac{p}{2} \log\frac{a}{b} \end{align}</math>
|-
| scaled inverse chi-squared distribution || <math>\nu,\ \sigma^2</math>
| <math>\begin{bmatrix} -\dfrac{\nu}{2}-1 \\[10pt] -\dfrac{\nu\sigma^2}{2} \end{bmatrix} </math>
| <math>\begin{bmatrix} -2(\eta_1+1) \\[10pt] \dfrac{\eta_2}{\eta_1+1} \end{bmatrix} </math>
| <math> 1 </math>
| <math>\begin{bmatrix} \log x \\ \frac{1}{x} \end{bmatrix} </math>
| <math> \begin{align}
& \log \Gamma(-\eta_1 - 1) \\[2pt]
+ & \left(\eta_1 + 1\right) \log(-\eta_2)
\end{align}</math>
| <math> \begin{align}
& \log \Gamma{\left(\frac{\nu}{2}\right)} \\[2pt]
{}-{} & \frac{\nu}{2} \log \frac{\nu \sigma^2}{2}
\end{align}</math>
|-
| beta distribution<br/><br/>(variant 1) || <math>\alpha,\ \beta</math>
| <math>\begin{bmatrix} \alpha \\ \beta \end{bmatrix} </math>
| <math>\begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix} </math>
| <math> \frac{1}{x(1-x)} </math>
| <math> \begin{bmatrix} \log x \\ \log (1{-}x) \end{bmatrix} </math>
| <math> \log \frac{\Gamma(\eta_1) \, \Gamma(\eta_2)}{\Gamma(\eta_1 + \eta_2)}</math>
| <math> \log \frac{\Gamma(\alpha) \, \Gamma(\beta)}{\Gamma(\alpha + \beta)}</math>
|-
| beta distribution<br/><br/>(variant 2) || <math>\alpha,\ \beta</math>
| <math>\begin{bmatrix} \alpha - 1 \\ \beta - 1 \end{bmatrix} </math>
| <math>\begin{bmatrix} \eta_1 + 1 \\ \eta_2 + 1 \end{bmatrix} </math>
| <math> 1 </math>
| <math> \begin{bmatrix} \log x \\ \log (1{-}x) \end{bmatrix} </math>
| <math>\log \frac{\Gamma(\eta_1 + 1) \, \Gamma(\eta_2 + 1)}{\Gamma(\eta_1 + \eta_2 + 2)}</math>
| <math> \log \frac{\Gamma(\alpha) \, \Gamma(\beta)}{\Gamma(\alpha + \beta)}</math>
|-
| multivariate normal distribution || <math>\boldsymbol\mu,\ \boldsymbol\Sigma</math>
| <math>\begin{bmatrix} \boldsymbol\Sigma^{-1}\boldsymbol\mu \\[5pt] -\frac12\boldsymbol\Sigma^{-1} \end{bmatrix}</math>
| <math>\begin{bmatrix} -\frac12\boldsymbol\eta_2^{-1}\boldsymbol\eta_1 \\[5pt] -\frac12\boldsymbol\eta_2^{-1} \end{bmatrix}</math>
| <math>(2\pi)^{-\frac{k}{2}}</math>
| <math>\begin{bmatrix} \mathbf{x} \\[5pt] \mathbf{x}\mathbf{x}^{\mathsf T} \end{bmatrix}</math>
| <math> \begin{align}
&-\tfrac{1}{4} \boldsymbol{\eta}_1^{\mathsf T} \boldsymbol{\eta}_2^{-1} \boldsymbol{\eta}_1 \\
&- \tfrac{1}{2} \log \left|-2\boldsymbol\eta_2\right|
\end{align}</math>
| <math> \begin{align}
& \tfrac{1}{2} \boldsymbol{\mu}^\mathsf{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} \\
+ & \tfrac{1}{2} \log \left|\boldsymbol{\Sigma}\right|
\end{align}</math>
|-
| categorical distribution<br/><br/>(variant 1) || <math>p_1,\ \ldots,\,p_k</math><br/><br/>where <math display="inline">\sum\limits_{i=1}^k p_i=1</math>
| <math>\begin{bmatrix} \log p_1 \\ \vdots \\ \log p_k \end{bmatrix}</math>
| <math>\begin{bmatrix} e^{\eta_1} \\ \vdots \\ e^{\eta_k} \end{bmatrix}</math> <br/><br/>where <math display="inline">\sum\limits_{i=1}^k e^{\eta_i}=1</math>
| <math> 1 </math>
| <math>\begin{bmatrix} [x=1] \\ \vdots \\ {[x=k]} \end{bmatrix} </math><br/>where <math>[x=i]</math> is the Iverson bracket
| <math> 0</math>
| <math> 0</math>
|-
| categorical distribution<br/><br/>(variant 2) || <math>p_1,\ \ldots,\,p_k</math><br/><br/>where <math display="inline">\sum\limits_{i=1}^k p_i=1</math>
| <math>\begin{bmatrix} \log p_1+C \\ \vdots \\ \log p_k+C \end{bmatrix}</math>
| <math>\frac{1}{C} \begin{bmatrix} e^{\eta_1} \\ \vdots \\ e^{\eta_k} \end{bmatrix}</math><br/><br/>where <math display="inline">C = \sum\limits_{i=1}^k e^{\eta_i}</math>
| <math> 1 </math>
| <math>\begin{bmatrix} [x=1] \\ \vdots \\ {[x=k]} \end{bmatrix} </math><br/>where <math>[x=i]</math> is the Iverson bracket
| <math> 0</math>
| <math> 0</math>
|-
| categorical distribution<br/><br/>(variant 3) || <math>p_1,\ \ldots,\,p_k</math><br/><br/>where <math display="inline">p_k = 1 - \sum\limits_{i=1}^{k-1} p_i</math>
| <math>\begin{bmatrix} \log \dfrac{p_1}{p_k} \\[10pt] \vdots \\[5pt] \log \dfrac{p_{k-1}}{p_k} \\[15pt] 0 \end{bmatrix}</math>
This is the inverse softmax function, a generalization of the logit function.
| <math>\frac{1}{C_1} \begin{bmatrix} e^{\eta_1} \\[5pt] \vdots \\[5pt] e^{\eta_k} \end{bmatrix} =</math>
<br/>
<math>\frac{1}{C_2} \begin{bmatrix} e^{\eta_1} \\[5pt] \vdots \\[5pt] e^{\eta_{k-1}} \\[5pt] 1 \end{bmatrix}</math>
<br/>
where
<math display="inline">C_1 = \sum\limits_{i=1}^k e^{\eta_i}</math>
and
<math display="inline">C_2 = 1 + \sum\limits_{i=1}^{k-1} e^{\eta_i}</math>.
This is the softmax function, a generalization of the logistic function.
| <math> 1 </math>
| <math>\begin{bmatrix} [x=1] \\ \vdots \\ {[x=k]} \end{bmatrix} </math><br/>where <math>[x=i]</math> is the Iverson bracket
| <math> \begin{align}
& \textstyle \log \left(\sum\limits_{i=1}^{k} e^{\eta_i}\right) \\
={}& \textstyle \log \left(1 + \sum\limits_{i=1}^{k-1} e^{\eta_i}\right)
\end{align}
</math>
| <math> -\log p_k </math>
|-
| multinomial distribution<br/>(variant 1)<br/>with known number of trials <math>n</math> || <math>p_1,\ \ldots,\,p_k</math><br/><br/>where <math display="inline">\sum\limits_{i=1}^k p_i=1</math>
| <math>\begin{bmatrix} \log p_1 \\ \vdots \\ \log p_k \end{bmatrix}</math>
| <math>\begin{bmatrix} e^{\eta_1} \\ \vdots \\ e^{\eta_k} \end{bmatrix}</math><br/><br/>where <math display="inline">\sum\limits_{i=1}^k e^{\eta_i}=1</math>
| <math> \frac{n!}{\prod\limits_{i=1}^k x_i!} </math>
| <math>\begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix} </math>
| <math> 0</math>
| <math> 0</math>
|-
| multinomial distribution<br/>(variant 2)<br/>with known number of trials <math>n</math> || <math>p_1,\ \ldots,\,p_k</math><br/><br/>where <math display="inline">\sum\limits_{i=1}^k p_i=1</math>
| <math>\begin{bmatrix} \log p_1+C \\ \vdots \\ \log p_k+C \end{bmatrix}</math>
| <math>\frac{1}{C} \begin{bmatrix} e^{\eta_1} \\ \vdots \\ e^{\eta_k} \end{bmatrix}</math><br/>
where <math display="inline">C = \sum\limits_{i=1}^k e^{\eta_i}</math>
| <math> \frac{n!}{\prod\limits_{i=1}^k x_i!} </math>
| <math>\begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix} </math>
| <math> 0</math>
| <math> 0</math>
|-
| multinomial distribution<br/>(variant 3)<br/>with known number of trials <math>n</math>
| <math>p_1,\ \ldots,\,p_k</math><br/><br/>where <math display="inline">p_k = 1 - \sum\limits_{i=1}^{k-1} p_i</math>
| <math>\begin{bmatrix} \log \dfrac{p_1}{p_k} \\[10pt] \vdots \\[5pt] \log \dfrac{p_{k-1}}{p_k} \\[15pt] 0 \end{bmatrix}</math>
| <math>\frac{1}{C_1} \begin{bmatrix} e^{\eta_1} \\[10pt] \vdots \\[5pt] e^{\eta_k} \end{bmatrix} =</math><br/>
<math>\frac{1}{C_2} \begin{bmatrix} e^{\eta_1} \\[5pt] \vdots \\[5pt] e^{\eta_{k-1}} \\[5pt] 1 \end{bmatrix}</math>
where <math display="inline"> C_1 = \sum\limits_{i=1}^k e^{\eta_i}</math> and <math display="inline"> C_2 = 1 + \sum\limits_{i=1}^{k- 1} e^{\eta_i}</math>
| <math> \frac{n!}{\prod\limits_{i=1}^k x_i!} </math>
| <math>\begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix} </math>
| <math> \begin{align}
& \textstyle n \log \left( \sum\limits_{i=1}^k e^{\eta_i}\right) \\[4pt]
={}& \textstyle n \log \left(1 + \sum\limits_{i=1}^{k-1} e^{\eta_i}\right)
\end{align}</math>
| <math> - n \log p_k </math>
|-
| Dirichlet distribution<br/>(variant 1) || <math>\alpha_1,\ \ldots,\,\alpha_k</math>
| <math>\begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_k \end{bmatrix}</math>
| <math>\begin{bmatrix} \eta_1 \\ \vdots \\ \eta_k \end{bmatrix}</math>
| <math> \frac{1}{\prod\limits_{i=1}^k x_i} </math>
| <math> \begin{bmatrix} \log x_1 \\ \vdots \\ \log x_k \end{bmatrix} </math>
| <math> \begin{align} \textstyle \sum\limits_{i=1}^k \log \Gamma(\eta_i) \\ \textstyle - \log \Gamma{\left(\sum\limits_{i=1}^k \eta_i \right)} \end{align} </math>
| <math> \begin{align}
&\textstyle\sum\limits_{i=1}^k \log \Gamma(\alpha_i) \\
{}-{}& \textstyle \log \Gamma{\left(\sum\limits_{i=1}^k\alpha_i\right)}
\end{align} </math>
|-
| Dirichlet distribution<br/>(variant 2) || <math>\alpha_1,\ \ldots,\,\alpha_k</math>
| <math>\begin{bmatrix} \alpha_1 - 1 \\ \vdots \\ \alpha_k - 1 \end{bmatrix}</math>
| <math>\begin{bmatrix} \eta_1 + 1 \\ \vdots \\ \eta_k + 1 \end{bmatrix}</math>
| <math> 1 </math>
| <math> \begin{bmatrix} \log x_1 \\ \vdots \\ \log x_k \end{bmatrix} </math>
| <math> \begin{align}
& \textstyle \sum\limits_{i=1}^k \log \Gamma(\eta_i + 1) \\
{}-{}& \textstyle \log \Gamma{\left(\sum\limits_{i=1}^k (\eta_i + 1) \right)}
\end{align} </math>
| <math> \begin{align}
& \textstyle \sum\limits_{i=1}^k \log \Gamma(\alpha_i) \\
{}-{}& \textstyle \log \Gamma{\left(\sum\limits_{i=1}^k\alpha_i\right)}
\end{align} </math>
|-
| rowspan=2|Wishart distribution || <math>\mathbf V,\ n</math>
| <math>\begin{bmatrix} -\frac{1}{2} \mathbf{V}^{-1} \\[5pt] \dfrac{n{-}p{-}1}{2} \end{bmatrix}</math>
| <math>\begin{bmatrix} -\frac{1}{2} \boldsymbol{\eta}_1^{-1} \\[5pt] 2\eta_2{+}p{+}1 \end{bmatrix}</math>
| <math> 1 </math>
| <math> \begin{bmatrix} \mathbf{X} \\ \log|\mathbf{X}| \end{bmatrix} </math>
| rowspan=2|<math>\begin{align}
& -\left[\eta_2 + \tfrac{p+1}{2}\right] \log\left|-\boldsymbol\eta_1\right| \\
& + \log\Gamma_p{\left(\eta_2 + \tfrac{p+1}{2}\right)} \\[1ex]
=& - \tfrac{n}{2} \log\left|-\boldsymbol\eta_1\right| \\
& + \log\Gamma_p{\left(\tfrac{n}{2}\right)} \\[1ex]
={}& \left[\eta_2 + \tfrac{p+1}{2}\right] \log\left(2^{p} \left|\mathbf{V}\right|\right) \\
& + \log\Gamma_p{\left(\eta_2 + \tfrac{p+1}{2}\right)}
\end{align}</math>
<br/>
Three variants with different parameterizations are given, to facilitate computing moments of the sufficient statistics.
|rowspan=2|<math> \begin{align}
& \frac{n}{2} \log\left(2^p \left|\mathbf{V}\right|\right) \\[2pt]
& + \log\Gamma_p{\left(\frac{n}{2}\right)}
\end{align}</math>
|-
| colspan=5|Note: Uses the fact that <math>\operatorname{tr}(\mathbf{A}^{\mathsf T}\mathbf{B}) = \operatorname{vec}(\mathbf{A}) \cdot \operatorname{vec}(\mathbf{B}),</math> i.e. the trace of a matrix product is much like a dot product. The matrix parameters are assumed to be vectorized (laid out in a vector) when inserted into the exponential form. Also, <math>\mathbf{V}</math> and <math>\mathbf{X}</math> are symmetric, so e.g. <math>\mathbf{V}^{\mathsf T} = \mathbf{V}\ .</math>
|-
| inverse Wishart distribution || <math>\mathbf \Psi,\,m</math>
| <math>- \frac{1}{2} \begin{bmatrix} \boldsymbol\Psi \\[5pt] m{+}p{+}1 \end{bmatrix}</math>
| <math>-\begin{bmatrix} 2\boldsymbol\eta_1 \\[5pt] 2\eta_2{+}p{+}1 \end{bmatrix}</math>
| <math> 1 </math>
| <math> \begin{bmatrix} \mathbf{X}^{-1} \\ \log|\mathbf{X}| \end{bmatrix} </math>
| <math>\begin{align}
& \left[\eta_2 + \tfrac{p + 1}{2}\right] \log\left|-\boldsymbol\eta_1\right| \\
& + \log \Gamma_p{\left(-\eta_2 - \tfrac{p + 1}{2}\right)}
\\[1ex]
=& -\tfrac{m}{2} \log \left|-\boldsymbol\eta_1\right| \\
& + \log \Gamma_p{\left(\tfrac{m}{2}\right)}
\\[1ex]
=& -\left[\eta_2 + \tfrac{p + 1}{2}\right] \log \tfrac{2^p}{\left|\boldsymbol{\Psi} \right|} \\
& + \log\Gamma_p{\left(-\eta_2 - \tfrac{p + 1}{2}\right)}
\end{align}</math>
|<math>\begin{align}
\frac{m}{2} \log \frac{2^p}{|\boldsymbol\Psi|} \\[4pt]
+ \log \Gamma_p{\left(\frac{m}{2}\right)}
\end{align}</math>
|-
| normal-gamma distribution || <math>\alpha,\ \beta,\ \mu,\ \lambda</math>
| <math>\begin{bmatrix} \alpha-\frac12 \\ -\beta-\dfrac{\lambda\mu^2}{2} \\ \lambda\mu \\ -\dfrac{\lambda}{2}\end{bmatrix} </math>
| <math>\begin{bmatrix} \eta_1+\frac12 \\ -\eta_2 + \dfrac{\eta_3^2}{4\eta_4} \\ -\dfrac{\eta_3}{2\eta_4} \\ -2\eta_4 \end{bmatrix} </math>
| <math> \dfrac{1}{\sqrt{2\pi}} </math>
| <math> \begin{bmatrix} \log \tau \\ \tau \\ \tau x \\ \tau x^2 \end{bmatrix} </math>
| <math> \begin{align}
&\log \Gamma{\left(\eta_1 + \tfrac{1}{2}\right)} \\[2pt]
-{}& \tfrac{1}{2} \log \left(-2\eta_4\right) \\[2pt]
-{}& \left(\eta_1 + \tfrac{1}{2}\right) \log\left(\tfrac{\eta_3^2}{4\eta_4} - \eta_2\right)
\end{align} </math>
| <math> \begin{align}
&\log \Gamma{\left(\alpha\right)} \\[2pt]
&- \alpha \log \beta \\[2pt]
&- \tfrac{1}{2}\log\lambda
\end{align}</math>
|}
The three variants of the categorical distribution and multinomial distribution are due to the fact that the parameters <math>p_i</math> are constrained, such that
<math display="block">\sum_{i=1}^k p_i = 1 \, .</math>
Thus, there are only <math>k-1</math> independent parameters.
- Variant 1 uses <math>k</math> natural parameters with a simple relation between the standard and natural parameters; however, only <math>k-1</math> of the natural parameters are independent, and the set of <math>k</math> natural parameters is nonidentifiable. The constraint on the usual parameters translates to a similar constraint on the natural parameters.
- Variant 2 demonstrates the fact that the entire set of natural parameters is nonidentifiable: Adding any constant value to the natural parameters has no effect on the resulting distribution. However, by using the constraint on the natural parameters, the formula for the normal parameters in terms of the natural parameters can be written in a way that is independent of the constant that is added.
- Variant 3 shows how to make the parameters identifiable in a convenient way by setting <math>C = -\log p_k\ .</math> This effectively "pivots" around <math>p_k</math> and causes the last natural parameter to have the constant value of 0. All the remaining formulas are written in a way that does not access <math>p_k </math>, so that effectively the model has only <math>k-1</math> parameters, both of the usual and natural kind.
Variants 1 and 2 are not actually standard exponential families at all. Rather they are curved exponential families, i.e. there are <math>k-1</math> independent parameters embedded in a <math>k</math>-dimensional parameter space. Many of the standard results for exponential families do not apply to curved exponential families. An example is the log-partition function <math>A(\boldsymbol\eta)</math>, which has the value of 0 in the curved cases. In standard exponential families, the derivatives of this function correspond to the moments (more technically, the cumulants) of the sufficient statistics, e.g. the mean and variance. However, a value of 0 suggests that the mean and variance of all the sufficient statistics are uniformly 0, whereas in fact the mean of the <math>i</math>th sufficient statistic should be <math>p_i</math>. (This does emerge correctly when using the form of <math>A(\boldsymbol\eta)</math> shown in variant 3.)
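The nonidentifiability described above can be checked numerically. The following is a minimal sketch, assuming the usual mapping <math>p_i = e^{\eta_i} / \sum_j e^{\eta_j}</math> from natural parameters to probabilities; the function names `softmax` and `pivot_natural_params` are illustrative and not taken from the article.

```python
import math

def softmax(eta):
    """Map natural parameters to probabilities p_i = exp(eta_i) / sum_j exp(eta_j)."""
    m = max(eta)  # subtract the maximum for numerical stability
    weights = [math.exp(e - m) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]

def pivot_natural_params(p):
    """Variant-3 style parameters eta_i = log(p_i / p_k); the last entry is 0."""
    return [math.log(pi / p[-1]) for pi in p]

p = [0.2, 0.3, 0.5]
eta = pivot_natural_params(p)

# Adding an arbitrary constant to every natural parameter (as in variant 2)
# leaves the resulting distribution unchanged.
shifted = [e + 7.0 for e in eta]

p_from_eta = softmax(eta)
p_from_shifted = softmax(shifted)
```

Both `p_from_eta` and `p_from_shifted` recover the original probabilities up to rounding, while pinning the last natural parameter at 0 removes the redundant degree of freedom.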
Moments and cumulants of the sufficient statistic
Normalization of the distribution
We start with the normalization of the probability distribution. In general, any non-negative function f(x) that serves as the kernel of a probability distribution (the part encoding all dependence on x) can be made into a proper distribution by normalizing: i.e.
<math display="block">p(x) = \frac{1}{Z} f(x)</math>
where
<math display="block">Z = \int_x f(x) \,dx.</math>
The factor <math>Z</math> is sometimes termed the normalizer or partition function, based on an analogy to statistical physics.
In the case of an exponential family where
<math display="block">p(x; \boldsymbol\eta) = g(\boldsymbol\eta) h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)},</math>
the kernel is
<math display="block">K(x) = h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)}</math>
and the partition function is
<math display="block">Z = \int_x h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)} \,dx.</math>
Since the distribution must be normalized, we have
<math display="block">\begin{align}
1 &= \int_x g(\boldsymbol\eta) h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)}\, dx \\
&= g(\boldsymbol\eta) \int_x h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)} \,dx \\[1ex]
&= g(\boldsymbol\eta) Z.
\end{align}</math>
In other words,
<math display="block">g(\boldsymbol\eta) = \frac{1}{Z}</math>
or equivalently
<math display="block">A(\boldsymbol\eta) = - \log g(\boldsymbol\eta) = \log Z.</math>
This justifies calling <math>A</math> the log-normalizer or log-partition function.
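As an illustration of the relation between the log-normalizer and <math>\log Z</math>, the sketch below evaluates the partition function by direct summation for the Poisson family, assuming the standard form <math>h(x) = 1/x!</math>, <math>T(x) = x</math>, <math>\eta = \log\lambda</math>, for which the closed form is <math>A(\eta) = e^\eta</math>. The function name and the truncation at 200 terms are choices made for this example.

```python
import math

def log_partition_by_summation(eta, terms=200):
    """Approximate log Z, with Z = sum_x exp(eta * x) / x!, by a truncated sum."""
    # Work in log space: log of each term is eta * x - log(x!).
    log_terms = [eta * x - math.lgamma(x + 1) for x in range(terms)]
    m = max(log_terms)
    # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

eta = math.log(3.5)              # natural parameter for rate lambda = 3.5
A_numeric = log_partition_by_summation(eta)
A_closed_form = math.exp(eta)    # closed form A(eta) = e^eta for the Poisson family
```

The two values agree to within rounding error, and <math>g(\eta) = 1/Z = e^{-\lambda}</math> is the familiar Poisson normalizing factor.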
Moment-generating function of the sufficient statistic
Now, the moment-generating function of <math>T(x)</math> is
<math display="block">\begin{align}
M_T(u) &\equiv \operatorname{E} \left[ \exp\left(u^\mathsf{T} T(x)\right) \mid \eta\right] \\
&= \int_x h(x) \, \exp\left[(\eta+u)^\mathsf{T} T(x)-A(\eta)\right] \, dx \\[1ex]
&= e^{A(\eta + u)-A(\eta)}
\end{align}</math>
proving the earlier statement that
<math display="block">K(u \mid \eta) = A(\eta+u) - A(\eta)</math>
is the cumulant generating function for <math>T</math>.
An important subclass of exponential families are the natural exponential families, which have a similar form for the moment-generating function for the distribution of <math>x</math>.
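The identity <math>M_T(u) = e^{A(\eta+u) - A(\eta)}</math> can be verified directly for a small discrete family. The sketch below uses the Bernoulli family, assuming <math>T(x) = x</math>, <math>\eta = \log\tfrac{p}{1-p}</math> and <math>A(\eta) = \log(1 + e^\eta)</math>; the helper names are illustrative.

```python
import math

def A(eta):
    """Log-partition function of the Bernoulli family: log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def mgf_direct(u, p):
    """E[exp(u * T(x))] summed over x in {0, 1}, with T(x) = x."""
    return (1 - p) * math.exp(u * 0) + p * math.exp(u * 1)

p = 0.3
eta = math.log(p / (1 - p))   # natural parameter (log-odds)
u = 0.8

mgf_from_A = math.exp(A(eta + u) - A(eta))  # right-hand side of the identity
mgf_summed = mgf_direct(u, p)               # expectation computed by enumeration
```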
Differential identities for cumulants
In particular, using the properties of the cumulant generating function,
<math display="block"> \operatorname{E}(T_j) = \frac{ \partial A(\eta) }{ \partial \eta_j } </math>
and
<math display="block"> \operatorname{cov}\left (T_i,\, T_j \right) = \frac{ \partial^2 A(\eta) }{ \partial \eta_i \, \partial \eta_j }. </math>
The first two raw moments and all mixed second moments can be recovered from these two identities. Higher-order moments and cumulants are obtained by higher derivatives. This technique is often useful when <math>T</math> is a complicated function of the data, whose moments are difficult to calculate by integration.
Another way to see this that does not rely on the theory of cumulants is to begin from the fact that the distribution of an exponential family must be normalized, and differentiate. We illustrate using the simple case of a one-dimensional parameter, but an analogous derivation holds more generally.
In the one-dimensional case, we have
<math display="block">p(x) = g(\eta) h(x) e^{\eta T(x)} .</math>
This must be normalized, so
<math display="block">1 = \int_x p(x) \,dx = \int_x g(\eta) h(x) e^{\eta T(x)} \,dx = g(\eta) \int_x h(x) e^{\eta T(x)} \,dx .</math>
Take the derivative of both sides with respect to <math>\eta</math>:
<math display="block">\begin{align}
0 &= g(\eta) \frac{d}{d\eta} \int_x h(x) e^{\eta T(x)} \,dx + g'(\eta)\int_x h(x) e^{\eta T(x)} \,dx \\[1ex]
&= g(\eta) \int_x h(x) \left(\frac{d}{d\eta} e^{\eta T(x)}\right) \,dx + g'(\eta)\int_x h(x) e^{\eta T(x)} \, dx \\[1ex]
&= g(\eta) \int_x h(x) e^{\eta T(x)} T(x) \,dx + g'(\eta)\int_x h(x) e^{\eta T(x)} \, dx \\[1ex]
&= \int_x T(x) g(\eta) h(x) e^{\eta T(x)} \,dx + \frac{g'(\eta)}{g(\eta)}\int_x g(\eta) h(x) e^{\eta T(x)} \, dx \\[1ex]
&= \int_x T(x) p(x) \,dx + \frac{g'(\eta)}{g(\eta)}\int_x p(x) \, dx \\[1ex]
&= \operatorname{E}[T(x)] + \frac{g'(\eta)}{g(\eta)} \\[1ex]
&= \operatorname{E}[T(x)] + \frac{d}{d\eta} \log g(\eta)
\end{align}</math>
Therefore,
<math display="block">\operatorname{E}[T(x)] = - \frac{d}{d\eta} \log g(\eta) = \frac{d}{d\eta} A(\eta).</math>
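The differential identities can be checked numerically with finite differences. The sketch below assumes a binomial family with a fixed, known number of trials, for which <math>A(\eta) = n \log(1 + e^\eta)</math> with <math>\eta</math> the log-odds; the step size and helper names are choices made for this example.

```python
import math

N_TRIALS = 10  # number of trials, assumed fixed and known

def A(eta):
    """Log-partition function of the binomial family with N_TRIALS trials."""
    return N_TRIALS * math.log1p(math.exp(eta))

def derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

p = 0.3
eta = math.log(p / (1 - p))

mean_from_A = derivative(A, eta)         # should approximate E[T] = n p
var_from_A = second_derivative(A, eta)   # should approximate Var(T) = n p (1 - p)
mean_exact = N_TRIALS * p
var_exact = N_TRIALS * p * (1 - p)
```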
Example 1
As an introductory example, consider the gamma distribution, whose density is given by
<math display="block">p(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-\beta x}.</math>
Referring to the above table, we can see that the natural parameter is given by
<math display="block">\begin{align}
\eta_1 &= \alpha-1, \\
\eta_2 &= -\beta,
\end{align}</math>
the reverse substitutions are
<math display="block">\begin{align}
\alpha &= \eta_1+1, \\
\beta &= -\eta_2,
\end{align}</math>
the sufficient statistics are <math>(\log x, x)</math>, and the log-partition function is
<math display="block">A(\eta_1,\eta_2) = \log \Gamma(\eta_1+1)-(\eta_1+1)\log(-\eta_2).</math>
We can find the mean of the sufficient statistics as follows. First, for <math>\eta_1</math>:
<math display="block">\begin{align}
\operatorname{E}[\log x]
&= \frac{ \partial }{ \partial \eta_1 } A(\eta_1,\eta_2) \\[0.5ex]
&= \frac{ \partial }{ \partial \eta_1 } \left[\log\Gamma(\eta_1+1) - (\eta_1+1) \log(-\eta_2)\right] \\[1ex]
&= \psi(\eta_1+1) - \log(-\eta_2) \\[1ex]
&= \psi(\alpha) - \log \beta,
\end{align}</math>
where <math>\psi(x)</math> is the digamma function (derivative of log gamma), and we used the reverse substitutions in the last step.
Now, for <math>\eta_2</math>:
<math display="block">\begin{align}
\operatorname{E}[x] &= \frac{ \partial }{ \partial \eta_2 } A(\eta_1, \eta_2) \\[1ex]
&= \frac{ \partial }{ \partial \eta_2 } \left[\log \Gamma(\eta_1+1) - (\eta_1 + 1) \log(-\eta_2)\right] \\[1ex]
&= -(\eta_1+1)\frac{1}{-\eta_2}(-1)
= \frac{\eta_1+1}{-\eta_2}
= \frac{\alpha}{\beta},
\end{align}</math>
again making the reverse substitution in the last step.
To compute the variance of <math>x</math>, we just differentiate again:
<math display="block">\begin{align}
\operatorname{Var}(x)
&= \frac{\partial^2 }{\partial \eta_2^2} A{\left(\eta_1,\eta_2 \right)}
= \frac{\partial}{\partial \eta_2} \frac{\eta_1+1}{-\eta_2} \\[1ex]
&= \frac{\eta_1+1}{\eta_2^2}
= \frac{\alpha}{\beta^2}.
\end{align}</math>
All of these calculations can be done using integration, making use of various properties of the gamma function, but this requires significantly more work.
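The results of this example can be reproduced numerically by differentiating the log-partition function with finite differences. The following sketch uses only the formulas stated above; the parameter values <math>\alpha = 3</math>, <math>\beta = 2</math> and the step size are arbitrary choices for illustration.

```python
import math

def A(eta1, eta2):
    """Log-partition function of the gamma family in natural parameters."""
    return math.lgamma(eta1 + 1) - (eta1 + 1) * math.log(-eta2)

alpha, beta = 3.0, 2.0
eta1, eta2 = alpha - 1, -beta   # natural parameters
h = 1e-4                        # finite-difference step

# E[x] = dA/d(eta2), expected to approximate alpha / beta
mean_x = (A(eta1, eta2 + h) - A(eta1, eta2 - h)) / (2 * h)

# Var(x) = d^2 A / d(eta2)^2, expected to approximate alpha / beta^2
var_x = (A(eta1, eta2 + h) - 2 * A(eta1, eta2) + A(eta1, eta2 - h)) / h**2

# E[log x] = dA/d(eta1), expected to approximate psi(alpha) - log(beta)
mean_log_x = (A(eta1 + h, eta2) - A(eta1 - h, eta2)) / (2 * h)
```

For integer <math>\alpha</math> the digamma value is available in closed form (<math>\psi(3) = \tfrac{3}{2} - \gamma</math>, with <math>\gamma</math> the Euler–Mascheroni constant), which gives an independent check on `mean_log_x`.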
Example 2
As another example consider a real valued random variable with density
<math display="block"> p_\theta (x) = \frac{ \theta e^{-x} }{\left(1 + e^{-x} \right)^{\theta + 1} } </math>
indexed by shape parameter <math> \theta \in (0,\infty) </math> (this is called the skew-logistic distribution). The density can be rewritten as
<math display="block"> \frac{ e^{-x} } { 1 + e^{-x} } \exp\left[-\theta \log\left(1 + e^{-x} \right) + \log(\theta)\right] </math>
Notice this is an exponential family with natural parameter
<math display="block"> \eta = -\theta,</math>
sufficient statistic
<math display="block"> T = \log\left (1 + e^{-x} \right),</math>
and log-partition function
<math display="block"> A(\eta) = -\log(\theta) = -\log(-\eta)</math>
So using the first identity,
<math display="block"> \operatorname{E}\left[\log\left(1 + e^{-X}\right)\right]
= \operatorname{E}(T)
= \frac{\partial A(\eta)}{\partial \eta}
= \frac{ \partial }{ \partial \eta } [-\log(-\eta)]
= \frac{1}{-\eta}
= \frac{1}{\theta}, </math>
and using the second identity
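<math display="block"> \operatorname{var}\left[\log\left(1 + e^{-X} \right)\right]
= \frac{\partial^2 A(\eta)}{\partial \eta^2}
= \frac{\partial}{\partial \eta} \left[\frac{1}{-\eta}\right]
= \frac{1}{(-\eta)^2}
= \frac{1}{\theta^2}.</math>
Example 3
As a final example, consider the Wishart distribution over symmetric positive-definite <math>p \times p</math> matrices <math>\mathbf{X}</math>, with scale matrix <math>\mathbf{V}</math> and <math>n</math> degrees of freedom. Its natural parameters are <math>\boldsymbol\eta_1 = -\tfrac12 \mathbf{V}^{-1}</math> and <math>\eta_2 = \tfrac{n-p-1}{2}</math>, with sufficient statistics <math>\mathbf{X}</math> and <math>\log|\mathbf{X}|</math>.
;Expectation of <math>\mathbf{X}</math> (associated with <math>\boldsymbol\eta_1</math>)
The derivative with respect to <math>\boldsymbol\eta_1</math> uses the matrix-calculus identity
<math display="block">\frac{\partial \log |a\mathbf{X}|}{\partial \mathbf{X}} = (\mathbf{X}^{-1})^\mathsf{T}</math>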
Then:
<math display="block">\begin{align}
\operatorname{E}[\mathbf{X}]
&= \frac{\partial}{\partial \boldsymbol{\eta}_1} A\left(\boldsymbol\eta_1,\ldots \right) \\[1ex]
&= \frac{\partial}{\partial \boldsymbol{\eta}_1} \left[-\frac{n}{2} \log\left|-\boldsymbol\eta_1\right| + \log\Gamma_p{\left(\frac{n}{2}\right)} \right] \\[1ex]
&= -\frac{n}{2} ( \boldsymbol{\eta}_1^{-1})^\mathsf{T} \\[1ex]
&= \frac{n}{2} (-\boldsymbol{\eta}_1^{-1})^\mathsf{T} \\[1ex]
&= n(\mathbf{V})^\mathsf{T} \\[1ex]
&= n\mathbf{V}
\end{align}</math>
The last line uses the fact that V is symmetric, and therefore it is the same when transposed.
;Expectation of <math>\log|\mathbf{X}|</math> (associated with <math>\eta_2</math>)
Now, for <math>\eta_2</math>, we first need to expand the part of the log-partition function that involves the multivariate gamma function:
<math display="block"> \begin{align}
\log \Gamma_p(a)
&= \log \left(\pi^{\frac{p(p-1)}{4}} \prod_{j=1}^p \Gamma{\left(a + \frac{1-j}{2}\right)}\right) \\
&= \frac{p(p-1)}{4} \log \pi + \sum_{j=1}^p \log \Gamma{\left(a + \frac{1-j}{2}\right)}
\end{align} </math>
We also need the digamma function:
<math display="block">\psi(x) = \frac{d}{dx} \log \Gamma(x).</math>
Then:
<math display="block">\begin{align}
\operatorname{E}[\log |\mathbf{X}|] &= \frac{\partial}{\partial \eta_2} A\left (\ldots,\eta_2 \right) \\[1ex]
&= \frac{\partial}{\partial \eta_2} \left[\left(\eta_2 + \frac{p+1}{2}\right) \log\left(2^p \left|\mathbf{V}\right|\right) + \log\Gamma_p{\left(\eta_2+\frac{p+1}{2}\right)} \right] \\[1ex]
&= \frac{\partial}{\partial \eta_2} \left[\left(\eta_2 + \frac{p+1}{2}\right) \log\left(2^p \left|\mathbf{V}\right|\right)\right]
+ \frac{\partial}{\partial \eta_2} \left[\frac{p(p-1)}{4} \log \pi\right] \\
&\hphantom{=} + \frac{\partial}{\partial \eta_2} \sum_{j=1}^p \log \Gamma{\left(\eta_2 + \frac{p+1}{2} + \frac{1-j}{2}\right)} \\[1ex]
&= p\log 2 + \log|\mathbf{V}| + \sum_{j=1}^p \psi{\left(\eta_2 + \frac{p+1}{2} + \frac{1-j}{2}\right)} \\[1ex]
&= p\log 2 + \log|\mathbf{V}| + \sum_{j=1}^p \psi{\left(\frac{n-p-1}{2} + \frac{p+1}{2} + \frac{1-j}{2}\right)} \\[1ex]
&= p\log 2 + \log|\mathbf{V}| + \sum_{j=1}^p \psi{\left(\frac{n+1-j}{2}\right)}
\end{align}</math>
This latter formula is listed in the Wishart distribution article. Both of these expectations are needed when deriving the variational Bayes update equations in a Bayes network involving a Wishart distribution (which is the conjugate prior of the multivariate normal distribution).
Computing these formulas using integration would be much more difficult. The first one, for example, would require matrix integration.
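In the scalar case <math>p = 1</math> both expectations can be sanity-checked by simulation. The sketch below assumes that a <math>1 \times 1</math> Wishart variable with scale <math>V</math> and <math>n</math> degrees of freedom is a gamma variable with shape <math>n/2</math> and scale <math>2V</math>; the digamma function is approximated by a finite difference of the log-gamma function, and the sample size and seed are arbitrary.

```python
import math
import random

def digamma(x, h=1e-5):
    """Approximate digamma by a central difference of log-gamma."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def expected_log_x(n, V):
    """E[log X] from the formula above, specialised to p = 1."""
    return math.log(2) + math.log(V) + digamma(n / 2)

n, V = 5, 1.5
rng = random.Random(0)  # fixed seed so the run is repeatable

# random.gammavariate takes (shape, scale)
samples = [rng.gammavariate(n / 2, 2 * V) for _ in range(200_000)]

mc_mean = sum(samples) / len(samples)
mc_mean_log = sum(math.log(s) for s in samples) / len(samples)

formula_mean = n * V                   # E[X] = n V
formula_mean_log = expected_log_x(n, V)
```

The Monte Carlo averages should agree with the formulas only up to sampling error, so any comparison needs a tolerance appropriate to the sample size.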
Entropy
Relative entropy
The relative entropy (Kullback–Leibler divergence, KL divergence) of two distributions in an exponential family has a simple expression as the Bregman divergence between the natural parameters with respect to the log-normalizer. The relative entropy is defined in terms of an integral, while the Bregman divergence is defined in terms of a derivative and inner product, and thus is easier to calculate and has a closed-form expression (assuming the derivative has a closed-form expression). Further, the Bregman divergence in terms of the natural parameters and the log-normalizer equals the Bregman divergence of the dual parameters (expectation parameters), in the opposite order, for the convex conjugate function.
Fixing an exponential family with log-normalizer <math>A</math> (with convex conjugate <math>A^*</math>), writing <math>P_{A,\theta}</math> for the distribution in this family corresponding to a fixed value <math>\theta</math> of the natural parameter (writing <math>\theta'</math> for another value, and with <math>\eta, \eta'</math> for the corresponding dual expectation/moment parameters), writing <math>\operatorname{KL}</math> for the KL divergence, and <math>B_A</math> for the Bregman divergence, the divergences are related as:
<math display="block">\operatorname{KL}(P_{A,\theta} \parallel P_{A,\theta'}) = B_A(\theta' \parallel \theta) = B_{A^*}(\eta \parallel \eta').</math>
The KL divergence is conventionally written with respect to the first parameter, while the Bregman divergence is conventionally written with respect to the second parameter, and thus this can be read as "the relative entropy is equal to the Bregman divergence defined by the log-normalizer on the swapped natural parameters", or equivalently as "equal to the Bregman divergence defined by the dual to the log-normalizer on the expectation parameters".
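The relation can be checked numerically for the Bernoulli family. The sketch below assumes <math>A(\theta) = \log(1 + e^\theta)</math> with <math>\theta</math> the log-odds, expectation parameter <math>\eta = p</math>, and convex conjugate <math>A^*(\eta) = \eta\log\eta + (1-\eta)\log(1-\eta)</math>; the function names are illustrative.

```python
import math

def A(theta):
    """Log-normalizer of the Bernoulli family."""
    return math.log1p(math.exp(theta))

def grad_A(theta):
    """Derivative of A: the expectation parameter p."""
    return 1.0 / (1.0 + math.exp(-theta))

def A_star(eta):
    """Convex conjugate of A: negative entropy of a Bernoulli(eta)."""
    return eta * math.log(eta) + (1 - eta) * math.log(1 - eta)

def grad_A_star(eta):
    """Derivative of A*: the log-odds."""
    return math.log(eta / (1 - eta))

def bregman(F, grad_F, a, b):
    """B_F(a || b) = F(a) - F(b) - grad_F(b) * (a - b)."""
    return F(a) - F(b) - grad_F(b) * (a - b)

def kl_bernoulli(p, q):
    """KL divergence from its definition as a sum over outcomes."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p, q = 0.3, 0.6
theta = math.log(p / (1 - p))
theta_prime = math.log(q / (1 - q))

kl_direct = kl_bernoulli(p, q)
kl_natural = bregman(A, grad_A, theta_prime, theta)   # swapped natural parameters
kl_dual = bregman(A_star, grad_A_star, p, q)          # expectation parameters, same order
```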
Maximum-entropy derivation
Exponential families arise naturally as the answer to the following question: what is the maximum-entropy distribution consistent with given constraints on expected values?
The information entropy of a probability distribution <math>dF(x)</math> can only be computed with respect to some other probability distribution (or, more generally, a positive measure), and both measures must be mutually absolutely continuous. Accordingly, we need to pick a reference measure <math>dH(x)</math> with the same support as <math>dF(x)</math>.
The entropy of <math>dF(x)</math> relative to <math>dH(x)</math> is
<math display="block">S[dF\mid dH] = -\int \frac{dF}{dH}\log\frac{dF}{dH}\,dH</math>
or
<math display="block">S[dF\mid dH] = \int\log\frac{dH}{dF}\,dF</math>
where <math>dF/dH</math> and <math>dH/dF</math> are Radon–Nikodym derivatives. The ordinary definition of entropy for a discrete distribution supported on a set <math>I</math>, namely
<math display="block">S = - \sum_{i\in I} p_i \log p_i</math>
assumes, though this is seldom pointed out, that <math>dH</math> is chosen to be the counting measure on <math>I</math>.
Consider now a collection of observable quantities (random variables) <math>T_i</math>. The probability distribution <math>dF</math> whose entropy with respect to <math>dH</math> is greatest, subject to the conditions that the expected value of <math>T_i</math> be equal to <math>t_i</math>, is an exponential family with <math>dH</math> as reference measure and <math>(T_1, \ldots, T_n)</math> as sufficient statistic.
The derivation is a simple variational calculation using Lagrange multipliers. Normalization is imposed by letting <math>T_0 = 1</math> be one of the constraints. The natural parameters of the distribution are the Lagrange multipliers, and the normalization factor is the Lagrange multiplier associated to <math>T_0</math>.
For examples of such derivations, see Maximum entropy probability distribution.
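A small numerical illustration is the maximum-entropy distribution on the faces of a die subject to a prescribed mean. The sketch below assumes the counting measure as reference measure and a single constraint <math>\operatorname{E}[i] = 4.5</math>, so the solution has the exponential-family form <math>p_i \propto e^{\lambda i}</math>; the multiplier <math>\lambda</math> is found by bisection, and all function names are illustrative.

```python
import math

FACES = [1, 2, 3, 4, 5, 6]

def tilted_distribution(lam):
    """Exponential-family distribution p_i proportional to exp(lam * i)."""
    weights = [math.exp(lam * i) for i in FACES]
    Z = sum(weights)  # partition function
    return [w / Z for w in weights]

def mean(dist):
    return sum(i * p for i, p in zip(FACES, dist))

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)

def solve_multiplier(target_mean, lo=-10.0, hi=10.0, iters=200):
    """Bisection on lam; the mean is increasing in lam (its derivative is a variance)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(tilted_distribution(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lam = solve_multiplier(4.5)
p_maxent = tilted_distribution(lam)

# Another distribution with the same mean, for comparison of entropies.
other = [0.0, 0.0, 0.0, 0.5, 0.5, 0.0]
```

Any other distribution satisfying the same mean constraint, such as `other`, has lower entropy than `p_maxent`.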
Role in statistics
Classical estimation: sufficiency
According to the Pitman–Koopman–Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases.
Less tersely, suppose <math>X_k</math> (where <math>k = 1, 2, 3, \ldots, n</math>) are independent, identically distributed random variables. Only if their distribution belongs to an exponential family is there a sufficient statistic whose number of scalar components does not increase as the sample size <math>n</math> increases; the statistic may be a vector or a single scalar number, but whatever it is, its size will neither grow nor shrink when more data are obtained.
As a counterexample if these conditions are relaxed, the family of uniform distributions (either discrete or continuous, with either or both bounds unknown) has a sufficient statistic, namely the sample maximum, sample minimum, and sample size, but does not form an exponential family, as the domain varies with the parameters.
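The fixed dimension of the sufficient statistic is what makes streaming estimation possible in exponential families. The sketch below assumes a normal model with unknown mean and variance, for which the triple (count, sum, sum of squares) is sufficient; the function names and the data are made up for illustration.

```python
def update(stats, x):
    """Fold one observation into the fixed-size statistic (n, sum x, sum x^2)."""
    n, s1, s2 = stats
    return (n + 1, s1 + x, s2 + x * x)

def mle_from_stats(stats):
    """Maximum-likelihood mean and variance computed from the statistic alone."""
    n, s1, s2 = stats
    mu = s1 / n
    var = s2 / n - mu * mu
    return mu, var

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

stats = (0, 0.0, 0.0)
for x in data:
    stats = update(stats, x)   # the statistic never grows beyond three numbers

mu_hat, var_hat = mle_from_stats(stats)
```

The estimates depend on the data only through `stats`, so the individual observations can be discarded as they arrive. The subtraction in `mle_from_stats` can lose precision when the variance is small relative to the mean, so this form is meant only to illustrate sufficiency.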
Bayesian estimation: conjugate distributions
Exponential families are also important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalised to produce a posterior distribution. In the case of a likelihood which belongs to an exponential family there exists a conjugate prior, which is often also in an exponential family. A conjugate prior π for the parameter <math>\boldsymbol\eta</math> of an exponential family
<math display="block"> f(x \mid \boldsymbol\eta) = h(x) \, \exp \left[ {\boldsymbol\eta}^\mathsf{T} \mathbf{T}(x) - A(\boldsymbol\eta) \right]</math>
is given by
<math display="block">p_\pi(\boldsymbol\eta \mid \boldsymbol\chi,\nu) = f(\boldsymbol\chi,\nu) \, \exp \left[ \boldsymbol\eta^\mathsf{T} \boldsymbol\chi - \nu A(\boldsymbol\eta) \right],</math>
or equivalently
<math display="block">p_\pi(\boldsymbol\eta \mid \boldsymbol\chi,\nu) = f(\boldsymbol\chi,\nu) \, g(\boldsymbol\eta)^\nu \, \exp \left (\boldsymbol\eta^\mathsf{T} \boldsymbol\chi \right ), \qquad \boldsymbol\chi \in \mathbb{R}^s</math>
where s is the dimension of <math>\boldsymbol\eta</math> and <math>\nu > 0 </math> and <math>\boldsymbol\chi</math> are hyperparameters (parameters controlling parameters). <math>\nu</math> corresponds to the effective number of observations that the prior distribution contributes, and <math>\boldsymbol\chi</math> corresponds to the total amount that these pseudo-observations contribute to the sufficient statistic over all observations and pseudo-observations. <math>f(\boldsymbol\chi,\nu)</math> is a normalization constant that is automatically determined by the remaining functions and serves to ensure that the given function is a probability density function (i.e. it is normalized). <math>A(\boldsymbol\eta)</math> and equivalently <math>g(\boldsymbol\eta)</math> are the same functions as in the definition of the distribution over which π is the conjugate prior.
A conjugate prior is one which, when combined with the likelihood and normalised, produces a posterior distribution which is of the same type as the prior. For example, if one is estimating the success probability of a binomial distribution, then if one chooses to use a beta distribution as one's prior, the posterior is another beta distribution. This makes the computation of the posterior particularly simple. Similarly, if one is estimating the parameter of a Poisson distribution the use of a gamma prior will lead to another gamma posterior. Conjugate priors are often very flexible and can be very convenient. However, if one's belief about the likely value of the theta parameter of a binomial is represented by (say) a bimodal (two-humped) prior distribution, then this cannot be represented by a beta distribution. It can however be represented by using a mixture density as the prior, here a combination of two beta distributions; this is a form of hyperprior.
An arbitrary likelihood will not belong to an exponential family, and thus in general no conjugate prior exists. The posterior will then have to be computed by numerical methods.
To show that the above prior distribution is a conjugate prior, we can derive the posterior.
First, assume that the probability of a single observation follows an exponential family, parameterized using its natural parameter:
<math display="block"> p_F(x\mid\boldsymbol \eta) = h(x) \, g(\boldsymbol\eta) \, \exp\left[\boldsymbol\eta^\mathsf{T} \mathbf{T}(x)\right]</math>
Then, for data <math>\mathbf{X} = (x_1,\ldots,x_n)</math>, the likelihood is computed as follows:
<math display="block">p(\mathbf{X}\mid\boldsymbol\eta) = \left(\prod_{i=1}^n h(x_i) \right) g(\boldsymbol\eta)^n \exp\left(\boldsymbol\eta^\mathsf{T}\sum_{i=1}^n \mathbf{T}(x_i) \right)</math>
Then, for the above conjugate prior:
<math display="block">\begin{align}
p_\pi(\boldsymbol\eta\mid\boldsymbol\chi,\nu)
&= f(\boldsymbol\chi,\nu) g(\boldsymbol\eta)^\nu \exp(\boldsymbol\eta^\mathsf{T} \boldsymbol\chi) \propto g(\boldsymbol\eta)^\nu \exp(\boldsymbol\eta^\mathsf{T} \boldsymbol\chi)
\end{align}</math>
We can then compute the posterior as follows:
<math display="block">\begin{align}
p(\boldsymbol\eta\mid\mathbf{X},\boldsymbol\chi,\nu)& \propto p(\mathbf{X}\mid\boldsymbol\eta) p_\pi(\boldsymbol\eta\mid\boldsymbol\chi,\nu) \\
&= \left(\prod_{i=1}^n h(x_i) \right) g(\boldsymbol\eta)^n \exp\left(\boldsymbol\eta^\mathsf{T} \sum_{i=1}^n \mathbf{T}(x_i)\right)
f(\boldsymbol\chi,\nu) g(\boldsymbol\eta)^\nu \exp(\boldsymbol\eta^\mathsf{T} \boldsymbol\chi) \\
&\propto g(\boldsymbol\eta)^n \exp\left(\boldsymbol\eta^\mathsf{T}\sum_{i=1}^n \mathbf{T}(x_i)\right) g(\boldsymbol\eta)^\nu \exp(\boldsymbol\eta^\mathsf{T} \boldsymbol\chi) \\
&= g(\boldsymbol\eta)^{\nu + n} \exp\left(\boldsymbol\eta^\mathsf{T} \left(\boldsymbol\chi + \sum_{i=1}^n \mathbf{T}(x_i)\right)\right)
\end{align}</math>
The last line is the kernel of the posterior distribution, i.e.
<math display="block">p(\boldsymbol\eta\mid\mathbf{X},\boldsymbol\chi,\nu) = p_\pi\left(\boldsymbol\eta\left|~\boldsymbol\chi + \sum_{i=1}^n \mathbf{T}(x_i), \nu + n \right.\right)</math>
This shows that the posterior has the same form as the prior.
The data enters into this equation only in the expression
<math display="block">\mathbf{T}(\mathbf{X}) = \sum_{i=1}^n \mathbf{T}(x_i),</math>
which is termed the sufficient statistic of the data. That is, the value of the sufficient statistic is sufficient to completely determine the posterior distribution. The actual data points themselves are not needed, and all sets of data points with the same sufficient statistic will have the same distribution. This is important because the dimension of the sufficient statistic does not grow with the data size — it has only as many components as the components of <math>\boldsymbol\eta</math> (equivalently, the number of parameters of the distribution of a single data point).
The update equations are as follows:
<math display="block">\begin{align}
\boldsymbol\chi' &= \boldsymbol\chi + \mathbf{T}(\mathbf{X}) \\
&= \boldsymbol\chi + \sum_{i=1}^n \mathbf{T}(x_i) \\
\nu' &= \nu + n
\end{align} </math>
This shows that the update equations can be written simply in terms of the number of data points and the sufficient statistic of the data. This can be seen clearly in the various examples of update equations shown in the conjugate prior page. Because of the way that the sufficient statistic is computed, it necessarily involves sums of components of the data (in some cases disguised as products or other forms — a product can be written in terms of a sum of logarithms). The cases where the update equations for particular distributions don't exactly match the above forms are cases where the conjugate prior has been expressed using a different parameterization than the one that produces a conjugate prior of the above form — often specifically because the above form is defined over the natural parameter <math>\boldsymbol\eta</math> while conjugate priors are usually defined over the actual parameter <math>\boldsymbol\theta .</math>
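The update equations translate directly into code. The sketch below applies them to a Bernoulli likelihood with <math>T(x) = x</math>; under the change of variables from <math>\eta</math> to the success probability, the prior with hyperparameters <math>(\chi, \nu)</math> corresponds to a beta distribution with parameters <math>(\chi, \nu - \chi)</math>, which is an assumption of this example worked out for the Bernoulli case only. The function names are illustrative.

```python
def conjugate_update(chi, nu, data, T):
    """Return (chi', nu') = (chi + sum_i T(x_i), nu + n)."""
    chi_new = list(chi)
    for x in data:
        for j, t in enumerate(T(x)):
            chi_new[j] += t          # accumulate the sufficient statistic
    return chi_new, nu + len(data)

def bernoulli_T(x):
    """Sufficient statistic of a single Bernoulli observation."""
    return [x]

# Prior hyperparameters; in the beta parameterisation this is Beta(2, 3).
chi, nu = [2.0], 5.0

data = [1, 0, 1, 1, 0, 1]            # 4 successes, 2 failures
chi_post, nu_post = conjugate_update(chi, nu, data, bernoulli_T)

# Same data in a different order: same sufficient statistic, same posterior.
chi_perm, nu_perm = conjugate_update(chi, nu, sorted(data), bernoulli_T)

# Beta parameters of the posterior under the assumed mapping (chi, nu - chi).
a_post, b_post = chi_post[0], nu_post - chi_post[0]
```

The resulting beta parameters match the familiar rule of adding the number of successes to the first parameter and the number of failures to the second.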
Unbiased estimation
If the likelihood <math>z|\eta \sim e^{\eta z} f_1(\eta) f_0(z)</math> is an exponential family, then the unbiased estimator of <math>\eta</math> is <math>-\frac{d}{dz} \ln f_0(z)</math>.
Hypothesis testing: uniformly most powerful tests
A one-parameter exponential family has a monotone non-decreasing likelihood ratio in the sufficient statistic <math>T(x)</math>, provided that <math>\eta(\theta)</math> is non-decreasing. As a consequence, there exists a uniformly most powerful test for testing the hypothesis <math>H_0 : \theta \geq \theta_0</math> vs. <math>H_1 : \theta < \theta_0</math>.
Generalized linear models
Exponential families form the basis for the distribution functions used in generalized linear models (GLM), a class of model that encompasses many of the commonly used regression models in statistics. Examples include logistic regression using the binomial family and Poisson regression.
See also
- Exponential dispersion model
- Gibbs measure
- Modified half-normal distribution
- Natural exponential family
External links
- A primer on the exponential family of distributions
- Exponential family of distributions on the Earliest known uses of some of the words of mathematics
- jMEF: A Java library for exponential families
- Graphical Models, Exponential Families, and Variational Inference by Wainwright and Jordan (2008)
