The principle of maximum entropy states that, among all probability distributions consistent with a given set of constraints (such as normalization or specified expectation values), the distribution that maximizes Shannon entropy should be selected. This yields the least committal distribution compatible with the known constraints, introducing no structure beyond what is logically implied by the available information.
The justification is that entropy measures the expected information content (or log-surprise) of outcomes relative to a specified reference measure. Maximizing entropy ensures that no additional structure is imposed beyond the stated constraints. Any lower-entropy alternative would encode extra regularity not required by those constraints and would therefore amount to introducing unsupported information.
It is important that entropy be defined relative to a specified measure or prior. In discrete cases, Shannon entropy is defined relative to the counting measure (or an explicitly specified prior weighting). In continuous cases, differential entropy depends on the choice of coordinates and is not invariant under reparameterization. For this reason, the principled continuous formulation maximizes relative entropy (equivalently, minimizes Kullback–Leibler divergence) with respect to a specified reference measure or prior density m(x), typically by maximizing
<math> -\int p(x)\,\log\frac{p(x)}{m(x)}\,dx </math>
subject to the given constraints. This formulation is invariant under change of variables and makes explicit the role of the underlying prior measure.
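The role of the reference measure can be illustrated with a discrete analogue of the integral above. The following is a minimal Python sketch, not a definitive implementation; the function names and the example distributions are assumptions chosen for illustration. It evaluates the objective for a fixed distribution against two different reference weightings and shows that, with a uniform reference over n outcomes, the objective equals the Shannon entropy minus log n.

```python
import math

def relative_entropy_objective(p, m):
    """Return -sum_i p_i * log(p_i / m_i), a discrete analogue of the integral."""
    return -sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)

def shannon_entropy(p):
    """Shannon entropy in nats, relative to the counting measure."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Illustrative (hypothetical) distributions over four outcomes.
p = [0.5, 0.25, 0.125, 0.125]
uniform = [0.25] * 4
skewed = [0.4, 0.3, 0.2, 0.1]

obj_uniform = relative_entropy_objective(p, uniform)   # H(p) - log(4)
obj_skewed = relative_entropy_objective(p, skewed)     # differs: the reference matters
obj_self = relative_entropy_objective(skewed, skewed)  # 0, the largest possible value

print(obj_uniform, obj_skewed, obj_self)
```

The objective is never positive and reaches zero only when p coincides with the reference m, so in the absence of further constraints the maximizer is the reference distribution itself.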
History
The principle was first expounded by E. T. Jaynes in two papers in 1957, where he emphasized a natural correspondence between statistical mechanics and information theory. In particular, Jaynes argued that the Gibbsian method of statistical mechanics is sound, on the grounds that the entropy of statistical mechanics and the information entropy of information theory are the same concept. Consequently, statistical mechanics should be considered a particular application of a general tool of logical inference and information theory.
Overview
In most practical cases, the stated prior data or testable information is given by a set of conserved quantities (average values of some moment functions) associated with the probability distribution in question. This is the way the maximum entropy principle is most often used in statistical thermodynamics. Another possibility is to prescribe some symmetries of the probability distribution. The equivalence between conserved quantities and corresponding symmetry groups implies a similar equivalence for these two ways of specifying the testable information in the maximum entropy method.
The maximum entropy principle is also needed to guarantee the uniqueness and consistency of probability assignments obtained by different methods, statistical mechanics and logical inference in particular.
The maximum entropy principle makes explicit our freedom in using different forms of prior data. As a special case, a uniform prior probability density (Laplace's principle of indifference, sometimes called the principle of insufficient reason) may be adopted. Thus, the maximum entropy principle is not merely an alternative way to view the usual methods of inference of classical statistics, but represents a significant conceptual generalization of those methods.
However, these statements do not imply that thermodynamical systems need not be shown to be ergodic to justify treatment as a statistical ensemble.
In ordinary language, the principle of maximum entropy can be said to express a claim of epistemic modesty, or of maximum ignorance. The selected distribution is the one that makes the least claim to being informed beyond the stated prior data, that is to say the one that admits the most ignorance beyond the stated prior data.
Testable information
The principle of maximum entropy is explicitly useful only when applied to testable information. Testable information is a statement about a probability distribution whose truth or falsity is well-defined. For example, the statements
:the expectation of the variable <math>x</math> is 2.87
and
<math display="block">p_2 + p_3 > 0.6</math>
(where <math>p_2</math> and <math>p_3</math> are probabilities of events) are statements of testable information.
Given testable information, the maximum entropy procedure consists of seeking the probability distribution which maximizes information entropy, subject to the constraints of the information. This constrained optimization problem is typically solved using the method of Lagrange multipliers.
Entropy maximization with no testable information respects the universal "constraint" that the sum of the probabilities is one. Under this constraint, the maximum entropy discrete probability distribution is the uniform distribution,
<math display="block">p_i=\frac{1}{n}\ {\rm for\ all}\ i\in\{\,1,\dots,n\,\}.</math>
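This can be checked numerically. The sketch below is an illustrative Python example, with the choice n = 6, the random seed, and the function names being assumptions. It compares the entropy of the uniform distribution with the entropies of many randomly generated distributions on the same n outcomes.

```python
import math
import random

def shannon_entropy(p):
    """Shannon entropy in nats; terms with p_i = 0 contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def random_distribution(n, rng):
    """Draw positive weights and normalize them so the probabilities sum to one."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

n = 6
rng = random.Random(0)  # fixed seed so the run is reproducible

uniform_entropy = shannon_entropy([1.0 / n] * n)  # equals log(n)
sampled = [shannon_entropy(random_distribution(n, rng)) for _ in range(1000)]
max_sampled = max(sampled)

# No sampled distribution exceeds the entropy of the uniform distribution.
print(uniform_entropy, max_sampled)
```

A random search of this kind is only a sanity check, not a proof; the general result follows from the Lagrange multiplier condition under the normalization constraint.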
Applications
The principle of maximum entropy is commonly applied in two ways to inferential problems:
Prior probabilities
The principle of maximum entropy is often used to obtain prior probability distributions for Bayesian inference. Jaynes was a strong advocate of this approach, claiming the maximum entropy distribution represented the least informative distribution.
A large body of literature is now dedicated to the elicitation of maximum entropy priors and links with channel coding.
Posterior probabilities
Maximum entropy is a sufficient updating rule for radical probabilism. Richard Jeffrey's probability kinematics is a special case of maximum entropy inference. However, maximum entropy is not a generalisation of all such sufficient updating rules.
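The connection with Jeffrey's rule can be sketched in a few lines of Python. The example is hypothetical: the joint prior, the labels, and the revised marginal are assumptions made for illustration. When new information fixes only the probabilities of the cells of a partition, Jeffrey's rule rescales the prior within each cell and leaves the conditional probabilities inside each cell unchanged; this is also the distribution closest to the prior in Kullback–Leibler divergence among those matching the new cell probabilities.

```python
import math

# Hypothetical prior joint distribution over (partition cell, hypothesis).
prior = {
    ("rain", "late"): 0.24, ("rain", "on_time"): 0.06,
    ("dry", "late"): 0.07, ("dry", "on_time"): 0.63,
}
# New information: revised probabilities for the partition cells only.
new_marginal = {"rain": 0.7, "dry": 0.3}

def jeffrey_update(prior, new_marginal):
    """Rescale each cell so its total matches new_marginal, keeping conditionals."""
    old_marginal = {}
    for (cell, _), prob in prior.items():
        old_marginal[cell] = old_marginal.get(cell, 0.0) + prob
    return {(cell, h): prob * new_marginal[cell] / old_marginal[cell]
            for (cell, h), prob in prior.items()}

def kl_divergence(q, p):
    """Kullback-Leibler divergence sum q * log(q / p) over a shared support."""
    return sum(qv * math.log(qv / p[key]) for key, qv in q.items() if qv > 0)

posterior = jeffrey_update(prior, new_marginal)
p_late = sum(prob for (_, h), prob in posterior.items() if h == "late")

# Another distribution with the same cell probabilities, for comparison.
alternative = {
    ("rain", "late"): 0.35, ("rain", "on_time"): 0.35,
    ("dry", "late"): 0.15, ("dry", "on_time"): 0.15,
}
print(p_late, kl_divergence(posterior, prior), kl_divergence(alternative, prior))
```

In this example the updated probability of "late" is the mixture 0.7 × 0.8 + 0.3 × 0.1, and the comparison distribution, which satisfies the same constraint, lies farther from the prior.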
Maximum entropy models
Alternatively, the principle is often invoked for model specification: in this case the observed data itself is assumed to be the testable information. Such models are widely used in natural language processing. An example of such a model is logistic regression, which corresponds to the maximum entropy classifier for independent observations.
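The link between logistic regression and maximum entropy can be illustrated by the moment-matching property of the fitted model: at the maximum likelihood solution, the feature totals predicted by the model equal those observed in the data, which is the form of constraint used in maximum entropy models. The following Python sketch uses a hypothetical four-point data set and plain gradient ascent; the data, step size, and iteration count are assumptions chosen only so that the example converges.

```python
import math

# Hypothetical toy data: each row is (bias feature, x); labels are not separable.
features = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
labels = [0, 1, 0, 1]

def predict(weights, row):
    """Model probability that the label is 1 (logistic function of the score)."""
    score = sum(w * f for w, f in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-score))

def gradient(weights):
    """Gradient of the log-likelihood: sum over data of (y - p) * features."""
    grad = [0.0] * len(weights)
    for row, y in zip(features, labels):
        residual = y - predict(weights, row)
        for j, f in enumerate(row):
            grad[j] += residual * f
    return grad

# Plain gradient ascent on the (concave) log-likelihood.
weights = [0.0, 0.0]
for _ in range(5000):
    weights = [w + 0.1 * g for w, g in zip(weights, gradient(weights))]

# Feature totals observed in the data versus those expected under the model.
empirical = [sum(y * row[j] for row, y in zip(features, labels)) for j in range(2)]
expected = [sum(predict(weights, row) * row[j] for row in features) for j in range(2)]
print(empirical, expected)
```

The two printed lists agree up to numerical tolerance, because the gradient of the log-likelihood is exactly the difference between observed and expected feature totals.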
The maximum entropy principle has also been applied in economics and resource allocation. For example, the Boltzmann fair division model uses the maximum entropy (Boltzmann) distribution to allocate resources or income among individuals, providing a probabilistic approach to distributive justice.
Exponential families are an important class of probability models that can be derived using the principle of maximum entropy.
Probability density estimation
One of the main applications of the maximum entropy principle is in discrete and continuous density estimation.
Similar to support vector machine estimators, the maximum entropy principle may require the solution to a quadratic programming problem, and thus provide a sparse mixture model as the optimal density estimator. One important advantage of the method is its ability to incorporate prior information in the density estimation.
General solution for the maximum entropy distribution with linear constraints
Discrete case
We have some testable information I about a quantity x taking values in {x<sub>1</sub>, x<sub>2</sub>,..., x<sub>n</sub>}. We assume this information has the form of m constraints on the expectations of the functions f<sub>k</sub>; that is, we require our probability distribution to satisfy the moment inequality/equality constraints:
<math display="block">\sum_{i=1}^n \Pr(x_i)f_k(x_i) \geq F_k \qquad k = 1, \ldots,m.</math>
where the <math> F_k </math> are observables. We also require the probabilities to sum to one, which may be viewed as a primitive constraint on the constant function 1 with an observable equal to 1, giving the constraint
<math display="block">\sum_{i=1}^n \Pr(x_i) = 1.</math>
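The probability distribution with maximum information entropy subject to these inequality/equality constraints is of the form:
<math display="block">\Pr(x_i) = \frac{1}{Z(\lambda_1,\ldots,\lambda_m)} \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right],</math>
for some multipliers <math>\lambda_1,\ldots,\lambda_m</math>, where the normalization constant is
<math display="block">Z(\lambda_1,\ldots,\lambda_m) = \sum_{i=1}^n \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right].</math>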
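As an illustration, consider a single equality constraint: a six-sided die whose mean is required to be 4.5 (a hypothetical choice of numbers). With the one moment function f(x) = x, the maximum entropy distribution is proportional to exp(λx), and the multiplier λ must be chosen so that the constraint holds. The Python sketch below is a minimal example under these assumptions; it finds λ by bisection, which works because the mean of this distribution increases with λ. The function names and the bracketing interval are illustrative.

```python
import math

faces = [1, 2, 3, 4, 5, 6]
target_mean = 4.5  # hypothetical constraint value F_1

def gibbs_distribution(lam):
    """Return p_i = exp(lam * x_i) / Z(lam) over the die faces."""
    weights = [math.exp(lam * x) for x in faces]
    z = sum(weights)  # normalization constant Z(lam)
    return [w / z for w in weights]

def mean(p):
    """Expectation of the face value under distribution p."""
    return sum(x * pi for x, pi in zip(faces, p))

# The mean is increasing in lam, so bisection on a wide bracket finds the
# multiplier that satisfies the constraint.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean(gibbs_distribution(mid)) < target_mean:
        lo = mid
    else:
        hi = mid

lam = 0.5 * (lo + hi)
p = gibbs_distribution(lam)
print(lam, p)
```

Because the required mean exceeds the unconstrained mean of 3.5, the multiplier is positive and the probabilities increase with the face value. With several constraints the same idea applies, but the multipliers must be found jointly, for example by a multidimensional root finder.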
See also
- Akaike information criterion
- Dissipation
- Info-metrics
- Maximum entropy classifier
- Maximum entropy probability distribution
- Maximum entropy spectral estimation
- Maximum entropy thermodynamics
- Principle of maximum caliber
- Thermodynamic equilibrium
- Molecular chaos
- Boltzmann fair division
References
- Giffin, A. and Caticha, A., 2007, Updating Probabilities with Data and Moments
- Jaynes, E. T., 1986 (new version online 1996), "Monkeys, kangaroos and N", in Maximum-Entropy and Bayesian Methods in Applied Statistics, J. H. Justice (ed.), Cambridge University Press, Cambridge, p. 26.
- Kapur, J. N.; and Kesavan, H. K., 1992, Entropy Optimization Principles with Applications, Boston: Academic Press.
- Kitamura, Y., 2006, Empirical Likelihood Methods in Econometrics: Theory and Practice, Cowles Foundation Discussion Papers 1569, Cowles Foundation, Yale University.
- Owen, A. B., 2001, Empirical Likelihood, Chapman and Hall/CRC.
Further reading
- Ratnaparkhi A. (1997) "A simple introduction to maximum entropy models for natural language processing" Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania. An easy-to-read introduction to maximum entropy methods in the context of natural language processing.
