thumb|300px|A specific case of the Metropolis-Hastings algorithm in the Bayesian framework where the proposal density is a uniform prior distribution, sampling a [[Normal distribution|normal one-dimensional posterior probability distribution.]]

In statistics and statistical physics, the Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution from which direct sampling is difficult. New samples are added to the sequence in two steps: first a new sample is proposed based on the previous sample, then the proposed sample is either added to the sequence or rejected depending on the value of the probability distribution at that point. The resulting sequence can be used to approximate the distribution (e.g. to generate a histogram) or to compute an integral (e.g. an expected value).

Metropolis–Hastings and other MCMC algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high. For single-dimensional distributions, there are usually other methods (e.g. adaptive rejection sampling) that can directly return independent samples from the distribution, and these are free from the problem of autocorrelated samples that is inherent in MCMC methods.

History

The algorithm is named in part for Nicholas Metropolis, the first coauthor of a 1953 paper, entitled Equation of State Calculations by Fast Computing Machines, with Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller and Edward Teller. For many years the algorithm was known simply as the Metropolis algorithm. The paper proposed the algorithm for the case of symmetrical proposal distributions, but in 1970, W.K. Hastings extended it to the more general case. that directly generates independent samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages:

  • The samples are autocorrelated. Even though over the long term they do correctly follow <math>P(x)</math>, a set of nearby samples will be correlated with each other and not correctly reflect the distribution. This means that effective sample sizes can be significantly lower than the number of samples actually taken, leading to large errors.
  • Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a burn-in period is typically necessary, where an initial number of samples are thrown away.

On the other hand, most simple rejection sampling methods suffer from the "curse of dimensionality", where the probability of rejection increases exponentially as a function of the number of dimensions. Metropolis–Hastings, along with other MCMC methods, do not have this problem to such a degree, and thus are often the only solutions available when the number of dimensions of the distribution to be sampled is high. As a result, MCMC methods are often the methods of choice for producing samples from hierarchical Bayesian models and other high-dimensional statistical models used nowadays in many disciplines.

In multivariate distributions, the classic Metropolis–Hastings algorithm as described above involves choosing a new multi-dimensional sample point. When the number of dimensions is high, finding the suitable jumping distribution to use can be difficult, as the different individual dimensions behave in very different ways, and the jumping width (see above) must be "just right" for all dimensions at once to avoid excessively slow mixing. An alternative approach that often works better in such situations, known as Gibbs sampling, involves choosing a new sample for each dimension separately from the others, rather than choosing a sample for all dimensions at once. That way, the problem of sampling from potentially high-dimensional space will be reduced to a collection of problems to sample from small dimensionality. This is especially applicable when the multivariate distribution is composed of a set of individual random variables in which each variable is conditioned on only a small number of other variables, as is the case in most typical hierarchical models. The individual variables are then sampled one at a time, with each variable conditioned on the most recent values of all the others. Various algorithms can be used to choose these individual samples, depending on the exact form of the multivariate distribution: some possibilities are the adaptive rejection sampling methods, a simple one-dimensional Metropolis–Hastings step, or slice sampling.

Formal derivation

The purpose of the Metropolis–Hastings algorithm is to generate a collection of states according to a desired distribution <math>P(x)</math>. To accomplish this, the algorithm uses a Markov process, which asymptotically reaches a unique stationary distribution <math>\pi(x)</math> such that <math>\pi(x) = P(x)</math>. For distribution on discrete state spaces, it has to be of the order of the autocorrelation time of the Markov process.

It is important to notice that it is not clear, in a general problem, which distribution <math>g(x' \mid x)</math> one should use or the number of iterations necessary for proper estimation; both are free parameters of the method, which must be adjusted to the particular problem in hand.

Use in numerical integration

A common use of Metropolis–Hastings algorithm is to compute an integral. Specifically, consider a space <math>\Omega \subset \mathbb{R}</math> and a probability distribution <math>P(x)</math> over <math>\Omega</math>, <math>x \in \Omega</math>. Metropolis–Hastings can estimate an integral of the form of

: <math>P(E) = \int_\Omega A(x) P(x) \,dx,</math>

where <math>A(x)</math> is a (measurable) function of interest.

For example, consider a statistic <math>E(x)</math> and its probability distribution <math>P(E)</math>, which is a marginal distribution. Suppose that the goal is to estimate <math>P(E)</math> for <math>E</math> on the tail of <math>P(E)</math>. Formally, <math>P(E)</math> can be written as

: <math>

P(E) = \int_\Omega P(E\mid x) P(x) \,dx = \int_\Omega \delta\big(E - E(x)\big) P(x) \,dx = E \big(P(E\mid X)\big)

</math>

and, thus, estimating <math>P(E)</math> can be accomplished by estimating the expected value of the indicator function <math>A_E(x) \equiv \mathbf{1}_E(x)</math>, which is 1 when <math>E(x) \in [E, E + \Delta E]</math> and zero otherwise.

Because <math>E</math> is on the tail of <math>P(E)</math>, the probability to draw a state <math>x</math> with <math>E(x)</math> on the tail of <math>P(E)</math> is proportional to <math>P(E)</math>, which is small by definition. The Metropolis–Hastings algorithm can be used here to sample (rare) states more likely and thus increase the number of samples used to estimate <math>P(E)</math> on the tails. This can be done e.g. by using a sampling distribution <math>\pi(x)</math> to favor those states (e.g. <math>\pi(x) \propto e^{a E}</math> with <math>a > 0</math>).

Step-by-step instructions

thumb|300px|Three [[Markov chains running on the 3D Rosenbrock function using the Metropolis–Hastings algorithm. The chains converge and mix in the region where the function is high. The approximate position of the maximum has been illuminated. The red points are the ones that remain after the burn-in process. The earlier ones have been discarded.]]

Suppose that the most recent value sampled is <math>x_t</math>. To follow the Metropolis–Hastings algorithm, we next draw a new proposal state <math>x'</math> with probability density <math>g(x' \mid x_t)</math> and calculate a value

: <math>a = a_1 a_2,</math>

where

: <math>a_1 = \frac{P(x')}{P(x_t)}</math>

is the probability (e.g., Bayesian posterior) ratio between the proposed sample <math>x'</math> and the previous sample <math>x_t</math>, and

: <math>a_2 = \frac{g(x_t \mid x')}{g(x' \mid x_t)}</math>

is the ratio of the proposal density in two directions (from <math>x_t</math> to <math>x'</math> and conversely).

This is equal to 1 if the proposal density is symmetric.

Then the new state <math>x_{t+1}</math> is chosen according to the following rules.

: If <math>a \geq 1{:}</math>

:: <math>x_{t+1} = x',</math>

: else:

:: <math>x_{t+1} =

\begin{cases}

x' & \text{with probability } a, \\

x_t & \text{with probability } 1-a.

\end{cases}

</math>

The Markov chain is started from an arbitrary initial value <math>x_0</math>, and the algorithm is run for many iterations until this initial state is "forgotten". These samples, which are discarded, are known as burn-in. The remaining set of accepted values of <math>x</math> represent a sample from the distribution <math>P(x)</math>.

The algorithm works best if the proposal density matches the shape of the target distribution <math>P(x)</math>, from which direct sampling is difficult, that is <math>g(x' \mid x_t) \approx P(x')</math>.

If a Gaussian proposal density <math>g</math> is used, the variance parameter <math>\sigma^2</math> has to be tuned during the burn-in period.

This is usually done by calculating the acceptance rate, which is the fraction of proposed samples that is accepted in a window of the last <math>N</math> samples.

The desired acceptance rate depends on the target distribution, however it has been shown theoretically that the ideal acceptance rate for a one-dimensional Gaussian distribution is about 50%, decreasing to about 23% for an <math>N</math>-dimensional Gaussian target distribution.

If <math>\sigma^2</math> is too small, the chain will mix slowly (i.e., the acceptance rate will be high, but successive samples will move around the space slowly, and the chain will converge only slowly to <math>P(x)</math>). On the other hand,

if <math>\sigma^2</math> is too large, the acceptance rate will be very low because the proposals are likely to land in regions of much lower probability density, so <math>a_1</math> will be very small, and again the chain will converge very slowly. One typically tunes the proposal distribution so that the algorithms accepts on the order of 30% of all samples – in line with the theoretical estimates mentioned in the previous paragraph.

Bayesian inference

MCMC can be used to draw samples from the posterior distribution of a statistical model.

The acceptance probability is given by:

<math>P_{acc}(\theta_i \to \theta^*)=\min\left(1, \frac{\mathcal{L}(y|\theta^*)P(\theta^*)}{\mathcal{L}(y|\theta_i)P(\theta_i)}\frac{Q(\theta_i|\theta^*)}{Q(\theta^*|\theta_i)}\right),</math>

where <math>\mathcal{L}</math> is the likelihood, <math>P(\theta)</math> the prior probability density and <math>Q</math> the (conditional) proposal probability.

See also

  • Genetic algorithms
  • Mean-field particle methods
  • Metropolis light transport
  • Multiple-try Metropolis
  • Parallel tempering
  • Sequential Monte Carlo
  • Simulated annealing

References

Notes

Further reading