thumb|alt=A graphical representation of an example Boltzmann machine.| A graphical representation of an example Boltzmann machine. Each undirected edge represents dependency. In this example there are 3 hidden units and 4 visible units. This is not a restricted Boltzmann machine.
A Boltzmann machine (also called Sherrington–Kirkpatrick model with external field or stochastic Ising model), named after Ludwig Boltzmann, is a spin-glass model with an external field, i.e., a Sherrington–Kirkpatrick model, that is a stochastic Ising model. It is a statistical physics technique applied in the context of cognitive science. It is also classified as a Markov random field.
Boltzmann machines are theoretically intriguing because of the locality and Hebbian nature of their training algorithm (being trained by Hebb's rule), and because of their parallelism and the resemblance of their dynamics to simple physical processes. Boltzmann machines with unconstrained connectivity have not been proven useful for practical problems in machine learning or inference, but if the connectivity is properly constrained, the learning can be made efficient enough to be useful for practical problems.
They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function. They were heavily popularized and promoted by Geoffrey Hinton, Terry Sejnowski and Yann LeCun in cognitive sciences communities, particularly in machine learning,
Structure
thumb|right|alt=A graphical representation of an example Boltzmann machine with weight labels.| A graphical representation of a Boltzmann machine with a few weights labeled. Each undirected edge represents dependency and is weighted with weight <math>w_{ij}</math>. In this example there are 3 hidden units (blue) and 4 visible units (white). This is not a restricted Boltzmann machine.
A Boltzmann machine, like a Sherrington–Kirkpatrick model, is a network of units with a total "energy" (Hamiltonian) defined for the overall network. Its units produce binary results. Boltzmann machine weights are stochastic. The global energy <math>E</math> in a Boltzmann machine is identical in form to that of Hopfield networks and Ising models:
:<math>E = -\left(\sum_{i<j} w_{ij} \, s_i \, s_j + \sum_i \theta_i \, s_i \right)</math>
Where:
- <math>w_{ij}</math> is the connection strength between unit <math>j</math> and unit <math>i</math>.
- <math>s_i</math> is the state, <math>s_i \in \{0,1\}</math>, of unit <math>i</math>.
- <math>\theta_i</math> is the bias of unit <math>i</math> in the global energy function. (<math>-\theta_i</math> is the activation threshold for the unit.)
Often the weights <math>w_{ij}</math> are represented as a symmetric matrix <math>W=[w_{ij}]</math> with zeros along the diagonal.
Unit state probability
The difference in the global energy that results from a single unit <math>i</math> equaling 0 (off) versus 1 (on), written <math>\Delta E_i</math>, assuming a symmetric matrix of weights, is given by:
:<math>\Delta E_i = \sum_{j>i} w_{ij} \, s_j + \sum_{j<i} w_{ji} \, s_j + \theta_i</math>
This can be expressed as the difference of energies of two states:
:<math>\Delta E_i = E_\text{i=off} - E_\text{i=on}</math>
Substituting the energy of each state with its relative probability according to the Boltzmann factor
(the property of a Boltzmann distribution that the energy of a state is proportional to the negative log probability of that state)
yields:
:<math>
\Delta E_{i}
= -k_{B} T \ln(p_\text{i=off})
- (-k_{B} T \ln(p_\text{i=on})),
</math>
where <math>k_{B}</math> is the Boltzmann constant and is absorbed into the artificial notion of temperature <math>T</math>.
Noting that the probabilities of the unit being on or off sum to <math>1</math> allows for the simplification:
:<math>
-\frac{\Delta E_{i{k_{B}T}
= -\ln(p_{i=\text{on) + \ln(p_{i=\text{off)
= \ln\Big(\frac{1 - p_{i=\text{on}{p_{i=\text{on}\Big)
= \ln(p_{i=\text{on^{-1} - 1),
</math>
whence the probability that the <math>i</math>-th unit is given by
:<math>p_{i=\text{on = \frac{1}{1+\exp\Big(-\frac{\Delta E_{i{k_{B}T}\Big)},</math>
where the scalar <math>T</math> is referred to as the temperature of the system.
This relation is the source of the logistic function found in probability expressions in variants of the Boltzmann machine.
Equilibrium state
The network runs by repeatedly choosing a unit and resetting its state. After running for long enough at a certain temperature, the probability of a global state of the network depends only upon that global state's energy, according to a Boltzmann distribution, and not on the initial state from which the process was started. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability distribution of global states has converged. Running the network beginning from a high temperature, its temperature gradually decreases until reaching a thermal equilibrium at a lower temperature. It then may converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing.
To train the network so that it will converge to a global state according to an external distribution over these states, the weights must be set so that the global states with the highest probabilities get the lowest energies. This is done by training.
Training
The units in the Boltzmann machine are divided into 'visible' units, V, and 'hidden' units, H. The visible units are those that receive information from the 'environment', i.e. the training set is a set of binary vectors over the set V. The distribution over the training set is denoted <math>P^{+}(V)</math>.
The distribution over global states converges as the Boltzmann machine reaches thermal equilibrium. We denote this distribution, after we marginalize it over the hidden units, as <math>P^{-}(V)</math>.
Our goal is to approximate the "real" distribution <math>P^{+}(V)</math> using the <math>P^{-}(V)</math> produced by the machine. The similarity of the two distributions is measured by the Kullback–Leibler divergence, <math>G</math>:
:<math>G = \sum_{v}{P^{+}(v)\ln\left({\frac{P^{+}(v)}{P^{-}(v)\right)}</math>
where the sum is over all the possible states of <math>V</math>. <math>G</math> is a function of the weights, since they determine the energy of a state, and the energy determines <math>P^{-}(v)</math>, as promised by the Boltzmann distribution. A gradient descent algorithm over <math>G</math> changes a given weight, <math>w_{ij}</math>, by subtracting the partial derivative of <math>G</math> with respect to the weight.
Boltzmann machine training involves two alternating phases. One is the "positive" phase where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to <math>P^{+}</math>). The other is the "negative" phase where the network is allowed to run freely, i.e. only the input nodes have their state determined by external data, but the output nodes are allowed to float. The gradient with respect to a given weight, <math>w_{ij}</math>, is given by the equation:
One example of a practical RBM application is in speech recognition.
Deep Boltzmann machine
A deep Boltzmann machine (DBM) is a type of binary pairwise Markov random field (undirected probabilistic graphical model) with multiple layers of hidden random variables. It is a network of symmetrically coupled stochastic binary units. It comprises a set of visible units <math>\boldsymbol{\nu} \in \{0,1\}^D</math> and layers of hidden units <math>\boldsymbol{h}^{(1)} \in \{0,1\}^{F_1}, \boldsymbol{h}^{(2)} \in \{0,1\}^{F_2}, \ldots, \boldsymbol{h}^{(L)} \in \{0,1\}^{F_L}</math>. No connection links units of the same layer (like RBM). For the , the probability assigned to vector is
: <math>p(\boldsymbol{\nu}) = \frac{1}{Z}\sum_h e^{\sum_{ij}W_{ij}^{(1)}\nu_i h_j^{(1)} + \sum_{jl}W_{jl}^{(2)}h_j^{(1)}h_l^{(2)}+\sum_{lm}W_{lm}^{(3)}h_l^{(2)}h_m^{(3),</math>
where <math>\boldsymbol{h} = \{\boldsymbol{h}^{(1)}, \boldsymbol{h}^{(2)}, \boldsymbol{h}^{(3)} \}</math> are the set of hidden units, and <math>\theta = \{\boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)}, \boldsymbol{W}^{(3)} \} </math> are the model parameters, representing visible-hidden and hidden-hidden interactions. In a DBN only the top two layers form a restricted Boltzmann machine (which is an undirected graphical model), while lower layers form a directed generative model. In a DBM all layers are symmetric and undirected.
Like DBNs, DBMs can learn complex and abstract internal representations of the input in tasks such as object or speech recognition, using limited, labeled data to fine-tune the representations built using a large set of unlabeled sensory input data. However, unlike DBNs and deep convolutional neural networks, they pursue the inference and training procedure in both directions, bottom-up and top-down, which allow the DBM to better unveil the representations of the input structures.
However, the slow speed of DBMs limits their performance and functionality. Because exact maximum likelihood learning is intractable for DBMs, only approximate maximum likelihood learning is possible. Another option is to use mean-field inference to estimate data-dependent expectations and approximate the expected sufficient statistics by using Markov chain Monte Carlo (MCMC). Similar to basic RBMs and its variants, a spike-and-slab RBM is a bipartite graph, while like GRBMs, the visible units (input) are real-valued. The difference is in the hidden layer, where each hidden unit has a binary spike variable and a real-valued slab variable. A spike is a discrete probability mass at zero, while a slab is a density over continuous domain; their mixture forms a prior.
An extension of ssRBM called μ-ssRBM provides extra modeling capacity using additional terms in the energy function. One of these terms enables the model to form a conditional distribution of the spike variables by marginalizing out the slab variables given an observation.
In mathematics
In more general mathematical setting, the Boltzmann distribution is also known as the Gibbs measure. In statistics and machine learning it is called a log-linear model. In deep learning the Boltzmann distribution is used in the sampling distribution of stochastic neural networks such as the Boltzmann machine.
History
The Boltzmann machine is based on the Sherrington–Kirkpatrick spin glass model by David Sherrington and Scott Kirkpatrick. The seminal publication by John Hopfield (1982) applied methods of statistical mechanics, mainly the recently developed (1970s) theory of spin glasses, to study associative memory (later named the "Hopfield network").
The original contribution in applying such energy-based models in cognitive science appeared in papers by Geoffrey Hinton and Terry Sejnowski. In a 1995 interview, Hinton stated that in 1983 February or March, he was going to give a talk on simulated annealing in Hopfield networks, so he had to design a learning algorithm for the talk, resulting in the Boltzmann machine learning algorithm.
The idea of applying the Ising model with annealed Gibbs sampling was used in Douglas Hofstadter's Copycat project (1984).
The explicit analogy drawn with statistical mechanics in the Boltzmann machine formulation led to the use of terminology borrowed from physics (e.g., "energy"), which became standard in the field. The widespread adoption of this terminology may have been encouraged by the fact that its use led to the adoption of a variety of concepts and methods from statistical mechanics. The various proposals to use simulated annealing for inference were apparently independent.
Similar ideas (with a change of sign in the energy function) are found in Paul Smolensky's "Harmony Theory". Ising models can be generalized to Markov random fields, which find widespread application in linguistics, robotics, computer vision and artificial intelligence.
In 2024, Hopfield and Hinton were awarded Nobel Prize in Physics for their foundational contributions to machine learning, such as the Boltzmann machine.
See also
- Restricted Boltzmann machine
- Helmholtz machine
- Markov random field (MRF)
- Ising model (Lenz–Ising model)
- Hopfield network
References
Further reading
- Kothari P (2020): https://www.forbes.com/sites/tomtaulli/2020/02/02/coronavirus-can-ai-artificial-intelligence-make-a-difference/?sh=1eca51e55817
External links
- Scholarpedia article by Hinton about Boltzmann machines
- Talk at Google by Geoffrey Hinton
