In statistical mechanics, sampling from the Boltzmann distribution is one of those basic principles that quietly sits underneath almost everything we model at equilibrium. It's how you characterize the distribution of microscopic states, estimate thermodynamic averages, and connect the energy landscape to measurable observables like temperature, pressure, composition, etc. Numerically, there are multiple ways to draw samples that represent the equilibrium state. Two of the most instructive ways are:
- Use Metropolis–Hastings to construct Markov chains that have the Boltzmann distribution as a stationary distribution.
- Use Gibbs sampling as a conditional and local way of doing the same thing.
The idea for this post is to give some context for these two ways of sampling, how they relate, and some general background around why the Boltzmann distribution is both fundamental and annoying to work with in practice.
Energy-Based Models (EBMs)
An energy-based model is simply a system whose probability distribution is defined through its energy function. For a thermodynamic system with degrees of freedom $x$, the probability of a configuration $x$ in the canonical ensemble is given by the Boltzmann distribution:
$$ \begin{equation} P(x) = \frac{e^{-\beta E(x)}}{Z}, \qquad Z = \int e^{-\beta E(x)} \, dx, \qquad \beta = \frac{1}{k_B T}. \label{eq:boltzmann} \end{equation} $$
Here, $E(x)$ is the microscopic energy of configuration $x$, $T$ is temperature, and $Z$ (the partition function[^1]) ensures normalization. For discrete microstates you would replace the integral by a sum. In proper Hamiltonian mechanics you would also need to worry about how we measure phase space, but I'm ignoring that here.
What is important is that ratios of probabilities are set by energy differences:
$$ \frac{P(x')}{P(x)} = \exp\big[-\beta \big(E(x') - E(x)\big)\big] = e^{-\beta \Delta E}. $$
So low energy configurations are exponentially more probable than higher energy ones. For example, if $\Delta E = E(x') - E(x) = k_B T \ln 1000$, then $P(x')/P(x) = 10^{-3}$. This captures the essence of the canonical ensemble, that is, probability ratios are determined solely by energy differences.
There's also a lot of interest in machine learning in using EBMs as world models and for self-supervised learning [1]. In that context you again define an energy $E_\theta(x)$, but the partition function is replaced by various approximate tricks and training objectives, because you almost never compute $Z$ exactly there either.
Equilibrium and the Boltzmann Distribution
When a system is in equilibrium its probability distribution over microstates no longer changes in time. If $P(x,t)$ evolves according to some stochastic dynamics, for example the master equation, Fokker–Planck, Langevin, Glauber dynamics, etc., then at equilibrium $\frac{\partial P(x,t)}{\partial t} = 0$.
From a fluctuation–dissipation viewpoint, the microscopic probability fluxes cancel on average and macroscopic observables become stationary quantities. If we take a system coupled to a heat reservoir at fixed $T$, then the stationary distribution of microstates converges to the Boltzmann distribution in Eq. \ref{eq:boltzmann} [6].
Estimating Expectations
The expectation value typically corresponds to the ensemble average value of an observable quantity. It's the value one would observe if one could ergodically sample from the system for a very long time. Once $P(x)$ is known, the expectation value of any observable $A(x)$ is
$$ \langle A \rangle = \int A(x)\, P(x) \, dx = \frac{\int A(x)\, e^{-\beta E(x)} \, dx}{\int e^{-\beta E(x)} \, dx}. $$
In practice we estimate the expectation value by sampling configurations $x_i \sim P(x)$ and then taking the sample average as $\langle A \rangle \approx \frac{1}{N} \sum_{i=1}^{N} A(x_i)$. The ergodic hypothesis is what lets us identify this time or trajectory average with the ensemble average, at least if the dynamics explores the full relevant region of phase space.
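As a concrete sketch (the helper name `estimate_expectation` is mine, and it assumes you already have configurations $x_i \sim P(x)$ from some sampler, like the ones discussed below):

```python
import numpy as np

def estimate_expectation(A, samples):
    """Monte Carlo estimate <A> ~ (1/N) sum_i A(x_i) for samples x_i ~ P(x)."""
    values = np.array([A(x) for x in samples])
    # Naive standard error; correlated MCMC samples need a correction
    # (e.g. blocking or an autocorrelation-time estimate).
    return values.mean(), values.std(ddof=1) / np.sqrt(len(values))

# Example: mean energy for E(x) = x**2 / 2 at beta = 1, where exact
# Boltzmann samples happen to be standard normal draws.
rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)
mean_E, err = estimate_expectation(lambda x: 0.5 * x**2, samples)
# mean_E is ~0.5, consistent with equipartition (<E> = kT/2 in 1D)
```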
The Boltzmann Distribution
How did we end up with the Boltzmann distribution in the first place? It is, of course, named after Ludwig Boltzmann. In 1868 he published a paper deriving the equilibrium velocity distribution and the general structure of what we now call the Boltzmann distribution for a kinetic model of a gas [2]. The resulting distribution expresses the probability of microstate configurations as given in Eq. $\ref{eq:boltzmann}$.
With eq. $\ref{eq:boltzmann}$, low energy states dominate while high energy states fade exponentially. Temperature acts as the control parameter that redistributes these probabilities:
- $T \to 0$: the system collapses into ground states and very low-lying minima.
- $T \to \infty$: the system explores accessible states almost uniformly.
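A quick way to see this numerically (a sketch with made-up energy levels, in units where $k_B = 1$):

```python
import numpy as np

def boltzmann_probs(energies, T):
    """Normalized Boltzmann weights for a small discrete set of levels."""
    w = np.exp(-np.asarray(energies) / T)  # k_B = 1
    return w / w.sum()

levels = [0.0, 1.0, 2.0]                   # made-up energy levels
print(boltzmann_probs(levels, T=0.1))      # ~[1.0, 4.5e-05, 2.1e-09]: ground state dominates
print(boltzmann_probs(levels, T=100.0))    # ~[0.337, 0.333, 0.330]: nearly uniform
```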
The practical challenge is that the partition function, $Z$, involves a sum or integral over an astronomically large number of configurations for most realistic systems, because of the sheer number of degrees of freedom. So direct evaluation or naive summation/integration is basically impossible beyond simple toy models. This is where Markov chain Monte Carlo and Gibbs sampling come in.
Markov Chains and Sampling
To get around the partition function $Z$, we can construct a Markov chain[^3] whose stationary distribution is the Boltzmann distribution $P(x)$. The trick is that the transition rules only need ratios of probabilities like $P(x')/P(x)$, so $Z$ cancels nicely.
The basic idea is:
- Build an update rule that depends on the current state $x$ and some proposal mechanism.
- Make sure the probability of moving from state $x$ to $x'$ matches the probability of moving back, weighted by their relative probabilities under $P(x)$.
- Ensure that the chain is ergodic and aperiodic.
Over long times, the sequence of visited states should reproduce the Boltzmann distribution. So the game isn't to write $P(x)$ in closed form, but to engineer a Markov chain that has the correct stationary distribution. So how can we do this? Two common approaches are Metropolis–Hastings and Gibbs sampling.
Metropolis–Hastings: Energy-Based Acceptance
The Metropolis–Hastings algorithm feels like an experimental perturbation process [3,4]: we start with some initial setup, $x$, propose a new state, $x'$, and then decide whether to accept or reject the proposal based on how it changes the energy. The steps to implement are:
- Start with $x$ and propose a move $x \to x'$ from some proposal distribution $q(x' \mid x)$.
- Compute the energy difference $\Delta E = E(x') - E(x)$.
- Accept the move with probability $$ \begin{equation} p_{\text{acc}}(x \to x') = \min\left(1, e^{-\beta \Delta E} \frac{q(x \mid x')}{q(x' \mid x)}\right). \label{eq:metropolis} \end{equation} $$
For symmetric proposals where $q(x' \mid x) = q(x \mid x')$, that is the reverse is equally likely as the forward, this reduces to,
$$ p_{\text{acc}}(x \to x') = \min\left(1, e^{-\beta \Delta E}\right). $$
If the move lowers the energy then it is always accepted. If it raises the energy, it may still occur with Boltzmann-weighted probability[^4]. This acceptance rule enforces detailed balance with respect to Eq. \ref{eq:boltzmann}. Under standard ergodicity and aperiodicity conditions, the chain converges to the Boltzmann distribution as the unique stationary distribution [3,4].
The pain point with Metropolis–Hastings is that in high dimensional systems, many naive proposals (i.e. transition moves) are rejected and the chain diffuses slowly through configuration space. This means long simulation times despite the equilibrium distribution being formally correct. So getting good sampling in a reasonable amount of wall clock time can be challenging if not impossible.
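To make the recipe concrete, here is a minimal sketch of Metropolis–Hastings on a one-dimensional continuous state with a symmetric Gaussian proposal (the double-well energy, step size, and function names are my own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(energy, x0, n_steps=50_000, step=0.5, beta=1.0):
    """MH with a symmetric Gaussian proposal, so p_acc = min(1, exp(-beta * dE))."""
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = x + step * rng.normal()        # propose x -> x'
        dE = energy(x_new) - energy(x)         # energy difference
        if rng.random() < np.exp(-beta * dE):  # downhill moves always accepted
            x = x_new
        samples.append(x)                      # on rejection the old x repeats
    return np.array(samples)

# Example: a double-well energy E(x) = (x^2 - 1)^2 with minima at x = +/-1
samples = metropolis_hastings(lambda x: (x**2 - 1) ** 2, x0=0.0)
```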
Gibbs Sampling: Conditional Equilibrium
One very useful alternative is Gibbs[^5] sampling [5]. Instead of proposing global moves, Gibbs sampling uses local conditional updates and draws from exact conditional distributions whenever those are tractable. The key difference from Metropolis–Hastings is that Gibbs sampling doesn't propose and then accept/reject; it samples directly from the conditional distribution, which already has the correct Boltzmann weighting built in. For example, on a spin lattice, instead of flipping a spin and then deciding whether to accept the flip based on the energy change (as Metropolis–Hastings does), Gibbs sampling directly picks the new spin value with the correct probability given its neighbors. Since we're sampling from the right distribution from the start, every update is automatically accepted.
How do we get the exact probability? If we take a spin site with 4 neighbors, the energy interaction depends only on those neighbors. We can compute the Boltzmann probability for each possible spin value given those interactions, then normalize. That's it: just a few local energy terms, not the entire system.
```
        +1
        |
  +1 -- X -- +1
        |
        +1
```
For the center spin $X$ with all neighbors $+1$ as shown above, the energy contribution is:
$$ \begin{align} E(X=+1) &= -J\big[(+1)(+1) + (+1)(+1) + (+1)(+1) + (+1)(+1)\big] \nonumber \\ &= -J(4) = -4J \nonumber \end{align} $$
$$ \begin{align} E(X=-1) &= -J\big[(-1)(+1) + (-1)(+1) + (-1)(+1) + (-1)(+1)\big] \nonumber \\ &= -J(-4) = 4J \nonumber \end{align} $$
Then we get the unnormalized probabilities:
$$ P(X=+1) \propto e^{-\beta E(X=+1)} = e^{4\beta J}, \qquad P(X=-1) \propto e^{-\beta E(X=-1)} = e^{-4\beta J} $$
To normalize, we divide by the sum:
$$ P(X=+1) = \frac{e^{4\beta J}}{e^{4\beta J} + e^{-4\beta J}}, \qquad P(X=-1) = \frac{e^{-4\beta J}}{e^{4\beta J} + e^{-4\beta J}} $$
Now $P(X=+1) + P(X=-1) = 1$ and we can sample from this distribution. The aligned state ($X=+1$) is exponentially more probable than the anti-aligned state. As you can see, we are conditioning on the neighbors; that is, we are sampling from the conditional distribution $P(X \mid X_1, X_2, X_3, X_4)$.
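As a quick illustration (a sketch, with the function name and defaults being my own choices), this conditional reduces to a logistic function of the local field:

```python
import numpy as np

def conditional_spin_prob(neighbors, beta=1.0, J=1.0):
    """P(X = +1 | neighbors) for the local energy E(X) = -J * X * sum(neighbors)."""
    h = J * np.sum(neighbors)  # local field from the neighboring spins
    # exp(beta*h) / (exp(beta*h) + exp(-beta*h)) == 1 / (1 + exp(-2*beta*h))
    return 1.0 / (1.0 + np.exp(-2.0 * beta * h))

p = conditional_spin_prob([+1, +1, +1, +1])  # four aligned neighbors
# p == e^{4 beta J} / (e^{4 beta J} + e^{-4 beta J}) ~ 0.9997 at beta = J = 1
```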
So now if we have the energy in terms of many degrees of freedom, $E(x_1, x_2, \ldots, x_N)$, with Gibbs sampling we can update one variable at a time by drawing from its conditional distribution given all the others:
$$ \begin{equation} x_i^{(t+1)} \sim P(x_i \mid x_{-i}^{(t)}) \propto \exp\left[-\beta E\big(x_i, x_{-i}^{(t)}\big)\right]. \label{eq:gibbs} \end{equation} $$
Here $x_{-i}$ means all degrees of freedom except $x_i$. The proportionality indicates that you still have to normalize over the allowed values of $x_i$, but you never need the global partition function $Z$, which is a big savings.
To recap, each degree of freedom is sampled from its local equilibrium distribution. There are no proposals to reject, so every update is "accepted" simply by construction of the conditional sampling. In many lattice models, the Hamiltonian is a sum of local terms, which means the conditional $P(x_i \mid x_{-i})$ depends only on neighboring variables. So the cost of an update is dominated by very local energy evaluations, not global ones.
In that sense, Gibbs sampling is Boltzmann sampling by conditional decomposition[^2]. It's exact when those conditionals can be written down and sampled efficiently (i.e., the interactions are local), which is the case for a fair number of models like the Ising model, Potts model, and more general Gibbs random fields in image analysis.
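As a sketch of what this looks like on a lattice (the periodic 16x16 lattice, $\beta = 0.4$, and the helper names are my own choices), one Gibbs sweep resamples each spin from its local conditional:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(spins, beta=0.4, J=1.0):
    """One pass of single-site Gibbs updates on a periodic L x L Ising lattice."""
    L = spins.shape[0]
    for i in range(L):
        for j in range(L):
            # Only the four neighbors enter the conditional P(s_ij | rest).
            h = J * (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
                     + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
            p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * h))  # P(s_ij = +1 | neighbors)
            spins[i, j] = 1 if rng.random() < p_up else -1
    return spins

# Random initial configuration; repeated sweeps relax toward Boltzmann equilibrium.
spins = rng.choice([-1, 1], size=(16, 16))
for _ in range(100):
    spins = gibbs_sweep(spins)
```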
So Boltzmann or Gibbs?
The title was a bit of a mismatch, since the Boltzmann distribution is the underlying thing we are trying to sample from, so Boltzmann and Gibbs aren't in competition in any meaningful sense. The main challenge with trying to use the analytical Boltzmann distribution is that we can't meaningfully determine the partition function for anything beyond toy systems. So what we do instead is:
- Take the Boltzmann distribution as the target equilibrium distribution.
- Construct Markov chains whose stationary distribution is exactly that Boltzmann measure.
- Use different update strategies to actually explore the state space.
Both Metropolis–Hastings and Gibbs sampling are Markov chain Monte Carlo (MCMC) methods that construct Markov chains converging to the same Boltzmann distribution. They differ fundamentally in their update mechanisms:
Metropolis–Hastings proceeds by proposing candidate states from a proposal distribution $q(x' \mid x)$ and then accepting or rejecting them based on the energy difference. Gibbs sampling exploits conditional probability structure by directly sampling from the exact local distributions $P(x_i \mid x_{-i})$; as a result, the sampled outcome is the update. In systems with local interactions, this makes the process efficient because it avoids global energy evaluations. This means no wasted proposals as in Metropolis–Hastings, but it's again only practical when the conditional distributions can be computed efficiently. So Gibbs isn't replacing Boltzmann. It's helping Boltzmann by providing a conditional route to sampling from the same equilibrium distribution.
Gibbs as a special case
Gibbs sampling can be viewed as a special case of Metropolis–Hastings where the proposal distribution is chosen to be the exact conditional $P(x_i \mid x_{-i})$. In that case the Metropolis–Hastings acceptance probability is identically one, so every proposal is accepted and the chain still converges to the same Boltzmann (Gibbs) distribution.
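To see why the acceptance is one, plug the conditional proposal $q(x' \mid x) = P(x_i' \mid x_{-i})$ into the acceptance rule in Eq. \ref{eq:metropolis}, using $e^{-\beta \Delta E} = P(x')/P(x)$ and the fact that only $x_i$ changes (so $x'_{-i} = x_{-i}$ and $P(x) = P(x_i \mid x_{-i})\, P(x_{-i})$):

$$ p_{\text{acc}}(x \to x') = \min\left(1, \frac{P(x_i' \mid x_{-i})\, P(x_{-i})}{P(x_i \mid x_{-i})\, P(x_{-i})} \cdot \frac{P(x_i \mid x_{-i})}{P(x_i' \mid x_{-i})}\right) = \min(1, 1) = 1. $$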
Both Metropolis–Hastings and Gibbs sampling should approach Eq. $\ref{eq:boltzmann}$ as the number of steps increases. Again they only differ in how they traverse state space and in which cases they are practical.
| Method | Update rule | Uses proposals? | Acceptance step | Typical use |
|---|---|---|---|---|
| Metropolis–Hastings | Random move + accept/reject | Yes | Yes, via $p_{\text{acc}}(x \to x')$ | Very general purpose, can wrap many proposal types |
| Gibbs Sampling | Exact conditional resampling | No | Always accepted | Structured models with tractable conditionals |
Numerical Example
Below is a simple example for a 2-site Ising model with 4 possible microstates. The energy function for a configuration is given by:
$$ E(x) = -J x_1 x_2 $$
where $J$ is the interaction strength and $x_i \in \{-1, +1\}$ is the spin state at site $i$. We set $J = 1$ and $T = 1$ (so $\beta = 1$). The four possible microstate configurations are:
$$ \begin{align} x_1 &= 1, \quad x_2 = 1 \nonumber \\ x_1 &= 1, \quad x_2 = -1 \nonumber \\ x_1 &= -1, \quad x_2 = 1 \nonumber \\ x_1 &= -1, \quad x_2 = -1 \nonumber \end{align} $$
The Python code below uses both the Metropolis–Hastings and Gibbs sampling algorithms to sample from this system:
```python
import numpy as np
from prettytable import PrettyTable, MARKDOWN

beta, J = 1.0, 1.0
states = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]])

def Ising(s):
    return -J * s[0] * s[1]  # Energy of a 2-site configuration

def boltzmann(E):
    """Exact Boltzmann probabilities over the four microstates."""
    p = np.exp([-beta * E(s) for s in states])
    return p / p.sum()

def metropolis(E, n=10000):
    """Single-spin flips, accepted with probability min(1, exp(-beta * dE))."""
    s = np.array([1, 1]); out = []
    for _ in range(n):
        i = np.random.randint(2)    # pick a random site
        s2 = s.copy(); s2[i] *= -1  # propose flipping it
        if np.random.rand() < np.exp(-beta * (E(s2) - E(s))):
            s = s2                  # accept the move
        out.append(s.copy())
    return np.array(out)

def gibbs(E, n=10000):
    """Resample each spin from P(s_i | s_j) proportional to exp(beta * J * s_i * s_j)."""
    s = np.array([1, 1]); out = []
    for _ in range(n):
        for i in (0, 1):
            p_up = np.exp(beta * J * s[1 - i])
            p_dn = np.exp(-beta * J * s[1 - i])
            s[i] = +1 if np.random.rand() < p_up / (p_up + p_dn) else -1
        out.append(s.copy())
    return np.array(out)

def tally(S):
    """Empirical frequency of each of the four microstates."""
    c = np.zeros(4)
    for s in S:
        for k, st in enumerate(states):
            if (s == st).all():
                c[k] += 1
    return c / len(S)

tbl = PrettyTable(["Microstate", "Exact", "Metropolis", "Gibbs"])
tbl.set_style(MARKDOWN)
results = np.vstack([boltzmann(Ising),
                     tally(metropolis(Ising)),
                     tally(gibbs(Ising))]).T
for i, row in enumerate(results):
    tbl.add_row([i + 1] + [f"{x:.3f}" for x in row])
print(tbl)
```
Which gives the following table:
| Microstate | Exact | Metropolis | Gibbs |
|---|---|---|---|
| 1 | 0.440 | 0.438 | 0.445 |
| 2 | 0.060 | 0.062 | 0.060 |
| 3 | 0.060 | 0.057 | 0.062 |
| 4 | 0.440 | 0.443 | 0.433 |
As you can see, both the Metropolis–Hastings and Gibbs sampling algorithms converge fairly well to the exact Boltzmann distribution, with the sampled probabilities matching the theoretical values quite closely. If you run longer chains (more samples), you'll see even closer agreement.
Footnotes
[^1]: The partition function is the sum (or integral) of $e^{-\beta E(x)}$ over all microscopic states of the system. For anything beyond very small or highly symmetric systems this is a combinatorial explosion and cannot be evaluated exactly. In practice we work with ratios of probabilities where $Z$ cancels, or we approximate it using specialized methods. Conceptually, $Z$ is the normalization constant that turns $e^{-\beta E(x)}$ into a proper probability distribution.

[^2]: Conditional decomposition just means writing a joint distribution in terms of conditionals. For example, a generic factorization is $P(x_1,\ldots,x_N) = \prod_{i=1}^N P(x_i \mid x_1,\ldots,x_{i-1})$. In Gibbs sampling we exploit the fact that for Gibbs measures of the form $P(x) \propto e^{-\beta E(x)}$ with local interactions, the conditionals $P(x_i \mid x_{-i})$ depend only on a local neighborhood, so they can be computed and sampled efficiently. Gibbs sampling then builds a Markov chain by resampling these local conditionals one variable (or one block) at a time.

[^3]: A Markov chain is a sequence of random variables $X_0, X_1, X_2, \ldots$ such that the probability of transitioning from state $X_i$ to state $X_{i+1}$ depends only on the current state $X_i$ and not on the previous states $X_0, X_1, \ldots, X_{i-1}$. In other words, we do not require full knowledge of the trajectory/history of the system to determine the future state(s).

[^4]: This just means that we weight the probability of the move occurring despite not lowering the energy by the factor $e^{-\beta \Delta E}$.

[^5]: Gibbs is the surname of Josiah Willard Gibbs, who was a major figure in statistical mechanics and thermodynamics in the late 19th century.
References
[1] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, F. Huang, "A Tutorial on Energy-Based Learning," in Predicting Structured Data, MIT Press (2006).
[2] L. Boltzmann, "Studien über das Gleichgewicht der lebendigen Kraft zwischen bewegten materiellen Punkten," Sitzungsberichte der Akademie der Wissenschaften zu Wien 58, 517–560 (1868).
[3] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, E. Teller, "Equation of State Calculations by Fast Computing Machines," Journal of Chemical Physics 21, 1087–1092 (1953).
[4] W. K. Hastings, "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika 57, 97–109 (1970).
[5] S. Geman, D. Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6(6), 721–741 (1984).
[6] L. D. Landau, E. M. Lifshitz, Statistical Physics, 3rd ed., Course of Theoretical Physics Vol. 5, Pergamon Press, Oxford (1980).