Markov chain Monte Carlo sampling

Taken from coursework for ECE 7751: Graphical Models in Machine Learning, taught by Faramarz Fekri at Georgia Tech, Spring 2023

Inverse CDF sampling

A simple sampling method adopted by many of the standard math libraries is the inverse probability transform: draw $u \sim \text{Unif}(0, 1)$, then draw $x\sim F^{-1}(u)$, where $F^{-1}$ is the inverse of the CDF $F$. Show that $x$ is distributed according to $F$. What is the drawback of this method?


We let $U$ represent the uniform random variable and start with the fact that $$p(U \leq u) = u$$

To show that $x$ is distributed according to $F$, it suffices to show that $p(F^{-1}(U) \leq x) = F(x)$.


$$ \newcommand{\EE}{\mathbb{E}} \newcommand{\ind}{\mathbb{1}} \newcommand{\answertext}[1]{\textcolor{Green}{\fbox{#1}}} \newcommand{\answer}[1]{\answertext{$#1$}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\comment}[1]{\textcolor{gray}{\textrm{#1}}} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\inv}[1]{\frac{1}{#1}} \newcommand{\abs}[1]{\lvert{#1}\rvert} \newcommand{\norm}[1]{\lVert{#1}\rVert} \newcommand{\lr}[1]{\left(#1\right)} \newcommand{\lrb}[1]{\left[#1\right]} \newcommand{\lrbr}[1]{\lbrace#1\rbrace} \newcommand{\Bx}[0]{\mathbf{x}} \begin{aligned} &p(F^{-1}(U) \leq x) \\ =& p(U \leq F(x)) \\ =& \color{Green}F(x) & \color{gray}\textsf{because } p(U\leq u) = u \\ & & \blacksquare \\ \end{aligned} $$

And we’re done.

The drawback is that we do not always have the inverse CDF in an analytically tractable form; hence, the need for methods like Markov chain Monte Carlo (MCMC).
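As a quick illustration (not part of the original problem), here is a minimal NumPy sketch of inverse-CDF sampling for a case where the inverse *is* tractable: the exponential distribution, whose CDF $F(x) = 1 - e^{-\lambda x}$ inverts in closed form to $F^{-1}(u) = -\ln(1-u)/\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(lam, size, rng):
    """Inverse-CDF sampling for Exp(lam): F(x) = 1 - exp(-lam*x)."""
    u = rng.uniform(0.0, 1.0, size)   # draw u ~ Unif(0, 1)
    return -np.log1p(-u) / lam        # x = F^{-1}(u) = -ln(1 - u)/lam

samples = sample_exponential(lam=2.0, size=100_000, rng=rng)
print(samples.mean())  # should be close to 1/lam = 0.5
```

For distributions like the Gaussian, $F^{-1}$ has no elementary closed form, which is exactly the drawback noted above.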

Combining multiple stationary distributions

Show that if both transition kernels $K_1$ and $K_2$ have $p(\cdot)$ as stationary density, so do $K_1 K_2$ and $\lambda K_1 + (1-\lambda) K_2$. In practice, the former corresponds to sampling from $K_1$ and $K_2$ cyclically, while the latter draws from $K_1$ with probability $\lambda$ and from $K_2$ otherwise. Although we are not asked to show it, the extension to more than two kernels is straightforward.

In the continuous case, the cyclic kernel is the composition of the two kernels:

$$ \left(K_1 \circ K_2\right)(z|x)=\int K_2(y|x) K_1(z|y) \mathrm{d} y $$


Cyclic sampling


We let $\pi$ be the vector of probabilities of the stationary distribution and $K$ a matrix where $K_{ji}$ represents the transition probability from $x_i$ to $x_j$.

$$ \begin{aligned} \pi_j &= \sum_i \pi_i K_{ji} \\ \pi &= K \pi \\ \pi &=K_1{\color{Orange}\pi} =K_2\pi \\ &= K_1{\color{Orange}K_2\pi} \\ &= {\color{Green}K_1K_2}\pi \\ \end{aligned} $$
$$ \begin{aligned} p(\cdot) &= \int_x p(x)K_1(\cdot|x)dx = \int_x p(x)K_2(\cdot|x)dx \\ p(z)&= \int_y {\color{Orange}p(y)}K_1(z|y)dy \\ &= \int_y {\color{Orange}\int_x p(x)K_2(y|x)dx}K_1(z|y)dy \\ &= \int_x p(x){\color{Teal} \int_y K_2(y|x)K_1(z|y)dy}dx & \color{Gray}\textsf {rearranging}\\ &= \int_x p(x){\color{Teal}(K_1 \circ K_2)(z|x)}dx & \color{Green}\blacksquare \end{aligned} $$

Sampling with a mixture of kernels

$$ \begin{aligned} \lambda\pi &= \lambda K_1\pi \\ (1-\lambda)\pi &= (1-\lambda)K_2\pi \\ \pi &= \lambda K_1\pi + (1-\lambda)K_2\pi & \color{Gray}\textsf{add first two lines together}\\ \pi &= \underbrace{(\lambda K_1 + (1-\lambda)K_2)}_\textsf{combined kernel}\pi & \color{Green}\blacksquare\\ \end{aligned} $$
$$ \begin{aligned} \lambda p(y) &= \lambda \int_x p(x)K_1(y|x)dx \\ (1-\lambda)p(y) &= (1-\lambda)\int_x p(x)K_2(y|x)dx \\ p(y) &= \lambda \int_x p(x)K_1(y|x)dx + (1-\lambda)\int_x p(x)K_2(y|x)dx & \color{Gray}\textsf{add first two lines together}\\ p(y) &= \int_x p(x)\underbrace{(\lambda K_1(y|x) + (1-\lambda)K_2(y|x))}_\textsf{combined kernel}dx & \color{Green}\blacksquare\\ \end{aligned} $$
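Both results can be checked numerically in the discrete case. The sketch below (my own illustration, not part of the assignment) builds two Metropolis kernels that share the stationary vector $\pi$, using hypothetical proposal matrices, and verifies that $K_1 K_2$ and the $\lambda$-mixture also fix $\pi$. Columns index the current state, matching the $K_{ji}$ convention above.

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])  # target stationary distribution

def metropolis_kernel(pi, q):
    """Column-stochastic Metropolis-Hastings kernel for proposal q[j, i] = q(j|i)."""
    n = len(pi)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if j != i:
                accept = min(1.0, (pi[j] * q[i, j]) / (pi[i] * q[j, i]))
                K[j, i] = q[j, i] * accept
        K[i, i] = 1.0 - K[:, i].sum()  # rejected proposals stay put
    return K

q_uniform = np.full((3, 3), 1 / 3)  # propose any state uniformly
q_cycle = np.array([[0.0, 0.5, 0.5],
                    [0.5, 0.0, 0.5],
                    [0.5, 0.5, 0.0]])  # propose one of the other two states
K1 = metropolis_kernel(pi, q_uniform)
K2 = metropolis_kernel(pi, q_cycle)

lam = 0.3
K_cyclic = K1 @ K2
K_mix = lam * K1 + (1 - lam) * K2
print(np.allclose(K_cyclic @ pi, pi), np.allclose(K_mix @ pi, pi))
```

Since each kernel individually satisfies $K\pi = \pi$, both checks print `True`, mirroring the two derivations above.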

Metropolis-Hastings sampling produces a stationary distribution equal to the target

Recall MH sampling for target distribution $p(x)$ using proposal $q(x|y)$: at state $s$, first draw $t \sim q(t|s)$, then accept $t$ with probability $$ A(t|s) = \min\left(1, \frac{\tilde p(t)q(s|t)}{\tilde p(s)q(t|s)}\right) $$

where $\tilde p(x)$ is the unnormalized target distribution, i.e., $\tilde p(x) \propto p(x)$. Show that $p(x)$ is the stationary distribution of the Markov chain defined by this procedure.

Consider both continuous and discrete cases.


We will show that the detailed balance equation $p(t)T(s|t)=p(s)T(t|s)$ is satisfied, which implies that $p(x)$ is the stationary distribution of the Markov chain (see Koller & Friedman, *Probabilistic Graphical Models: Principles and Techniques*, Proposition 12.3).

Stationarity here means $p(t)=\int_s p(s)T(t|s)\,ds$ in the continuous case and $p(t)=\sum_s p(s)T(t|s)$ in the discrete case; the detailed-balance argument below covers both at once.

When $s\neq t$

$$ \begin{aligned} p(s)T(t|s) &= p(s)q(t|s)A(t|s) \\ &= p(s)q(t|s) \min\left(1,\ \frac{p(t)q(s|t)}{p(s)q(t|s)}\right) &\color{gray}\frac{\tilde p(t)}{\tilde p(s)}=\frac{p(t)}{p(s)} \\ &= \min(p(s)q(t|s),\ p(t)q(s|t)) \\ &= p(t)q(s|t)\min\left(\frac{p(s)q(t|s)}{p(t)q(s|t)},\ 1\right) \\ &= p(t)q(s|t)A(s|t) \\ &= \answer{p(t)T(s|t)} & \blacksquare\\ \end{aligned} $$

When $s=t$

Though $T(s|s)$ takes a different form (it also collects the probability mass of rejected proposals), this does not matter: when $s=t$, both sides of the detailed balance equation are literally the same quantity, so it holds trivially: $$ \answer{p(s)T(t|s)=p(s)T(s|s)=p(t)T(s|t)}$$
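To make the procedure concrete, here is a minimal random-walk Metropolis-Hastings sketch of my own (not from the assignment). The proposal is a symmetric Gaussian, so $q(s|t) = q(t|s)$ and the acceptance ratio reduces to $\tilde p(t)/\tilde p(s)$; the target is a standard normal known only up to its normalizing constant.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_tilde(x):
    return np.exp(-0.5 * x * x)  # unnormalized N(0, 1) density

def mh_chain(n_steps, step=1.0, s0=0.0):
    s, out = s0, np.empty(n_steps)
    for k in range(n_steps):
        t = s + step * rng.normal()  # draw t ~ q(t|s), symmetric proposal
        if rng.uniform() < min(1.0, p_tilde(t) / p_tilde(s)):
            s = t                    # accept with probability A(t|s)
        out[k] = s                   # on rejection, the chain repeats s
    return out

samples = mh_chain(200_000)[50_000:]  # discard burn-in
print(samples.mean(), samples.var())  # approx 0 and 1
```

Note that rejected proposals still emit the current state; dropping them would bias the chain away from $p(x)$.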

Gibbs sampling produces a stationary distribution equal to the target

Recall Gibbs sampling for target distribution $p(x) = p(x_1, \ldots, x_d)$: for each $j \in \{1,\ldots,d\}$, draw $t \sim p(x_j|\textsf{rest})$ and set $x_j = t$. Show that $p(x)$ is the stationary distribution of the Markov chain defined by this procedure.


Again, we prove that the detailed balance equation holds; in the Gibbs case, it is

$$ p(x_{-j},x_j)\,T(x_{-j},x_j'|x_{-j},x_j) = p(x_{-j},x_j')\,T(x_{-j},x_j|x_{-j},x_j') $$

where $T(x_{-j},x_j'|x_{-j},x_j) = p(x_j'|x_{-j})$ and $x_{-j} = \lrbr{x_i : i \neq j}$.

We start with the left side and show it is equal to the right:

$$ \begin{aligned} p(x_{-j},x_j)T(x_{-j},x_j'|x_{-j},x_j) &= p(x_{-j},x_j)p(x_j'|x_{-j}) \\ &= p(x_{-j})p(x_j|x_{-j})p(x_j'|x_{-j}) \\ &= p(x_{-j}, x_j')p(x_j|x_{-j}) \\ &= \answer{p(x_{-j},x_j')T(x_{-j},x_j|x_{-j},x_j')} \\ \end{aligned} $$
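As a concrete illustration of my own (not part of the assignment), here is a minimal Gibbs sketch for a bivariate Gaussian with correlation $\rho$: each full conditional $p(x_j|\textsf{rest})$ is the one-dimensional Gaussian $\mathcal{N}(\rho\, x_{-j},\ 1-\rho^2)$, so every update is an exact draw from a conditional of the target.

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8  # correlation of the target bivariate Gaussian

def gibbs_chain(n_steps):
    x1 = x2 = 0.0
    out = np.empty((n_steps, 2))
    sd = np.sqrt(1 - rho**2)  # conditional standard deviation
    for k in range(n_steps):
        x1 = rho * x2 + sd * rng.normal()  # draw x1 ~ p(x1 | x2)
        x2 = rho * x1 + sd * rng.normal()  # draw x2 ~ p(x2 | x1)
        out[k] = x1, x2
    return out

samples = gibbs_chain(100_000)[10_000:]   # discard burn-in
print(np.corrcoef(samples.T)[0, 1])       # approx rho = 0.8
```

The empirical correlation recovering $\rho$ is the practical counterpart of the stationarity result proved above.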