Markov chain Monte Carlo sampling

Taken from coursework for ECE 7751: Graphical Models in Machine Learning, taught by Faramarz Fekri at Georgia Tech, Spring 2023

Jan 9, 2024 4 min read coursework

Inverse CDF sampling

A simple sampling method adopted by many of the standard math libraries is the inverse probability transform: draw $u \sim \text{Unif}(0, 1)$, then draw $x\sim F^{-1}(u)$, where $F^{-1}$ is the inverse of the CDF $F$. Show that $x$ is distributed according to $F$. What is the drawback of this method?

Solution

We let $U$ represent the uniform random variable and start with the fact that $$p(U \leq u) = u$$

Showing that $x$ is distributed according to $F$ can be done by showing that $F(x) = p(F^{-1}(U) \leq x)$.

So,

$$ \newcommand{\EE}{\mathbb{E}} \newcommand{\ind}{\mathbb{1}} \newcommand{\answertext}[1]{\textcolor{Green}{\fbox{#1}}} \newcommand{\answer}[1]{\answertext{$#1$}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\comment}[1]{\textcolor{gray}{\textrm{#1}}} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\inv}[1]{\frac{1}{#1}} \newcommand{\abs}[1]{\lvert{#1}\rvert} \newcommand{\norm}[1]{\lVert{#1}\rVert} \newcommand{\lr}[1]{\left(#1\right)} \newcommand{\lrb}[1]{\left[#1\right]} \newcommand{\lrbr}[1]{\lbrace#1\rbrace} \newcommand{\Bx}[0]{\mathbf{x}} \begin{aligned} &p(F^{-1}(U) \leq x) \\ =& p(U \leq F(x)) \\ =& \color{Green}F(x) & \color{gray}\textsf{because } p(U\leq u) = u \\ & & \blacksquare \\ \end{aligned} $$

And we’re done.

The drawback is that we do not always have the inverse CDF in an analytically tractable form; hence, the need for methods like Markov chain Monte Carlo (MCMC).

Combining multiple stationary distributions

Show that if both transition kernels $K_1$ and $K_2$ have $p(\cdot)$ as stationary density, so do $K_1 K_2$ and $\lambda K_1 + (1-\lambda) K_2$. In practice, the former corresponds to sampling from $K_1$ and $K_2$ cyclically and the latter draws either $K_1$ with probability $\lambda$ or $K_2$ otherwise. Although it is not required to show, extension to more than two kernels should be straightforward.

In the continuous case, the cyclic kernel can be defined as composition of functions:

$$ \left(K_1 \circ K_2\right)(z|x)=\int K_2(y|x) K_1(z|y) \mathrm{d} y $$

Solution

Cyclic sampling

Discrete

We let $\pi$ be the vector of probabilities of the stationary distribution and $K$ a matrix where $K_{ji}$ represents the transition probability from $x_i$ to $x_j$.

$$ \begin{aligned} \pi_j &= \sum_i \pi_i K_{ji} \\ \pi &= K \pi \\ \pi &=K_1{\color{Orange}\pi} =K_2\pi \\ &= K_1{\color{Orange}K_2\pi} \\ &= {\color{Green}K_1K_2}\pi \\ \end{aligned} $$

Continuous

$$ \begin{aligned} p(\cdot) &= \int_x p(x)K_1(\cdot|x)dx = \int_x p(x)K_2(\cdot|x)dx \\ p(z)&= \int_y {\color{Orange}p(y)}K_1(z|y)dy \\ &= \int_y {\color{Orange}\int_x p(x)K_2(y|x)dx}K_1(z|y)dy \\ &= \int_x p(x){\color{Teal} \int_y K_2(y|x)K_1(z|y)dy}dx & \color{Gray}\textsf {rearranging}\\ &= \int_x p(x){\color{Teal}(K_1 \circ K_2)(z|x)}dx & \color{Green}\blacksquare \end{aligned} $$

Sampling with a mixture of kernels

Discrete

$$ \begin{aligned} \lambda\pi &= \lambda K_1\pi \\ (1-\lambda)\pi &= (1-\lambda)K_2\pi \\ \pi &= \lambda K_1\pi + (1-\lambda)K_2\pi & \color{Gray}\textsf{add first two lines together}\\ \pi &= \underbrace{(\lambda K_1 + (1-\lambda)K_2)}_\textsf{combined kernel}\pi & \color{Green}\blacksquare\\ \end{aligned} $$

Continuous

$$ \begin{aligned} \lambda p(y) &= \lambda \int_x p(x)K_1(y|x)dx \\ (1-\lambda)p(y) &= (1-\lambda)\int_x p(x)K_2(y|x)dx \\ p(y) &= \lambda \int_x p(x)K_1(y|x)dx + (1-\lambda)\int_x p(x)K_2(y|x)dx & \color{Gray}\textsf{add first two lines together}\\ p(y) &= \int_x p(x)\underbrace{(\lambda K_1(y|x) + (1-\lambda)K_2(y|x))}_\textsf{combined kernel}dx & \color{Green}\blacksquare\\ \end{aligned} $$

Metropolis-Hastings sampling produces a stationary distribution equal to the target

Recall MH sampling for target distribution $p(x)$ using proposal $q(x|y)$: at state $s$, first draw $t \sim q(t|s)$, then accept t with probability $$ A(t|s) = \min\left(1, \frac{\tilde p(t)q(s|t)}{\tilde p(s)q(t|s)}\right) $$

where $p(x)$ is the unnormalized target distribution. Show that $p(x)$ is the stationary distribution of the Markov chain defined by this procedure.

Consider both continuous and discrete cases.

Solution

We will show that the detailed balance equation $p(t)T(s|t)=p(s)T(t|s)$ is satisfied, which implies that $p(x)$ is the stationary distribution of the Markov chain (see Koller & Friedman¹, Proposition 12.3).

This holds for both continuous ($p(t)=\int_s p(s)T(t|s)ds$) and discrete ($p(t)=\sum_s p(s)T(t|s)$) state spaces.

When $s\neq t$

$$ \begin{aligned} p(s)T(t|s) &= p(s)q(t|s)A(t|s) \\ &= p(s)q(t|s) \min\left(1,\ \frac{p(t)q(s|t)}{p(s)q(t|s)}\right) &\color{gray}\frac{\tilde p(t)}{\tilde p(s)}=\frac{p(t)}{p(s)} \\ &= \min(p(s)q(t|s),\ p(t)q(s|t)) \\ &= p(t)q(s|t)\min\left(\frac{p(s)q(t|s)}{p(t)q(s|t)},\ 1\right) \\ &= p(t)q(s|t)A(s|t) \\ &= \answer{p(t)T(s|t)} & \blacksquare\\ \end{aligned} $$

When $s=t$

Though $T(s|t)$ has a different form when $s=t$, it does not matter. The proof is trivial; we can simply exchange $s$ and $t$ on one side to get the other: $$ \answer{p(s)T(t|s)=p(s)T(s|s)=p(t)T(s|t)}$$

Gibbs sampling produces a stationary distribution equal to the target

Recall Gibbs sampling for target distribution $p(x) = p(x_1 , \ldots, x_d )$: for each $j \in {1,\ldots,d}$, draw $t \sim p(x_j|\textsf{rest})$ and set $x_j=t$. Show that $p(x)$ is the stationary distribution of the Markov chain defined by this procedure.

Solution

Again, we prove that the detailed balance equation holds; in the case of Gibbs case, it is

$$p(x_{-j},x_j)T(x_{-j},x_j'|x_{-j},x_j)=p(x_{-j},x_j')T(x_{-j},x_j|x_{-j},x_j')$$

where $T(x_{-j},x_j’|x_{-j},x_j) = p(x_j’|x_{-j})$ and $x_{-j} = \lrbr{x_i: i\neq j}$.

We start with the left side and show it is equal to the right:

$$ \begin{aligned} p(x_{-j},x_j)T(x_{-j},x_j'|x_{-j},x_j) &= p(x_{-j},x_j)p(x_j'|x_{-j}) \\ &= p(x_{-j})p(x_j|x_{-j})p(x_j'|x_{-j}) \\ &= p(x_{-j}, x_j')p(x_j|x_{-j}) \\ &= \answer{p(x_{-j},x_j')T(x_{-j},x_j|x_{-j},x_j')} \\ \end{aligned} $$

Probabilistic Graphical Models: Principles and Techniques. Koller & Friedman, 2009 ↩︎

probabilistic graphical models

Kyle Johnsen

PhD Candidate
Biomedical Engineering

My research interests include applying principles from the brain to improve machine learning.

Markov chain Monte Carlo sampling

Inverse CDF sampling

Solution

Combining multiple stationary distributions

Solution

Cyclic sampling

Discrete

Continuous

Sampling with a mixture of kernels

Discrete

Continuous

Metropolis-Hastings sampling produces a stationary distribution equal to the target

Solution

When $s\neq t$

When $s=t$

Gibbs sampling produces a stationary distribution equal to the target

Solution

Kyle Johnsen

PhD CandidateBiomedical Engineering

PhD Candidate
Biomedical Engineering