<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Probabilistic Graphical Models |</title><link>https://example.com/tags/probabilistic-graphical-models/</link><atom:link href="https://example.com/tags/probabilistic-graphical-models/index.xml" rel="self" type="application/rss+xml"/><description>Probabilistic Graphical Models</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Thu, 11 Jan 2024 00:00:00 +0000</lastBuildDate><image><url>https://example.com/media/logo.svg</url><title>Probabilistic Graphical Models</title><link>https://example.com/tags/probabilistic-graphical-models/</link></image><item><title>Mean field variational inference</title><link>https://example.com/blog/mean-field/</link><pubDate>Thu, 11 Jan 2024 00:00:00 +0000</pubDate><guid>https://example.com/blog/mean-field/</guid><description>&lt;p&gt;In this problem, you will investigate &lt;em&gt;mean field&lt;/em&gt; approximate inference algorithms (Koller &amp;amp; Friedman&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt; 11.5).
Consider the Markov network in the above figure.
Define edge potentials $\phi_{ij}(x_i,x_j)$ for all edges $(x_i , x_j)$ in the graph.
We can write&lt;/p&gt;
$$
P\left(x_1, \ldots, x_{12}\right)=\frac{1}{Z} \prod_{(i, j) \in E} \phi_{i j}\left(x_i, x_j\right)
$$&lt;h2 id="fully-factored-mean-field"&gt;Fully factored mean field&lt;/h2&gt;
&lt;p&gt;Assume a fully factored mean field approximation $Q$ (&lt;em&gt;b&lt;/em&gt; in figure), parameterized by node potentials $Q_i$.&lt;/p&gt;
&lt;p&gt;In both of the cases below, please expand out any expectations in the formulas (your answer should be in terms of $Q_i$ and $\phi_{ij}$).&lt;/p&gt;
&lt;p&gt;Write down the update formulas for $Q_1(X_1)$ and $Q_6(X_6)$.&lt;/p&gt;
&lt;h3 id="solution"&gt;Solution&lt;/h3&gt;
&lt;h4&gt;$Q_1(X_1)$&lt;/h4&gt;
&lt;p&gt;Using the update formula for a fully factored mean field, where $D_j$ represents clique $j$,&lt;/p&gt;
$$
\begin{aligned}
Q_i(X_i) &amp;= \frac{1}{Z_i} \exp\left(\sum_{D_j: X_i \in D_j} \mathbb{E}_{Q_{-i}} \ln \phi(D_j)\right) \\
Q_1(X_1) &amp;= \frac{1}{Z_1} \exp\left(\mathbb{E}_{Q_{2}} \ln \phi(X_1,X_2) + \mathbb{E}_{Q_{5}} \ln \phi(X_1,X_5)\right) \\
&amp;= \color{Green} \frac{1}{Z_1} \exp\left(\sum_{X_2} Q_2(X_2) \ln\phi(X_1,X_2) + \sum_{X_5} Q_5(X_5) \ln\phi(X_1,X_5)\right) \\
\end{aligned}
\newcommand{\EE}{\mathbb{E}}
\newcommand{\ind}{\mathbb{1}}
\newcommand{\answertext}[1]{\textcolor{Green}{\fbox{#1}}}
\newcommand{\answer}[1]{\answertext{$#1$}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\comment}[1]{\textcolor{gray}{\textrm{#1}}}
\newcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\inv}[1]{\frac{1}{#1}}
\newcommand{\abs}[1]{\lvert{#1}\rvert}
\newcommand{\norm}[1]{\lVert{#1}\rVert}
\newcommand{\lr}[1]{\left(#1\right)}
\newcommand{\lrb}[1]{\left[#1\right]}
\newcommand{\lrbr}[1]{\lbrace#1\rbrace}
\newcommand{\Bx}[0]{\mathbf{x}}
$$
&lt;h4&gt;$Q_6(X_6)$&lt;/h4&gt;
&lt;p&gt;The derivation is similar to the previous part:&lt;/p&gt;
$$
\begin{multline}
Q_6(X_6) = \color{Green} \frac{1}{Z_6} \exp\Biggl(
\sum_{X_2} Q_2(X_2)\ln \phi(X_2,X_6) + \sum_{X_5} Q_5(X_5) \ln\phi(X_5,X_6) \\
\color{Green} + \sum_{X_7} Q_7(X_7)\ln \phi(X_6,X_7) + \sum_{X_{10}} Q_{10}(X_{10})\ln \phi(X_6,X_{10})
\Biggr)
\end{multline}
$$
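&lt;p&gt;As a sanity check, the updates above can be run directly. The sketch below is illustrative only: it assumes binary variables, encodes the 3×4 grid adjacency by hand, and uses random positive tables in place of the (unspecified) potentials $\phi_{ij}$ from the figure:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2  # binary variables

# 3×4 grid from the figure, nodes numbered 1..12 row by row (hypothetical encoding)
edges = [(1, 2), (2, 3), (3, 4), (5, 6), (6, 7), (7, 8), (9, 10), (10, 11), (11, 12),
         (1, 5), (2, 6), (3, 7), (4, 8), (5, 9), (6, 10), (7, 11), (8, 12)]
phi = {e: rng.uniform(0.5, 2.0, size=(K, K)) for e in edges}  # random stand-in potentials
Q = {i: np.full(K, 1.0 / K) for i in range(1, 13)}            # uniform initialization

def update(i):
    """Q_i(x_i) ∝ exp(Σ_j Σ_{x_j} Q_j(x_j) ln φ_ij(x_i, x_j)), j ranging over neighbors of i."""
    log_q = np.zeros(K)
    for (a, b), table in phi.items():
        if a == i:
            log_q += np.log(table) @ Q[b]    # E_{Q_b}[ln φ(x_i, X_b)]
        elif b == i:
            log_q += np.log(table).T @ Q[a]  # E_{Q_a}[ln φ(X_a, x_i)]
    q = np.exp(log_q - log_q.max())          # subtract max for numerical stability
    Q[i] = q / q.sum()                       # the 1/Z_i normalization

for _ in range(20):          # coordinate ascent sweeps over all nodes
    for i in range(1, 13):
        update(i)

print(Q[1], Q[6])
```

&lt;p&gt;Each $Q_i$ remains a normalized distribution after every update, matching the $1/Z_i$ factor in the formulas above.&lt;/p&gt;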
&lt;h2 id="structured-mean-field"&gt;Structured mean field&lt;/h2&gt;
&lt;p&gt;Now we consider a structured mean field approximation $Q$ (&lt;em&gt;c&lt;/em&gt; in figure), parameterized by edge potentials $\psi_{ij}(x_i,x_j)$ for each edge $(x_i , x_j)$.&lt;/p&gt;
&lt;h3&gt;$\psi_{12}(x_1,x_2)$&lt;/h3&gt;
&lt;p&gt;Write down the update formula for $\psi_{12}(x_1,x_2)$ up to a proportionality constant. This time, you can write it in terms of expected values, but do not include unnecessary terms.&lt;/p&gt;
&lt;h4 id="solution-1"&gt;Solution&lt;/h4&gt;
&lt;p&gt;We start with equation (11.62) from Koller &amp;amp; Friedman&lt;sup id="fnref1:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;:
$$
\psi_j\left(D_j\right) \propto \exp \left\{\sum_{\phi \in A_j} \mathbb{E}_{\mathcal{X} \sim Q}\left[\ln \phi \mid D_j\right]-\sum_{\psi_k \in B_j} \mathbb{E}_{\mathcal{X} \sim Q}\left[\ln \psi_k \mid D_j\right]\right\},
$$
&lt;/p&gt;
&lt;p&gt;where&lt;/p&gt;
$$
A_j=\left\{\phi \in \Phi: Q \not \models\left(\mathbf{U}_\phi \perp \mathbf{D}_j\right)\right\}
$$
$$
B_j=\left\{\psi_k: Q \not \models\left(\mathbf{D}_k \perp \mathbf{D}_j\right)\right\}-\left\{\mathbf{D}_j\right\}.
$$
&lt;p&gt;$Q \not\models$ means that the independence statement does not hold in $Q$, and $\mathbf{U}_\phi$ refers to all variables involved in the factor $\phi$.
The sets $A_j$ and $B_j$ collect exactly those factors that are not independent of the clique $D_j$ under our approximate distribution $Q$.&lt;/p&gt;
&lt;p&gt;Thus, we have&lt;/p&gt;
$$
\newcommand{\EE}{\mathbb{E}}
\begin{multline}
\psi_{12}(x_1,x_2) \propto \\
\color{Green} \exp \left(
\begin{aligned}
&amp;\mathbb{E}_Q[\ln\phi_{12}(x_1,x_2)|x_1,x_2] + \mathbb{E}_Q[\ln\phi_{23}(x_2,x_3)|x_1,x_2] + \mathbb{E}_Q[\ln\phi_{34}(x_3,x_4)|x_1,x_2] \\
&amp;+ \mathbb{E}_Q[\ln\phi_{15}(x_1,x_5)|x_1,x_2] + \mathbb{E}_Q[\ln\phi_{26}(x_2,x_6)|x_1,x_2] + \mathbb{E}_Q[\ln\phi_{37}(x_3,x_7)|x_1,x_2] + \mathbb{E}_Q[\ln\phi_{48}(x_4,x_8)|x_1,x_2] \\
&amp;- \mathbb{E}_Q[\ln\psi_{23}(x_2,x_3)|x_1,x_2] - \mathbb{E}_Q[\ln\psi_{34}(x_3,x_4)|x_1,x_2] \\
\end{aligned}
\right)
\end{multline}
$$
&lt;h3 id="hahahugoshortcode75s5hbhb"&gt; $\mathbb{E}_Q[\ln\phi_{26}(X_2,X_6)|x_1,x_2]$
&lt;/h3&gt;
&lt;p&gt;Write out the formula for $\mathbb{E}_Q[\ln\phi_{26}(X_2,X_6)|x_1,x_2]$.
Make sure to show how you would calculate the distribution that this expectation is over.&lt;/p&gt;
&lt;h4 id="solution-2"&gt;Solution&lt;/h4&gt;
&lt;p&gt;We only need to take the expectation over the variables in the function:&lt;/p&gt;
$$ \EE_{Q(X_1,\ldots,X_n)} f(X_1) = \EE_{Q(X_1)}f(X_1) $$&lt;p&gt;Also, we need not take the expectation over a variable being conditioned on:
&lt;/p&gt;
$$ \EE_{Q(X_1, X_2|x_1)}[f(X_1,X_2)] = \EE_{Q(X_2|x_1)}[f(x_1, X_2)] $$&lt;p&gt;Hence,
$$
\begin{aligned}
&amp;\EE_{Q(X_1,\ldots,X_n)}[\ln\phi_{26}(X_2,X_6)|x_1,x_2] \\
&amp;= \EE_{Q(X_1,\ldots,X_n|x_1,x_2)}\ln\phi_{26}(X_2,X_6) \\
&amp;= \EE_{Q(X_6|x_1,x_2)}\ln\phi_{26}(X_2,X_6) \\
&amp;= \EE_{Q(X_6|x_1,x_2)}\ln\phi_{26}(x_2,X_6) \\
&amp;= \EE_{Q(X_6)}\ln\phi_{26}(x_2,X_6) &amp;\color{Gray}\textsf{since $X_6 \perp X_1,X_2$ in $Q$} \\
\end{aligned}
$$
$$
\begin{aligned}
Q(X_6) &amp;= \sum_{\mathcal{X}\backslash\lrbr{X_6}} Q(X_1,\ldots,X_n) \\
&amp;= \sum_{x_5,x_7,x_8} Q(x_5,X_6,x_7,x_8) \\
&amp;= \inv{Z_{C_2}} \sum_{x_5,x_7,x_8} \psi(x_5,X_6)\psi(X_6,x_7)\psi(x_7,x_8) \\
\end{aligned}
$$
$$
\color{Green} \EE_Q[\ln\phi_{26}(X_2,X_6)|x_1,x_2] = \inv{Z_{C_2}}\sum_{x_5,x_6,x_7,x_8} \ln\phi(x_2,x_6)\psi(x_5,x_6)\psi(x_6,x_7)\psi(x_7,x_8)\\
$$
&lt;/p&gt;
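&lt;p&gt;The marginal $Q(X_6)$ and the resulting expectation can be evaluated by brute-force summation over the chain $X_5, X_6, X_7, X_8$. A minimal sketch with binary variables and hypothetical random tables standing in for $\psi$ and $\phi_{26}$:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2  # binary variables

# hypothetical edge potentials for the chain C2: X5 - X6 - X7 - X8
psi56, psi67, psi78 = (rng.uniform(0.5, 2.0, size=(K, K)) for _ in range(3))

# Q(x6) ∝ Σ_{x5,x7,x8} ψ(x5,x6) ψ(x6,x7) ψ(x7,x8): sum out indices a, c, d, keep b
Q6 = np.einsum('ab,bc,cd->b', psi56, psi67, psi78)
Q6 /= Q6.sum()  # dividing by the total implements the 1/Z_{C2} normalization

# E_{Q(X6)}[ln φ26(x2, X6)] for a hypothetical φ26 table and an observed x2
phi26 = rng.uniform(0.5, 2.0, size=(K, K))
x2 = 1
expectation = Q6 @ np.log(phi26[x2])  # Σ_{x6} Q(x6) ln φ26(x2, x6)
print(Q6, expectation)
```

&lt;p&gt;The &lt;code&gt;einsum&lt;/code&gt; call performs exactly the triple sum in the derivation, with the free index $x_6$ kept unsummed.&lt;/p&gt;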
&lt;h3&gt;$\EE_Q[\ln\phi_{15}(X_1,X_5)|x_1,x_2]$&lt;/h3&gt;
&lt;p&gt;Write out the formula for $\EE_Q[\ln\phi_{15}(X_1,X_5)|x_1,x_2]$.
Again, show how you would evaluate distribution $Q$.&lt;/p&gt;
&lt;h4 id="solution-3"&gt;Solution&lt;/h4&gt;
&lt;p&gt;We follow the same steps and arrive at a very similar answer:&lt;/p&gt;
$$
\color{Green} \EE_Q[\ln\phi_{15}(X_1,X_5)|x_1,x_2] = \inv{Z_{C_2}} \sum_{x_5,x_6,x_7,x_8} \ln\phi(x_1,x_5)\psi(x_5,x_6)\psi(x_6,x_7)\psi(x_7,x_8)\\
$$
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Koller, D. &amp;amp; Friedman, N. (2009). &lt;em&gt;Probabilistic Graphical Models: Principles and Techniques&lt;/em&gt;. MIT Press.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Approximate inference via Gibbs sampling</title><link>https://example.com/blog/gibbs-approx-inference/</link><pubDate>Tue, 09 Jan 2024 00:00:00 +0000</pubDate><guid>https://example.com/blog/gibbs-approx-inference/</guid><description>&lt;p&gt;Consider a setting in which there are $D$ diseases and a patient either has ($d_i=1$) or does not have ($d_i=0$) each disease.
The hospital can measure $S$ symptoms, where $s_j=1$ when the patient has the symptom and $s_j=0$ otherwise.
A simple Bayesian network for this setting is given by:&lt;/p&gt;
$$ p\left(s_1, \ldots, s_S, d_1, \ldots, d_D\right)=\prod_{j=1}^S p\left(s_j \mid \mathbf{d}\right) \prod_{i=1}^D p\left(d_i\right) $$&lt;p&gt;where $\mathbf{d}=\left(d_1, \ldots, d_D\right)^T$ and&lt;/p&gt;
$$ p\left(s_j=1 \mid \mathbf{d}\right)=\sigma\left(\mathbf{w}_j^T \mathbf{d}+b_j\right) $$&lt;p&gt;where $\sigma(x) = 1/(1+e^{-x})$.&lt;/p&gt;
&lt;p&gt;In the above $\mathbf{w}_j$ is a vector of parameters relating symptom $j$ to the diseases and $b_j$ is related to the prevalence of the symptom.
The hospital provides the collection of parameters $W$ and $b$, the prior disease probabilities $p$ (with $p(d_i = 1) = p_i$), and a vector $s$ of symptoms for the patient; see &lt;code&gt;SymptomDisease.mat&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Use Gibbs sampling (with a reasonable amount of burn-in and sub-sampling) to estimate the vector&lt;/p&gt;
$$ \bigl[p(d_1=1|s),\ldots,p(d_D=1|s)\bigr] $$&lt;h2 id="solution"&gt;Solution&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;loadmat&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loadmat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;SymptomDisease.mat&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;W&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;p&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;globals&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;squeeze&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;W.shape = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;W.shape = (200, 50)
(50,)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To perform Gibbs sampling, we need $p(d_i|d_{-i},s)$, where $d_{-i}$ refers to $\{d_1,\ldots,d_{i-1},d_{i+1},\ldots,d_D\}$.
We can compute this using Bayes&amp;rsquo; rule:&lt;/p&gt;
$$ p(d_i|d_{-i},s) = \frac{p(s,d)}{p(s,d_{-i})} = \frac{p(d)p(s|d)}{p(d_{-i})p(s|d_{-i})} = \frac{p(d)p(s|d)}{\sum_{d_i} p(d)p(s|d)}$$
&lt;p&gt;We can compute the denominator by summing over the numerator when $d_i=1$ and $d_i=0$.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;σ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# multiply symptom probs p(s_j|d)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;σ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="nd"&gt;@d&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ps&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# and disease probs p(d_i)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sample_d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p_d_s_di1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p_d_s_di0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;condp_di&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p_d_s_di1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_d_s_di1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;p_d_s_di0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;binomial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;condp_di&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;sample_d&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;array([1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1,
0, 0, 1, 1, 0, 1])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can use the complete-data likelihood $p(s, d)$ not only to sample, but also as a principled heuristic for deciding when burn-in is complete: we run sweeps until the likelihood stops improving for ten consecutive iterations.&lt;/p&gt;
&lt;p&gt;We then keep every $k$-th sweep, with $k$ set to the burn-in length, and average the retained samples to estimate the disease posterior probabilities.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gibbs_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;iters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# burn-in period&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;iters_without_improvement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;curr_lklhd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;lklhd_max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;iters_without_improvement&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;iters&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample_d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;curr_lklhd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;curr_lklhd&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;lklhd_max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;iters_without_improvement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;lklhd_max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;curr_lklhd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;iters_without_improvement&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;burn-in period complete after &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; iterations&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;burn_in_iters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;iters&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Using burn-in time to determine sampling frequency&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Taking every &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;burn_in_iters&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;th sample&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;iters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;dtot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;iters&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;iters&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;burn_in_iters&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample_d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;dtot&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Converged in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; samples (taken out of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;burn_in_iters&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; total samples)&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dtot&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;iters&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;posteriors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gibbs_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;posteriors&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;burn-in period complete after 19 iterations
Using burn-in time to determine sampling frequency
Taking every 19th sample
Converged in 500 samples (taken out of 9500 total samples)
[ 3. 498. 9. 500. 333. 4. 8. 0. 3. 500. 1. 500. 500. 500.
492. 481. 491. 453. 500. 496. 46. 359. 500. 499. 5. 500. 12. 0.
500. 0. 0. 47. 6. 500. 0. 1. 495. 0. 0. 0. 497. 499.
500. 500. 0. 5. 499. 499. 0. 500.]
array([0.006, 0.996, 0.018, 1. , 0.666, 0.008, 0.016, 0. , 0.006,
1. , 0.002, 1. , 1. , 1. , 0.984, 0.962, 0.982, 0.906,
1. , 0.992, 0.092, 0.718, 1. , 0.998, 0.01 , 1. , 0.024,
0. , 1. , 0. , 0. , 0.094, 0.012, 1. , 0. , 0.002,
0.99 , 0. , 0. , 0. , 0.994, 0.998, 1. , 1. , 0. ,
0.01 , 0.998, 0.998, 0. , 1. ])
&lt;/code&gt;&lt;/pre&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posteriors&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;p(d_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;|s)&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s1"&gt;= &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s1"&gt;.3f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;p(d_1|s) = 0.006
p(d_2|s) = 0.996
p(d_3|s) = 0.018
p(d_4|s) = 1.000
p(d_5|s) = 0.666
p(d_6|s) = 0.008
p(d_7|s) = 0.016
p(d_8|s) = 0.000
p(d_9|s) = 0.006
p(d_10|s) = 1.000
p(d_11|s) = 0.002
p(d_12|s) = 1.000
p(d_13|s) = 1.000
p(d_14|s) = 1.000
p(d_15|s) = 0.984
p(d_16|s) = 0.962
p(d_17|s) = 0.982
p(d_18|s) = 0.906
p(d_19|s) = 1.000
p(d_20|s) = 0.992
p(d_21|s) = 0.092
p(d_22|s) = 0.718
p(d_23|s) = 1.000
p(d_24|s) = 0.998
p(d_25|s) = 0.010
p(d_26|s) = 1.000
p(d_27|s) = 0.024
p(d_28|s) = 0.000
p(d_29|s) = 1.000
p(d_30|s) = 0.000
p(d_31|s) = 0.000
p(d_32|s) = 0.094
p(d_33|s) = 0.012
p(d_34|s) = 1.000
p(d_35|s) = 0.000
p(d_36|s) = 0.002
p(d_37|s) = 0.990
p(d_38|s) = 0.000
p(d_39|s) = 0.000
p(d_40|s) = 0.000
p(d_41|s) = 0.994
p(d_42|s) = 0.998
p(d_43|s) = 1.000
p(d_44|s) = 1.000
p(d_45|s) = 0.000
p(d_46|s) = 0.010
p(d_47|s) = 0.998
p(d_48|s) = 0.998
p(d_49|s) = 0.000
p(d_50|s) = 1.000
&lt;/code&gt;&lt;/pre&gt;</description></item><item><title>Markov chain Monte Carlo sampling</title><link>https://example.com/blog/mcmc/</link><pubDate>Tue, 09 Jan 2024 00:00:00 +0000</pubDate><guid>https://example.com/blog/mcmc/</guid><description>&lt;h2 id="inverse-cdf-sampling"&gt;Inverse CDF sampling&lt;/h2&gt;
&lt;p&gt;A simple sampling method adopted by many of the standard math libraries is the &lt;em&gt;inverse probability transform&lt;/em&gt;: draw $u \sim \text{Unif}(0, 1)$, then set $x = F^{-1}(u)$, where $F^{-1}$ is the inverse of the CDF $F$. Show that $x$ is distributed according to $F$. What is the drawback of this method?&lt;/p&gt;
&lt;h3 id="solution"&gt;Solution&lt;/h3&gt;
&lt;p&gt;We let $U$ represent the uniform random variable and start with the fact that &lt;/p&gt;
$$p(U \leq u) = u$$&lt;p&gt;Showing that $x$ is distributed according to $F$ can be done by showing that $F(x) = p(F^{-1}(U) \leq x)$.&lt;/p&gt;
&lt;p&gt;So,&lt;/p&gt;
$$
\newcommand{\EE}{\mathbb{E}}
\newcommand{\ind}{\mathbb{1}}
\newcommand{\answertext}[1]{\textcolor{Green}{\fbox{#1}}}
\newcommand{\answer}[1]{\answertext{$#1$}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\comment}[1]{\textcolor{gray}{\textrm{#1}}}
\newcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\inv}[1]{\frac{1}{#1}}
\newcommand{\abs}[1]{\lvert{#1}\rvert}
\newcommand{\norm}[1]{\lVert{#1}\rVert}
\newcommand{\lr}[1]{\left(#1\right)}
\newcommand{\lrb}[1]{\left[#1\right]}
\newcommand{\lrbr}[1]{\lbrace#1\rbrace}
\newcommand{\Bx}[0]{\mathbf{x}}
\begin{aligned}
&amp;p(F^{-1}(U) \leq x) \\
=&amp; p(U \leq F(x)) \\
=&amp; \color{Green}F(x) &amp; \color{gray}\textsf{because } p(U\leq u) = u \\
&amp; &amp; \blacksquare \\
\end{aligned}
$$
&lt;p&gt;And we&amp;rsquo;re done.&lt;/p&gt;
&lt;p&gt;The drawback is that we do not always have the inverse CDF in an analytically tractable form; hence, the need for methods like Markov chain Monte Carlo (MCMC).&lt;/p&gt;
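&lt;p&gt;As a concrete illustration (not part of the original problem), here is a minimal sketch of inverse-CDF sampling for the exponential distribution, one of the cases where the inverse CDF &lt;em&gt;is&lt;/em&gt; available in closed form: $F(x) = 1 - e^{-\lambda x}$ gives $F^{-1}(u) = -\ln(1-u)/\lambda$.&lt;/p&gt;

```python
import numpy as np

def inverse_cdf_sample(inv_cdf, n, rng):
    """Push Unif(0, 1) draws through the inverse CDF to sample from F."""
    u = rng.random(n)
    return inv_cdf(u)

# Exponential(rate): F(x) = 1 - exp(-rate * x), so Finv(u) = -ln(1 - u) / rate
rate = 2.0
rng = np.random.default_rng(0)
samples = inverse_cdf_sample(lambda u: -np.log(1.0 - u) / rate, 100_000, rng)

# the sample mean should be close to the true mean 1 / rate = 0.5
print(samples.mean())
```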
&lt;h2 id="combining-multiple-stationary-distributions"&gt;Combining multiple stationary distributions&lt;/h2&gt;
&lt;p&gt;Show that if transition kernels $K_1$ and $K_2$ both have $p(\cdot)$ as their stationary density, then so do $K_1 K_2$ and $\lambda K_1 + (1-\lambda) K_2$.
In practice, the former corresponds to sampling from $K_1$ and $K_2$ cyclically, while the latter corresponds to drawing from $K_1$ with probability $\lambda$ and from $K_2$ otherwise.
Although it is not required to show, the extension to more than two kernels is straightforward.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;In the continuous case, the cyclic kernel can be defined as composition of functions:&lt;/strong&gt;&lt;/p&gt;
$$
\left(K_1 \circ K_2\right)(z|x)=\int K_2(y|x) K_1(z|y) \mathrm{d} y
$$&lt;h3 id="solution-1"&gt;Solution&lt;/h3&gt;
&lt;h4 id="cyclic-sampling"&gt;Cyclic sampling&lt;/h4&gt;
&lt;h5 id="discrete"&gt;Discrete&lt;/h5&gt;
&lt;p&gt;We let $\pi$ be the vector of probabilities of the stationary distribution and $K$ a matrix where $K_{ji}$ represents the transition probability from $x_i$ to $x_j$.&lt;/p&gt;
$$
\begin{aligned}
\pi_j &amp;= \sum_i \pi_i K_{ji} \\
\pi &amp;= K \pi \\
\pi &amp;=K_1{\color{Orange}\pi} =K_2\pi \\
&amp;= K_1{\color{Orange}K_2\pi} \\
&amp;= {\color{Green}K_1K_2}\pi \\
\end{aligned}
$$
&lt;h5 id="continuous"&gt;Continuous&lt;/h5&gt;
$$
\begin{aligned}
p(\cdot) &amp;= \int_x p(x)K_1(\cdot|x)dx = \int_x p(x)K_2(\cdot|x)dx \\
p(z)&amp;= \int_y {\color{Orange}p(y)}K_1(z|y)dy \\
&amp;= \int_y {\color{Orange}\int_x p(x)K_2(y|x)dx}K_1(z|y)dy \\
&amp;= \int_x p(x){\color{Teal} \int_y K_2(y|x)K_1(z|y)dy}dx &amp; \color{Gray}\textsf {rearranging}\\
&amp;= \int_x p(x){\color{Teal}(K_1 \circ K_2)(z|x)}dx &amp; \color{Green}\blacksquare
\end{aligned}
$$
&lt;h4 id="sampling-with-a-mixture-of-kernels"&gt;Sampling with a mixture of kernels&lt;/h4&gt;
&lt;h5 id="discrete-1"&gt;Discrete&lt;/h5&gt;
$$
\begin{aligned}
\lambda\pi &amp;= \lambda K_1\pi \\
(1-\lambda)\pi &amp;= (1-\lambda)K_2\pi \\
\pi &amp;= \lambda K_1\pi + (1-\lambda)K_2\pi &amp; \color{Gray}\textsf{add first two lines together}\\
\pi &amp;= \underbrace{(\lambda K_1 + (1-\lambda)K_2)}_\textsf{combined kernel}\pi &amp; \color{Green}\blacksquare\\
\end{aligned}
$$
&lt;h5 id="continuous-1"&gt;Continuous&lt;/h5&gt;
$$
\begin{aligned}
\lambda p(y) &amp;= \lambda \int_x p(x)K_1(y|x)dx \\
(1-\lambda)p(y) &amp;= (1-\lambda)\int_x p(x)K_2(y|x)dx \\
p(y) &amp;= \lambda \int_x p(x)K_1(y|x)dx + (1-\lambda)\int_x p(x)K_2(y|x)dx &amp; \color{Gray}\textsf{add first two lines together}\\
p(y) &amp;= \int_x p(x)\underbrace{(\lambda K_1(y|x) + (1-\lambda)K_2(y|x))}_\textsf{combined kernel}dx &amp; \color{Green}\blacksquare\\
\end{aligned}
$$
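&lt;p&gt;Both results are easy to check numerically. The sketch below (an illustrative aside, not part of the original problem) builds two distinct reversible kernels sharing the same stationary vector $\pi$ from symmetric flow matrices, then verifies that $K_1 K_2$ and the mixture $\lambda K_1 + (1-\lambda) K_2$ both leave $\pi$ invariant:&lt;/p&gt;

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])  # target stationary distribution

def kernel_from_flows(F, pi):
    """Build a column-stochastic kernel K (K[j, i] = transition prob from i to j)
    from a symmetric flow matrix F with zero diagonal. By construction
    pi[i] * K[j, i] = F[j, i] = F[i, j] = pi[j] * K[i, j] (detailed balance)."""
    K = F / pi                                         # off-diagonal entries
    K[np.diag_indices(len(pi))] = 1.0 - K.sum(axis=0)  # leftover mass stays put
    return K

# two different flow matrices, hence two different kernels, same stationary pi
F1 = np.array([[0.0, 0.10, 0.05], [0.10, 0.0, 0.10], [0.05, 0.10, 0.0]])
F2 = np.array([[0.0, 0.05, 0.10], [0.05, 0.0, 0.20], [0.10, 0.20, 0.0]])
K1, K2 = kernel_from_flows(F1, pi), kernel_from_flows(F2, pi)

lam = 0.3
for K in (K1, K2, K1 @ K2, lam * K1 + (1.0 - lam) * K2):
    assert np.allclose(K.sum(axis=0), 1.0)  # columns sum to one
    assert np.allclose(K @ pi, pi)          # pi = K pi holds for all four
print("pi is stationary for K1, K2, K1 K2, and the mixture")
```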
&lt;h2 id="metropolis-hastings-sampling-produces-a-stationary-distribution-equal-to-the-target"&gt;Metropolis-Hastings sampling produces a stationary distribution equal to the target&lt;/h2&gt;
&lt;p&gt;Recall MH sampling for target distribution $p(x)$ using proposal $q(x|y)$: at state $s$, first draw $t \sim q(t|s)$, then accept $t$ with probability &lt;/p&gt;
$$ A(t|s) = \min\left(1, \frac{\tilde p(t)q(s|t)}{\tilde p(s)q(t|s)}\right) $$&lt;p&gt;where $\tilde p(x)$ is the unnormalized target distribution. Show that $p(x)$ is the stationary distribution of the Markov chain defined by this procedure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consider both continuous and discrete cases.&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id="solution-2"&gt;Solution&lt;/h3&gt;
&lt;p&gt;We will show that the detailed balance equation $p(t)T(s|t)=p(s)T(t|s)$ is satisfied, which implies that $p(x)$ is the stationary distribution of the Markov chain (see Koller &amp;amp; Friedman&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;, Proposition 12.3).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This holds for both continuous&lt;/strong&gt; ($p(t)=\int_s p(s)T(t|s)ds$) and discrete ($p(t)=\sum_s p(s)T(t|s)$) state spaces.&lt;/p&gt;
&lt;h4 id="when"&gt;When $s\neq t$&lt;/h4&gt;
$$
\begin{aligned}
p(s)T(t|s) &amp;= p(s)q(t|s)A(t|s) \\
&amp;= p(s)q(t|s) \min\left(1,\ \frac{p(t)q(s|t)}{p(s)q(t|s)}\right) &amp;\color{gray}\frac{\tilde p(t)}{\tilde p(s)}=\frac{p(t)}{p(s)} \\
&amp;= \min(p(s)q(t|s),\ p(t)q(s|t)) \\
&amp;= p(t)q(s|t)\min\left(\frac{p(s)q(t|s)}{p(t)q(s|t)},\ 1\right) \\
&amp;= p(t)q(s|t)A(s|t) \\
&amp;= \answer{p(t)T(s|t)} &amp; \blacksquare\\
\end{aligned}
$$
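&lt;p&gt;As an illustrative aside (not part of the original problem), a minimal MH sampler over a finite state space makes the result concrete. With a uniform (hence symmetric) proposal, $q$ cancels in the acceptance ratio, and the chain's empirical frequencies approach the normalized target even though we only ever evaluate $\tilde p$:&lt;/p&gt;

```python
import numpy as np

def mh_discrete(p_tilde, n_samples, rng):
    """Minimal MH sampler on a finite state space. The proposal is uniform
    over all states, hence symmetric, so q cancels in the acceptance ratio."""
    n = len(p_tilde)
    s = 0
    counts = np.zeros(n)
    for _ in range(n_samples):
        t = rng.integers(n)                    # propose t uniformly
        A = min(1.0, p_tilde[t] / p_tilde[s])  # acceptance probability
        if rng.binomial(1, A):                 # accept with probability A
            s = t
        counts[s] += 1
    return counts / n_samples

p_tilde = np.array([2.0, 5.0, 3.0])  # unnormalized target
freq = mh_discrete(p_tilde, 100_000, np.random.default_rng(0))
print(freq)  # should approach p_tilde / p_tilde.sum() = [0.2, 0.5, 0.3]
```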
&lt;h4 id="when-1"&gt;When $s=t$&lt;/h4&gt;
&lt;p&gt;Though $T(s|t)$ has a different form when $s=t$, it does not matter. The proof is trivial; we can simply exchange $s$ and $t$ on one side to get the other: &lt;/p&gt;
$$ \answer{p(s)T(t|s)=p(s)T(s|s)=p(t)T(s|t)}$$&lt;h2 id="gibbs-sampling-produces-a-stationary-distribution-equal-to-the-target"&gt;Gibbs sampling produces a stationary distribution equal to the target&lt;/h2&gt;
&lt;p&gt;Recall Gibbs sampling for target distribution $p(x) = p(x_1, \ldots, x_d)$: for each $j \in \{1,\ldots,d\}$, draw $t \sim p(x_j|\textsf{rest})$ and set $x_j=t$. Show that $p(x)$ is the stationary distribution of the Markov chain defined by this procedure.&lt;/p&gt;
&lt;h3 id="solution-3"&gt;Solution&lt;/h3&gt;
&lt;p&gt;Again, we prove that the detailed balance equation holds; for Gibbs sampling, it is&lt;/p&gt;
$$p(x_{-j},x_j)T(x_{-j},x_j'|x_{-j},x_j)=p(x_{-j},x_j')T(x_{-j},x_j|x_{-j},x_j')$$
&lt;p&gt;where $T(x_{-j},x_j'|x_{-j},x_j) = p(x_j'|x_{-j})$ and $x_{-j} = \lrbr{x_i: i\neq j}$.&lt;/p&gt;
&lt;p&gt;We start with the left side and show it is equal to the right:&lt;/p&gt;
$$
\begin{aligned}
p(x_{-j},x_j)T(x_{-j},x_j'|x_{-j},x_j) &amp;= p(x_{-j},x_j)p(x_j'|x_{-j}) \\
&amp;= p(x_{-j})p(x_j|x_{-j})p(x_j'|x_{-j}) \\
&amp;= p(x_{-j}, x_j')p(x_j|x_{-j}) \\
&amp;= \answer{p(x_{-j},x_j')T(x_{-j},x_j|x_{-j},x_j')} \\
\end{aligned}
$$
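&lt;p&gt;This identity can also be checked numerically on a small joint distribution (an illustrative check, not part of the original problem): for a random positive joint $p(x_1, x_2)$, the Gibbs move that resamples $x_1$ from $p(x_1|x_2)$ satisfies detailed balance exactly.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2))  # arbitrary positive joint distribution over (x1, x2)
p /= p.sum()

# Gibbs kernel for resampling x1 with x2 fixed: T(x1_new | x2) = p(x1_new | x2)
cond = p / p.sum(axis=0)  # cond[x1, x2] = p(x1 | x2)

# detailed balance: p(x1, x2) p(x1_new | x2) == p(x1_new, x2) p(x1 | x2)
for x2 in range(2):
    for x1 in range(2):
        for x1_new in range(2):
            lhs = p[x1, x2] * cond[x1_new, x2]
            rhs = p[x1_new, x2] * cond[x1, x2]
            assert np.isclose(lhs, rhs)
print("Gibbs moves satisfy detailed balance")
```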
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;
&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Parameter learning in probabilistic graphical models</title><link>https://example.com/blog/pgm-param-learning/</link><pubDate>Sun, 07 Jan 2024 00:00:00 +0000</pubDate><guid>https://example.com/blog/pgm-param-learning/</guid><description>&lt;h1 id="parameter-learning-in-bayesian-networks-and-markov-random-fields"&gt;Parameter learning in Bayesian networks and Markov random fields&lt;/h1&gt;
&lt;h2 id="cost-of-learning-crf-parameters"&gt;Cost of learning CRF parameters&lt;/h2&gt;
&lt;p&gt;Consider the process of gradient-ascent training for a conditional random field (CRF) log-linear model with $k$ features, given a data set $\mathcal{D}$ with $M$ instances.
Assume for simplicity that the cost of computing a single feature over a single instance in our data set is constant, as is the cost of computing the expected value of each feature once we compute a marginal over the variables in its scope.
Also assume that we can compute each required marginal in constant time after we have a calibrated clique tree.
(Clique tree calibration is a smart way of reusing messages in the message passing algorithm for calculating marginals on a graphical model, but all you need to know is that once we finish the clique tree calibration, each required marginal can be computed in constant time).&lt;/p&gt;
&lt;p&gt;Assume that we use clique tree calibration to compute the expected sufficient statistics in this model, and that the cost of running clique tree calibration is $c$.
Assume that we need $r$ iterations for the gradient process to converge.
(We are using a batch algorithm, so each iteration means using all the data to calculate the gradients once.) What is the cost of this procedure? (Choose one of the following answers and explain your rationale.)&lt;/p&gt;
&lt;h3 id="solution"&gt;Solution&lt;/h3&gt;
&lt;p&gt;A CRF is an MRF where variables $\{X_i\}$ are observed and variables $\{Y_i\}$ are hidden, and is structured to model the conditional distribution $P(Y|X)$.
Equation (20.7) from Koller &amp;amp; Friedman&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt; shows us the form of the gradient of the conditional log-likelihood:
$$
\newcommand{\EE}{\mathbb{E}}
\newcommand{\ind}{\mathbb{1}}
\newcommand{\answertext}[1]{\textcolor{Green}{\fbox{#1}}}
\newcommand{\answer}[1]{\answertext{$#1$}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\comment}[1]{\textcolor{gray}{\textrm{#1}}}
\newcommand{\Prec}{\Sigma^{-1}}
\newcommand{\inv}[1]{\frac{1}{#1}}
\newcommand{\abs}[1]{\lvert{#1}\rvert}
\newcommand{\norm}[1]{\lVert{#1}\rVert}
\newcommand{\lr}[1]{\left(#1\right)}
\newcommand{\lrb}[1]{\left[#1\right]}
\newcommand{\lrbr}[1]{\lbrace#1\rbrace}
\newcommand{\Bx}[0]{\mathbf{x}}
\begin{aligned}
\frac{\partial}{\partial \theta_i} \ell_{Y \mid X}(\theta: \mathcal{D}) &amp;= \sum_{m=1}^M\left(f_i(y[m], x[m])-\EE_{\theta}\left[f_i \mid x[m]\right]\right) \\
&amp;= \EE_\mathcal{D} f_i(y, x)-\sum_{m=1}^M\EE_{\theta}\left[f_i \mid x[m]\right]
\end{aligned}
$$
&lt;/p&gt;
&lt;p&gt;The left term contributes $\mathcal{O}(Mk)$ to the computation since it only needs to be done once, for each of $M$ samples and each of $k$ features (one feature per possible $x,y$ pair in the basic case where each feature $f$ is an indicator function $f_{x^i,y^j}(x,y)=\ind(x=x^i)\ind(y=y^j)$).
It does not need to be done on every iteration since it does not depend on $\theta$.&lt;/p&gt;
&lt;p&gt;The right term, however, does need to be done on each of $r$ iterations, summing over $M$ samples each time.
Each sample requires separate calibration of the clique tree $\lr{\mathcal{O}(c)}$ and computation of the expectation of each of $k$ features, yielding $\mathcal{O}(rM(c+k))$.
It is worth noting that the clique tree calibration cost $c$ and the feature count $k$ are both less than in an equivalent MRF, since conditioning on observations $X$ essentially reduces the size of the graph.
Let&amp;rsquo;s refer to those reduced costs as $c'$ and $k'$, respectively.&lt;/p&gt;
&lt;p&gt;Thus, the total computational cost of the algorithm is $\answer{\mathcal{O}\lr{Mk + rM\lr{c'+k'}}}$.&lt;/p&gt;
&lt;h2 id="cost-of-learning-mrf-parameters"&gt;Cost of learning MRF parameters&lt;/h2&gt;
&lt;p&gt;Consider the process of gradient-ascent training for a log-linear Markov random field (MRF) model with $k$ features, given a data set $\mathcal{D}$ with $M$ instances.
Assume for simplicity that the cost of computing a single feature over a single instance in our data set is constant, as is the cost of computing the expected value of each feature once we compute a marginal over the variables in its scope.
Also assume that we can compute each required marginal in constant time after we have a calibrated clique tree.&lt;/p&gt;
&lt;p&gt;Assume that we use clique tree calibration to compute the expected sufficient statistics in this model and that the cost of doing this is $c$.
Also, assume that we need $r$ iterations for the gradient process to converge.
What is the cost of this procedure?&lt;/p&gt;
&lt;h3 id="solution-1"&gt;Solution&lt;/h3&gt;
&lt;p&gt;From eq. (20.4) of Koller &amp;amp; Friedman&lt;sup id="fnref1:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;, we have the gradient calculation&lt;/p&gt;
$$
\frac{\partial}{\partial \theta_i} \frac{1}{M} \ell(\theta: \mathcal{D})=\mathbb{E}_{\mathcal{D}}\left[f_i(\mathcal{X})\right]-\mathbb{E}_{\theta}\left[f_i\right].
$$
&lt;p&gt;Again, the left term costs $\mathcal{O}(Mk)$, since it is computed just once by summing over samples and features.&lt;/p&gt;
&lt;p&gt;The right term, however, costs $\mathcal{O}(r(c+k))$ since it is run on each of $r$ iterations and involves $c$ for calibration of the clique tree and $k$ for computing the expectation of each feature.&lt;/p&gt;
&lt;p&gt;The total cost is thus $\answer{\mathcal{O}(Mk + r(c+k))}$.&lt;/p&gt;
&lt;h2 id="parameter-learning-in-mns-vs-bns"&gt;Parameter learning in MNs vs BNs&lt;/h2&gt;
&lt;p&gt;Compared to learning parameters in Bayesian networks, learning in Markov networks is generally&lt;/p&gt;
&lt;ol type="A"&gt;
&lt;li&gt;Less difficult as we do not need to account for the directed nature of factors as we do in a Bayes Net.&lt;/li&gt;
&lt;li&gt;More difficult because factors in MNs need not sum up to 1.&lt;/li&gt;
&lt;li&gt;Equally difficult, though MN inference will be better by a constant factor difference in the computation time as we do not need to worry about directionality.&lt;/li&gt;
&lt;li&gt;More difficult because we cannot use parallel optimization of subparts of our likelihood as we often can in BN learning.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="solution-2"&gt;Solution&lt;/h3&gt;
&lt;p&gt;$\answer{\textrm{D}}$: The fact that parameters are coupled together by the computation of the normalization constant means we cannot parallelize parameter learning as easily.&lt;/p&gt;
&lt;h1 id="maximum-likelihood-training-of--minimizes-kl-divergence-with-empirical-distribution"&gt;Maximum-likelihood training of $q$ minimizes KL divergence with empirical distribution&lt;/h1&gt;
&lt;p&gt;For a distribution $p(x, c)$ and an approximation $q(x, c)$, show that when $p(x, c)$ corresponds to the empirical distribution, finding the $q$ that minimizes the Kullback-Leibler divergence&lt;/p&gt;
$$ KL(p(x,c) \parallel q(x,c)) $$&lt;p&gt;corresponds to maximum likelihood training of $q$ assuming i.i.d. data.&lt;/p&gt;
&lt;h2 id="solution-3"&gt;Solution&lt;/h2&gt;
&lt;p&gt;The empirical distribution is defined as $p(x,c)=m(x,c)/M$ where $m(x,c)$ is the number of times $(x,c)$ appears in the data and $M$ is the total number of samples.
The likelihood is then
&lt;/p&gt;
$$L(\theta: D) = p(D; \theta) = \prod_m q(x[m], c[m]),$$&lt;p&gt;
which is maximized by maximizing the log-likelihood:
$$
\begin{aligned}
\argmax{\theta} L(\theta: D) &amp;= \argmax{\theta} \ell(\theta: D) \\
&amp;= \argmax{\theta} \sum_m \log q(x[m], c[m]) \\
&amp;= \argmax{\theta} \sum_{x,c} m(x,c) \log q(x,c) \\
&amp;= \argmin{\theta} - \sum_{x,c} m(x,c) \log q(x,c) \\
&amp;= \argmin{\theta} - \sum_{x,c} \frac{m(x,c)}{M} \log q(x,c) \\
&amp;= \argmin{\theta} - \sum_{x,c} p(x,c) \log q(x,c) \\
&amp;= \argmin{\theta} \sum_{x,c} p(x,c) (-\log q(x,c)) \\
&amp;= \argmin{\theta} \sum_{x,c} p(x,c) (\log p(x,c) - \log q(x,c)) &amp;\comment{since $p$ doesn't depend on $\theta$} \\
&amp;= \argmin{\theta} \sum_{x,c} p(x,c) \log \frac{p(x,c)}{q(x,c)} \\
&amp;= \answer{D_{KL}(p(x,c) \parallel q(x,c))} &amp; \blacksquare \\
\end{aligned}
$$
&lt;/p&gt;
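&lt;p&gt;The equivalence is easy to confirm numerically (an illustrative check, not part of the original problem): for any model $q$, $KL(p \parallel q)$ and the average negative log-likelihood differ only by the entropy of the empirical distribution $p$, which is constant in $q$, so the two objectives share the same minimizer.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
M = 1000
data = rng.integers(4, size=M)      # i.i.d. draws of a 4-valued variable
m = np.bincount(data, minlength=4)  # counts m(x)
p = m / M                           # empirical distribution p(x) = m(x) / M

def avg_nll(q):
    """Average negative log-likelihood: -(1/M) sum_m log q(x[m])."""
    return -(m * np.log(q)).sum() / M

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return (p * (np.log(p) - np.log(q))).sum()

q = rng.random(4)
q /= q.sum()                        # some model distribution q
entropy = -(p * np.log(p)).sum()

# KL(p || q) and the average NLL differ only by H(p), a constant in q,
# so minimizing one minimizes the other
print(kl(p, q), avg_nll(q) - entropy)
```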
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;
&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Learning maximum likelihood tree structure with the Chow-Liu algorithm</title><link>https://example.com/blog/learning-tree/</link><pubDate>Fri, 05 Jan 2024 15:31:21 -0500</pubDate><guid>https://example.com/blog/learning-tree/</guid><description>&lt;p&gt;Write a function &lt;code&gt;ChowLiu(X) -&amp;gt; A&lt;/code&gt;, where &lt;code&gt;X&lt;/code&gt; is a D by N data matrix containing one multivariate data point in each column, that returns a Chow-Liu maximum likelihood tree for &lt;code&gt;X&lt;/code&gt;.
The tree structure is to be returned in the sparse matrix A.&lt;/p&gt;
&lt;!-- You may find the routine spantree.m in “+brml” package useful. --&gt;
&lt;p&gt;The file &lt;code&gt;ChowLiuData.mat&lt;/code&gt; contains a data matrix for 10 variables.
Use your function to find the maximum-likelihood Chow-Liu tree, and draw a picture of the resulting DAG with edges oriented away from variable 1.&lt;/p&gt;
&lt;h2 id="solution"&gt;Solution&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;loadmat&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loadmat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ChowLiuData.mat&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;X&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;data&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;(1000, 10)
array([[0, 1, 1, ..., 2, 1, 0],
[1, 1, 1, ..., 0, 2, 0],
[0, 0, 1, ..., 2, 1, 1],
...,
[1, 1, 1, ..., 0, 2, 0],
[1, 1, 0, ..., 0, 0, 0],
[1, 0, 1, ..., 2, 1, 0]], dtype=uint8)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Chow-Liu algorithm simply computes the mutual information $I(X_i,X_j)$ between each pair of variables and constructs a maximum spanning tree where the edge weights are the mutual information.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;networkx&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;nx&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chow_liu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# num discrete values each X can take&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# compute empirical p(x_i, x_j)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p_pair&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# and empirical p(x_i)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p_single&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p_single&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p_pair&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p_pair&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p_pair&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p_single&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allclose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_pair&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;p_single&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allclose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_single&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# compute all I_ij = I(X_i, X_j)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;errstate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;divide&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ignore&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invalid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ignore&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;logterm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_pair&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_single&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_single&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# set these to constants, will be nullified assuming they come from&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# instances when p_pair=0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_pair&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logterm&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;logterm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logterm&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# from log0 - log0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_pair&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isinf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logterm&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;logterm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isinf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logterm&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# from log0 - ...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_pair&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;logterm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# compute max spanning tree&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;G&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Graph&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_nodes_from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;maximum_spanning_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# get directions from breadth-first search&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfs_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chow_liu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;draw_planar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;with_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;img src="index_files/figure-markdown_strict/cell-3-output-1.png" width="787" height="499" /&gt;
&lt;p&gt;The edge directions are an artifact of the breadth-first search; the learned structure is an undirected tree (a Markov random field).&lt;/p&gt;</description></item><item><title>Expectation-maximization for a Markov chain mixture model</title><link>https://example.com/blog/markov-chain-mixture/</link><pubDate>Fri, 12 May 2023 00:00:00 +0000</pubDate><guid>https://example.com/blog/markov-chain-mixture/</guid><description>&lt;p&gt;Assume that a sequence $v_1,\ldots,v_T \in \{1,\dots,V\}$ is generated by a Markov chain.
For a single chain of length $T$, we have
&lt;/p&gt;
$$
p(v_1,\dots,v_T) = p(v_1)\prod_{t=1}^{T-1} p(v_{t+1}|v_t)
\newcommand{\EE}{\mathbb{E}}
\newcommand{\ind}{\mathbb{1}}
\newcommand{\answertext}[1]{\textcolor{Green}{\fbox{#1}}}
\newcommand{\answer}[1]{\answertext{$#1$}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\comment}[1]{\textcolor{gray}{\textrm{#1}}}
\newcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\inv}[1]{\frac{1}{#1}}
\newcommand{\abs}[1]{\lvert{#1}\rvert}
\newcommand{\norm}[1]{\lVert{#1}\rVert}
\newcommand{\lr}[1]{\left(#1\right)}
\newcommand{\lrb}[1]{\left[#1\right]}
\newcommand{\lrbr}[1]{\lbrace#1\rbrace}
\newcommand{\Bx}[0]{\mathbf{x}}
$$&lt;p&gt;For simplicity, we denote the sequence of visible variables as
&lt;/p&gt;
$$ \vec v = \lr{v_1,\dots,v_T} $$&lt;p&gt;For a single Markov chain labelled by $h$,
&lt;/p&gt;
$$ p(\vec v|h) = p(v_1|h)\prod_{t=1}^{T-1}p(v_{t+1}|v_t,h) $$&lt;p&gt;In total there are $H$ such Markov chains, $h=1,\dots,H$.
The distribution on the visible variables is therefore
&lt;/p&gt;
$$ p(\vec v) = \sum_{h=1}^H p(\vec v|h)p(h) $$&lt;h2 id="deriving-em-update-equations"&gt;Deriving EM update equations&lt;/h2&gt;
&lt;p&gt;There is a set of $N$ training sequences, $\vec v^n,\, n=1,\dots,N$.
Assuming that each sequence $\vec v^n$ is independently and identically drawn from a Markov chain mixture model with $H$ components, derive the Expectation Maximization algorithm for training this model.&lt;/p&gt;
&lt;h3 id="solution"&gt;Solution&lt;/h3&gt;
&lt;h4 id="e-step"&gt;E step&lt;/h4&gt;
&lt;p&gt;We need to compute the posterior distribution of the hidden variable $h$ given the current parameters, which is equivalent to the expected &amp;ldquo;count&amp;rdquo; of each chain $h$ for each sample $n$.
These counts are the sufficient statistics for the categorical distribution $p(h|v^n,\theta)$.
During the M step we hold this distribution fixed with respect to the parameters being optimized, so we write $p(h|v^n,\theta)=q^n(h)=\tau_h^n$:
&lt;/p&gt;
$$
\begin{aligned}
\tau_h^n &amp;= \EE_{p(h | v^n,\theta)} [\ind(h^n=h)] \\\\
&amp;= p(h | v^n,\theta) \\\\
&amp;= \frac{p(h, v^n | \theta)}{p(v^n|\theta)} \\\\
&amp;= \frac{p(h, v^n | \theta)}{\sum_h p(h, v^n|\theta)} \\\\
&amp;= \answer{\frac{p(h)p(v_1^n|h)\prod_{t=2}^{T}p(v_t^n|v_{t-1}^n,h)}
{\sum_h p(h)p(v_1^n|h)\prod_{t=2}^{T}p(v_t^n|v_{t-1}^n,h)}} \\\\
\end{aligned}
$$&lt;p&gt;At each iteration, these counts are recomputed and used to take expectations over $q^n(h)$.
These expectations yield the expected complete-data log likelihood, which is maximized in the M step.&lt;/p&gt;
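To make the boxed responsibility formula concrete, here is a minimal NumPy sketch of the E step. The parameters `theta`, `pi`, `A` and all sizes are random placeholders (not fitted values); the indexing mirrors the vectorized computation used in the solution code further below.

```python
import numpy as np

rng = np.random.default_rng(0)
H, K, T, N = 2, 4, 5, 3  # chains, symbols, sequence length, sequences

# hypothetical toy parameters, each normalized as a distribution
theta = rng.random(H)
theta /= theta.sum()                              # p(h)
pi = rng.random((K, H))
pi /= pi.sum(axis=0)                              # pi[k, h] = p(v_1 = k | h)
A = rng.random((K, K, H))
A /= A.sum(axis=1, keepdims=True)                 # A[i, j, h] = p(v_t = j | v_{t-1} = i, h)

seqs = rng.integers(0, K, size=(N, T))            # integer-coded sequences v^n

# joint p(h, v^n) = p(h) p(v_1^n | h) prod_t p(v_t^n | v_{t-1}^n, h)
p_hv = theta * pi[seqs[:, 0], :] * np.prod(A[seqs[:, :-1], seqs[:, 1:], :], axis=1)
# E step: responsibilities tau^n_h = p(h | v^n), normalized over h
tau = p_hv / p_hv.sum(axis=1, keepdims=True)
assert tau.shape == (N, H)
```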
&lt;h4 id="m-step"&gt;M step&lt;/h4&gt;
&lt;p&gt;In the M step, we maximize the expected complete data log likelihood $f(\theta)$ with respect to $\theta$:
&lt;/p&gt;
$$
\begin{aligned}
f(\theta)
&amp;= \EE_{h \sim q} \log p(v^1,\dots,v^N,h^1,\dots,h^N|\theta) \\\\
&amp;= \EE_{h \sim q} \sum_n \log p(v^n,h|\theta) \\\\
&amp;= \sum_n \EE_{h \sim q^n} \log p(v^n,h|\theta) \\\\
&amp;= \sum_n \sum_h q^n(h) \log p(v^n,h|\theta) \\\\
&amp;= \sum_{n,h} \tau_h^n \log p(v^n,h|\theta) \\\\
&amp;= \sum_{n,h} \tau_h^n \lrb{\log p(h) + \log p(v_1^n|h) + \sum_{t=2}^T \log p(v_t^n|v_{t-1}^n,h)} \\\\
&amp;= \sum_{n,h} \tau_h^n \lrb{\log \theta_h + \log \pi_{v_1^n|h} + \sum_{t=2}^T \log a_{v_t^n|v_{t-1}^n,h}} \\\\
\end{aligned}
$$&lt;p&gt;Here $\theta_h$, $\pi_{k|h}$, and $a_{j|i,h}$ denote the chain prior, initial-state, and transition probabilities, respectively.
We now maximize $f(\theta)$ to derive the update step for each.&lt;/p&gt;
$$
\newcommand{\pdd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pdfd}[1]{\pdd{f}{#1}}
\newcommand{\pdld}[1]{\pdd{\mathcal{L}}{#1}}
$$&lt;h5 id="update"&gt;$\theta_h$ update&lt;/h5&gt;
&lt;p&gt;We need to introduce a Lagrange multiplier to enforce the constraint that $\sum_h \theta_h = 1$, and set $\pdld{\theta_h}=\pdld{\lambda}=0$:
&lt;/p&gt;
$$ \hat{\theta}_h = \argmax{\theta_h} \mathcal{L}(\theta,\lambda) $$&lt;p&gt;
&lt;/p&gt;
$$ \mathcal{L}(\theta,\lambda) = f(\theta) - \lambda \lr{\sum_h \theta_h - 1} $$&lt;p&gt;Then we set $\pdld{\theta_h}=0$:
$$ \pdld{\theta_h} = \pdd{}{\theta_h} \lrb{\sum_{n} \tau_h^n \log \theta_h - \lambda \theta_h} = 0 $$
$$ \lambda = \frac{\sum_n \tau_h^n}{\hat{\theta}_h} \qquad \hat{\theta}_h = \frac{\sum_n \tau_h^n}{\lambda}$$
&lt;/p&gt;
&lt;p&gt;How do we get $\lambda$?
We know that $\sum_h \theta_h = 1$, so we sum top and bottom over $h$:
$$ \lambda = \frac{\sum_{n,h} \tau_h^n}{\sum_h \hat{\theta}_h} = \sum_{n,h} \tau_h^n = N $$
&lt;/p&gt;
&lt;p&gt;Hence,
&lt;/p&gt;
$$ \answer{\hat{\theta}_h = \frac{\sum_n \tau_h^n}{N}} $$&lt;h5 id="update-1"&gt;$\pi_{k|h}$ update&lt;/h5&gt;
&lt;p&gt;We follow the same pattern as before, using a Lagrange multiplier:&lt;/p&gt;
$$ \hat{\pi}_{k|h} = \argmax{\pi_{k|h}} \mathcal{L}(\theta,\lambda) $$
$$ \mathcal{L}(\theta,\lambda) = f(\theta) - \lambda \lr{\sum_{k} \pi_{k|h} - 1} $$
$$ \pdld{\pi_{k|h}} = \pdd{}{\pi_{k|h}} \lrb{\sum_{n} \tau_h^n \ind(v_1^n=k) \log \pi_{k|h} - \lambda \pi_{k|h}} = 0 $$
$$ \lambda = \frac{\sum_n \tau_h^n \ind(v_1^n=k) }{\hat{\pi}_{k|h}} \qquad \hat{\pi}_{k|h} = \frac{\sum_n \tau_h^n\ind(v_1^n=k) }{\lambda}$$
&lt;p&gt;As before, we use the constraint $\sum_k \hat{\pi}_{k|h} = 1$ and sum over $k$ in the numerator and denominator to solve for $\lambda$, noting that $\sum_k \ind(v_1^n=k) = 1$.
$$ \lambda = \frac{\sum_{n,k} \tau_h^n \ind(v_1^n=k) }{\sum_k \hat{\pi}_{k|h}} = \sum_{n,k} \tau_h^n \ind(v_1^n=k) = \sum_n \tau_h^n $$
$$ \answer{\hat{\pi}_{k|h} = \frac{\sum_{n} \tau_h^n \ind(v_1^n=k)}{\sum_{n} \tau_h^n}} $$
&lt;/p&gt;
&lt;h5 id="update-2"&gt;$a_{j|i,h}$ update&lt;/h5&gt;
&lt;p&gt;The process is again similar to what we did before:&lt;/p&gt;
$$
\newcommand{\transprob}{a_{j|i,h}}
\newcommand{\transprobhat}{\hat{a}_{j|i,h}}
$$
$$ \transprobhat = \argmax{\transprob} \mathcal{L}(\theta,\lambda) $$
$$ \mathcal{L}(\theta,\lambda) = f(\theta) - \lambda \lr{\sum_{j} \transprob - 1} $$
$$ \pdld{\transprob} = \pdd{}{\transprob} \lrb{\sum_{n} \tau_h^n \sum_{t=2}^T \ind(v_t^n=j,v_{t-1}^n=i) \log\transprob - \lambda \transprob} = 0 $$
$$
\begin{aligned}
\lambda &amp;= \frac{\sum_n \tau_h^n \sum_{t=2}^T \ind(v_t^n=j,v_{t-1}^n=i)}{\transprobhat} \\\\
&amp;= \sum_n \tau_h^n \sum_{t=2}^T \sum_j \ind(v_t^n=j,v_{t-1}^n=i) \\\\
&amp;= \sum_n \tau_h^n \sum_{t=2}^T \ind(v_{t-1}^n=i) \\\\
\end{aligned}
$$
$$
\answer{
\transprobhat = \frac{\sum_n \tau_h^n \sum_{t=2}^T \ind(v_t^n=j, v_{t-1}^n=i)} {\sum_n \tau_h^n \sum_{t=2}^T \ind(v_{t-1}^n=i)}
}
$$
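Putting the three boxed updates together, here is a minimal NumPy sketch of the M step. It assumes integer-coded sequences `seqs` of shape `(N, T)` and responsibilities `tau` of shape `(N, H)` (filled with random placeholder values here, since the fitted values come from the E step); a small smoothing constant guards against division by zero for unseen predecessor states.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, K, H = 6, 8, 4, 2
seqs = rng.integers(0, K, size=(N, T))                 # v^n_t, integer-coded symbols
tau = rng.random((N, H))
tau /= tau.sum(axis=1, keepdims=True)                  # placeholder tau^n_h

# theta_h = (sum_n tau^n_h) / N
theta = tau.sum(axis=0) / N

# pi_{k|h} proportional to sum_n tau^n_h 1[v^n_1 = k]
first = np.eye(K)[seqs[:, 0]]                          # (N, K) one-hot of v^n_1
pi = first.T @ tau                                     # (K, H) weighted counts
pi /= pi.sum(axis=0, keepdims=True)

# a_{j|i,h} proportional to sum_n tau^n_h sum_t 1[v^n_t = j, v^n_{t-1} = i]
counts = np.full((K, K, H), 1e-12)                     # tiny smoothing for unseen i
for n in range(N):
    for t in range(1, T):
        counts[seqs[n, t - 1], seqs[n, t], :] += tau[n]
A = counts / counts.sum(axis=1, keepdims=True)         # normalize over j
```

Each update is the constrained maximizer derived above: weighted counts, normalized over the constrained index.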
&lt;h2 id="python-code-and-application-to-biological-sequences"&gt;Python code and application to biological sequences&lt;/h2&gt;
&lt;p&gt;The file &lt;code&gt;sequences.mat&lt;/code&gt; contains a set of fictitious bio-sequences in a cell array &lt;code&gt;sequences{n}(t)&lt;/code&gt;.
Thus &lt;code&gt;sequences{3}(:)&lt;/code&gt; is the third sequence, &lt;code&gt;GTCTCCTGCCCTCTCTGAAC&lt;/code&gt;, which consists of 20 timesteps.
There are 20 such sequences in total.
Your task is to cluster these sequences into two clusters, assuming that each cluster is modelled by a Markov chain.
State which of the sequences belong together by assigning a sequence $\mathbf{v}^n$ to that state for which $p(h|\mathbf{v}^n)$ is highest.&lt;/p&gt;
&lt;p&gt;Your solution must print and report the two clusters&amp;rsquo; members (show which of the 20 sequences from the file &lt;code&gt;sequences.mat&lt;/code&gt; are assigned to Cluster 1 and Cluster 2).&lt;/p&gt;
&lt;h3 id="solution-1"&gt;Solution&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;loadmat&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;seqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loadmat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sequences.mat&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sequences&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;seqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seqs&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seqs&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seqs&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seqs&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;T&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;seqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;seqs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;array([[1, 0, 3, 0, 2, 2, 1, 0, 3, 3, 1, 3, 0, 3, 2, 3, 2, 1, 3, 2],
[1, 1, 0, 2, 3, 3, 0, 1, 2, 2, 0, 1, 2, 1, 1, 2, 0, 0, 0, 2],
[3, 2, 2, 0, 0, 1, 1, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[2, 3, 1, 3, 1, 1, 3, 2, 1, 1, 1, 3, 1, 3, 1, 3, 2, 0, 0, 1],
[2, 3, 2, 1, 1, 3, 2, 2, 0, 1, 1, 3, 2, 0, 0, 0, 0, 2, 1, 1],
[1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 3, 1, 1, 2, 2, 2, 0, 0, 1, 2],
[0, 0, 0, 2, 3, 2, 1, 3, 1, 3, 2, 0, 0, 0, 0, 1, 3, 1, 0, 1],
[0, 1, 0, 3, 2, 0, 0, 1, 3, 0, 1, 0, 3, 0, 2, 3, 0, 3, 0, 0],
[2, 3, 3, 2, 2, 3, 1, 0, 2, 1, 0, 1, 0, 1, 2, 2, 0, 1, 3, 2],
[1, 1, 3, 1, 1, 1, 1, 3, 1, 1, 1, 1, 3, 3, 3, 1, 1, 3, 2, 1],
[1, 0, 1, 3, 0, 1, 2, 2, 1, 3, 0, 1, 1, 3, 2, 2, 2, 1, 0, 0],
[1, 2, 2, 3, 1, 1, 2, 3, 1, 1, 2, 0, 2, 2, 1, 0, 1, 3, 1, 2],
[3, 0, 0, 2, 3, 2, 3, 1, 1, 3, 1, 3, 2, 1, 3, 1, 1, 3, 0, 0],
[1, 0, 1, 1, 0, 3, 1, 0, 1, 1, 1, 3, 3, 2, 1, 3, 0, 0, 2, 2],
[0, 0, 0, 2, 0, 0, 1, 3, 1, 1, 1, 1, 3, 1, 1, 1, 3, 2, 1, 1],
[1, 0, 0, 0, 3, 2, 1, 1, 3, 1, 0, 1, 2, 1, 2, 3, 1, 3, 1, 0],
[2, 1, 1, 0, 0, 2, 1, 0, 2, 2, 2, 3, 1, 3, 1, 0, 0, 1, 3, 3],
[1, 0, 3, 2, 2, 0, 1, 3, 2, 1, 3, 1, 1, 0, 1, 0, 0, 0, 2, 2],
[0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 1, 1, 3, 0, 0, 2],
[2, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 2, 3, 1, 1, 3, 2, 2, 2, 3]])
&lt;/code&gt;&lt;/pre&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;em_cluster&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pi&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# i, j, h&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allclose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;iters_no_change&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;h_hat_prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;tau_prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;tau&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tau_prev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;iters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# while iters_no_change &amp;lt; 10:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allclose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tau_prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tau&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;tau_prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tau&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;iters&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# E-step&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# h n x h n x h&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p_hv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;p_hv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;tau&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p_hv&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;p_hv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;h_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tau&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# M-step&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tau&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tau&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tau&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;tau&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# stopping condition&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h_hat_prev&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;h_hat&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;iters_no_change&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;iters_no_change&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;h_hat_prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h_hat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;converged in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt; iterations&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;likelihood&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;h_hat&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;h_hat&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;h_hat&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;logl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;likelihood&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;h_hat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logl&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;h_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;em_cluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;h_hat&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;converged in 11 iterations
(array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]),
-497.82024107176716)
&lt;/code&gt;&lt;/pre&gt;
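&lt;p&gt;For longer sequences, the product of per-step transition probabilities in the E-step can underflow to zero before normalization. A log-space variant of the E-step avoids this; the sketch below is not part of the assignment code, and assumes the same array shapes as above: mixture weights &lt;code&gt;theta&lt;/code&gt; of shape &lt;code&gt;(H,)&lt;/code&gt;, first-symbol probabilities &lt;code&gt;pi&lt;/code&gt; of shape &lt;code&gt;(K, H)&lt;/code&gt;, transitions &lt;code&gt;A&lt;/code&gt; of shape &lt;code&gt;(K, K, H)&lt;/code&gt;, and integer sequences &lt;code&gt;seqs&lt;/code&gt; of shape &lt;code&gt;(N, T)&lt;/code&gt;.&lt;/p&gt;

```python
import numpy as np

def e_step_log(theta, pi, A, seqs):
    """Responsibilities tau[n, h] computed in log space to avoid underflow.

    Assumed shapes (matching the EM code above): theta (H,), pi (K, H),
    A (K, K, H), seqs (N, T) of integer symbols.
    """
    log_p = (np.log(theta)
             + np.log(pi[seqs[:, 0], :])
             + np.log(A[seqs[:, :-1], seqs[:, 1:], :]).sum(axis=1))
    # subtract the per-row max before exponentiating (log-sum-exp trick)
    log_p -= log_p.max(axis=1, keepdims=True)
    tau = np.exp(log_p)
    return tau / tau.sum(axis=1, keepdims=True)
```

&lt;p&gt;Normalizing in the log domain gives the same &lt;code&gt;tau&lt;/code&gt; as the direct product for short sequences, but stays finite as $T$ grows.&lt;/p&gt;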
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# repeat and keep highest likelihood model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;best_logl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;h_hat_i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logl_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;em_cluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;logl_i&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;best_logl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;h_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h_hat_i&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;best_logl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logl_i&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;best log likelihood: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;best_logl&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;clusters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Cluster &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h_hat&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;converged in 12 iterations
converged in 26 iterations
converged in 63 iterations
converged in 21 iterations
converged in 8 iterations
converged in 11 iterations
converged in 9 iterations
converged in 12 iterations
converged in 17 iterations
converged in 27 iterations
converged in 20 iterations
converged in 11 iterations
converged in 12 iterations
converged in 19 iterations
converged in 10 iterations
converged in 25 iterations
converged in 14 iterations
converged in 25 iterations
converged in 24 iterations
converged in 24 iterations
best log likelihood: -483.6486774197766
Cluster 1: 1 2 6 8 9 11 12 14 16 17 18
Cluster 2: 3 4 5 7 10 13 15 19 20
&lt;/code&gt;&lt;/pre&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# print sequences&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ACGT&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seqs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;Cluster 1:
CATAGGCATTCTATGTGCTG
CCAGTTACGGACGCCGAAAG
CGGCCGCGCCTCCGGGAACG
ACATGAACTACATAGTATAA
GTTGGTCAGCACACGGACTG
CACTACGGCTACCTGGGCAA
CGGTCCGTCCGAGGCACTCG
CACCATCACCCTTGCTAAGG
CAAATGCCTCACGCGTCTCA
GCCAAGCAGGGTCTCAACTT
CATGGACTGCTCCACAAAGG
Cluster 2:
TGGAACCTTAAAAAAAAAAA
GTCTCCTGCCCTCTCTGAAC
GTGCCTGGACCTGAAAAGCC
AAAGTGCTCTGAAAACTCAC
CCTCCCCTCCCCTTTCCTGC
TAAGTGTCCTCTGCTCCTAA
AAAGAACTCCCCTCCCTGCC
AAAAAAACGAAAAACCTAAG
GCGTAAAAAAAGTCCTGGGT
&lt;/code&gt;&lt;/pre&gt;</description></item><item><title>Learning edge direction in a Bayesian network model</title><link>https://example.com/blog/bn-structure/</link><pubDate>Tue, 09 May 2023 00:00:00 +0000</pubDate><guid>https://example.com/blog/bn-structure/</guid><description>
&lt;p&gt;Here we discuss a method for learning the direction of an edge in a belief network.
Consider a distribution
&lt;/p&gt;
$$ P(x,y | \theta,M_{y\to x}) = P(x|y,\theta_{x|y})P(y|\theta_y) $$&lt;p&gt;where $\theta$ are the parameters of the conditional probability tables.
For a prior $p(\theta)=p(\theta_{x|y})p(\theta_y)$ and i.i.d.
data $\mathcal{D}= \lbrace x^n,y^n,n=1,\ldots,N\rbrace$, the marginal likelihood of the data is&lt;/p&gt;
$$ p(\mathcal{D}|M_{y\to x}) = \int_\theta p(\theta_{x|y})p(\theta_y) \prod_n p(x^n|y^n,\theta_{x|y})p(y^n|\theta_y)\,d\theta $$&lt;p&gt;For binary variables $x\in \lbrace 0,1\rbrace,y\in\lbrace 0,1\rbrace$,&lt;/p&gt;
$$ p(y=1|\theta_y) = \theta_y, \qquad p(x=1|y,\theta_{x|y})=\theta_{1|y} $$&lt;p&gt;and Beta distribution priors&lt;/p&gt;
$$
p(\theta_y) = B(\theta_y|\alpha,\beta),\qquad
p(\theta_{1|y}) = B(\theta_{1|y}|\alpha_{1|y},\beta_{1|y}), \qquad
p(\theta_{1|1},\theta_{1|0}) = p(\theta_{1|1})p(\theta_{1|0}).
$$&lt;h2 id="expanding-the-formula-for"&gt;Expanding the formula for $P(\mathcal{D}|M_{y\to x})$&lt;/h2&gt;
&lt;p&gt;Find a formula to estimate&lt;/p&gt;
$$ P(\mathcal{D}|M_{y\to x}) $$&lt;p&gt;in terms of parameters of Beta priors and the actual counts in training data.&lt;/p&gt;
&lt;svg width="768" height="480" viewbox="0.00 0.00 98.00 116.00" xmlns="http://www.w3.org/2000/svg" xlink="http://www.w3.org/1999/xlink" style="; max-width: none; max-height: none"&gt;
&lt;g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 112)"&gt;
&lt;title&gt;
G
&lt;/title&gt;
&lt;polygon fill="white" stroke="transparent" points="-4,4 -4,-112 94,-112 94,4 -4,4"&gt;&lt;/polygon&gt;
&lt;!-- x --&gt;
&lt;g id="node1" class="node"&gt;
&lt;title&gt;
x
&lt;/title&gt;
&lt;ellipse fill="white" stroke="black" cx="18" cy="-90" rx="18" ry="18"&gt;&lt;/ellipse&gt;
&lt;text text-anchor="middle" x="18" y="-85.8" font-family="Times,serif" font-size="14.00" fill="#000000"&gt;x&lt;/text&gt;
&lt;/g&gt;
&lt;!-- y --&gt;
&lt;g id="node2" class="node"&gt;
&lt;title&gt;
y
&lt;/title&gt;
&lt;ellipse fill="white" stroke="black" cx="18" cy="-18" rx="18" ry="18"&gt;&lt;/ellipse&gt;
&lt;text text-anchor="middle" x="18" y="-13.8" font-family="Times,serif" font-size="14.00" fill="#000000"&gt;y&lt;/text&gt;
&lt;/g&gt;
&lt;!-- x&amp;#45;&amp;gt;y --&gt;
&lt;g id="edge1" class="edge"&gt;
&lt;title&gt;
x-\&gt;y
&lt;/title&gt;
&lt;path fill="none" stroke="black" d="M18,-71.7C18,-63.98 18,-54.71 18,-46.11"&gt;&lt;/path&gt;
&lt;polygon fill="black" stroke="black" points="21.5,-46.1 18,-36.1 14.5,-46.1 21.5,-46.1"&gt;&lt;/polygon&gt;
&lt;/g&gt;
&lt;!-- x1 --&gt;
&lt;g id="node3" class="node"&gt;
&lt;title&gt;
x1
&lt;/title&gt;
&lt;ellipse fill="white" stroke="black" cx="72" cy="-18" rx="18" ry="18"&gt;&lt;/ellipse&gt;
&lt;text text-anchor="middle" x="72" y="-13.8" font-family="Times,serif" font-size="14.00" fill="#000000"&gt;x&lt;/text&gt;
&lt;/g&gt;
&lt;!-- y1 --&gt;
&lt;g id="node4" class="node"&gt;
&lt;title&gt;
y1
&lt;/title&gt;
&lt;ellipse fill="white" stroke="black" cx="72" cy="-90" rx="18" ry="18"&gt;&lt;/ellipse&gt;
&lt;text text-anchor="middle" x="72" y="-85.8" font-family="Times,serif" font-size="14.00" fill="#000000"&gt;y&lt;/text&gt;
&lt;/g&gt;
&lt;!-- y1&amp;#45;&amp;gt;x1 --&gt;
&lt;g id="edge2" class="edge"&gt;
&lt;title&gt;
y1-\&gt;x1
&lt;/title&gt;
&lt;path fill="none" stroke="black" d="M72,-71.7C72,-63.98 72,-54.71 72,-46.11"&gt;&lt;/path&gt;
&lt;polygon fill="black" stroke="black" points="75.5,-46.1 72,-36.1 68.5,-46.1 75.5,-46.1"&gt;&lt;/polygon&gt;
&lt;/g&gt;
&lt;/g&gt;
&lt;/svg&gt;
&lt;p&gt;Figure 1: Two alternative structures for a belief network&lt;/p&gt;
&lt;h3 id="solution"&gt;Solution&lt;/h3&gt;
$$
\begin{aligned}
P(\mathcal{D}|M_{y\to x}) &amp;= \int_\theta p(\theta_y)p(\theta_{x|y})\prod_n p(y^n|\theta_y)p(x^n|y^n,\theta_{x|y})\,d\theta \\
&amp;= \int_\theta \left(
\begin{aligned}
&amp;\frac{\theta_{y}^{\alpha-1}\left(1-\theta_{y}\right)^{\beta-1}}{B(\alpha,\beta)}
\frac{\theta_{x|1}^{\alpha_{x|1}-1}\left(1-\theta_{x|1}\right)^{\beta_{x|1}-1}}{B(\alpha_{x|1},\beta_{x|1})}
\frac{\theta_{x|0}^{\alpha_{x|0}-1}\left(1-\theta_{x|0}\right)^{\beta_{x|0}-1}}{B(\alpha_{x|0},\beta_{x|0})} \\
&amp;*\theta_{y}^{N_{y=1}}\left(1-\theta_{y}\right)^{N_{y=0}}
\theta_{x|1}^{N_{1|y=1}}\left(1-\theta_{x|1}\right)^{N_{0|y=1}}
\theta_{x|0}^{N_{1|y=0}}\left(1-\theta_{x|0}\right)^{N_{0|y=0}} \\
\end{aligned}
\right) d\theta \\
&amp;= \frac{1}{B(\alpha,\beta)\,B(\alpha_{x|1},\beta_{x|1})\,B(\alpha_{x|0},\beta_{x|0})}\left(\begin{aligned}
&amp;\int_{\theta_y}\theta_{y}^{N_{y=1}+\alpha-1}\left(1-\theta_{y}\right)^{N_{y=0}+\beta-1}d\theta_y \\
&amp;*\int_{\theta_{x|1}}\theta_{x|1}^{N_{1|y=1}+\alpha_{x|1}-1}\left(1-\theta_{x|1}\right)^{N_{0|y=1}+\beta_{x|1}-1}d{\theta_{x|1}} \\
&amp;*\int_{\theta_{x|0}}\theta_{x|0}^{N_{1|y=0}+\alpha_{x|0}-1}\left(1-\theta_{x|0}\right)^{N_{0|y=0}+\beta_{x|0}-1}d{\theta_{x|0}} \\
\end{aligned}\right) \\
&amp;= \textcolor{Green}{\fbox{$\dfrac{B(\alpha+N_{y=1},\beta+N_{y=0})}{B(\alpha,\beta)}\,\dfrac{B(\alpha_{x|1}+N_{1|y=1},\beta_{x|1}+N_{0|y=1})}{B(\alpha_{x|1},\beta_{x|1})}\,\dfrac{B(\alpha_{x|0}+N_{1|y=0},\beta_{x|0}+N_{0|y=0})}{B(\alpha_{x|0},\beta_{x|0})}$}} \\
\end{aligned}
$$
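&lt;p&gt;Each factor in the last step is a standard Beta integral, $\int_0^1 \theta^{a-1}(1-\theta)^{b-1}\,d\theta = B(a,b)$. A quick numerical sanity check of this identity, with made-up counts and hyperparameters (the values below are purely illustrative):&lt;/p&gt;

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta as B

# illustrative values only: hypothetical hyperparameters and counts
alpha, beta_, N1, N0 = 2.0, 3.0, 4, 6
# integrate theta^(alpha+N1-1) (1-theta)^(beta+N0-1) over [0, 1]
numeric, _ = quad(lambda t: t**(alpha + N1 - 1) * (1 - t)**(beta_ + N0 - 1), 0, 1)
assert np.isclose(numeric, B(alpha + N1, beta_ + N0))
```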
&lt;h2 id="expanding-the-formula-for-1"&gt;Expanding the formula for $P(\mathcal{D}|M_{x\to y})$&lt;/h2&gt;
&lt;p&gt;Now derive a similar expression as in part 1 for the model with the edge direction reversed,&lt;/p&gt;
$$ P(\mathcal{D}|M_{x\to y}) $$&lt;p&gt;assuming the same values (as in part 1, except that the role of $x$ and $y$ are switched) for all hyperparameters of this reverse model.&lt;/p&gt;
&lt;h3 id="solution-1"&gt;Solution&lt;/h3&gt;
&lt;p&gt;Same as above, but switching $x$ and $y$:&lt;/p&gt;
$$
\begin{aligned}
P(\mathcal{D}|M_{x\to y}) &amp;= \textcolor{Green}{\fbox{$\dfrac{B(\alpha+N_{x=1},\beta+N_{x=0})}{B(\alpha,\beta)}\,\dfrac{B(\alpha_{y|1}+N_{1|x=1},\beta_{y|1}+N_{0|x=1})}{B(\alpha_{y|1},\beta_{y|1})}\,\dfrac{B(\alpha_{y|0}+N_{1|x=0},\beta_{y|0}+N_{0|x=0})}{B(\alpha_{y|0},\beta_{y|0})}$}} \\\\
\end{aligned}
$$&lt;h2 id="deriving-the-bayes-factor"&gt;Deriving the Bayes factor&lt;/h2&gt;
&lt;p&gt;Using the above, derive a simple expression for the Bayes factor, the ratio of the probability found in part 1 to that found in part 2:&lt;/p&gt;
$$ \frac{P(\mathcal{D}|M_{y\to x})}{P(\mathcal{D}|M_{x\to y})} $$&lt;h3 id="solution-2"&gt;Solution&lt;/h3&gt;
&lt;p&gt;Since both models use the same hyperparameter values, the Beta-function normalization constants of the priors cancel in the ratio, leaving&lt;/p&gt;
$$
\begin{aligned}
&amp;\frac{P(\mathcal{D}|M_{y\to x})}{P(\mathcal{D}|M_{x\to y})} = \\\\
&amp;\quad\textcolor{Green}{\fbox{$\frac{
B(\alpha+N_{y=1},\beta+N_{y=0})B(\alpha_{x|1}+N_{1|y=1},\beta_{x|1}+N_{0|y=1})B(\alpha_{x|0}+N_{1|y=0},\beta_{x|0}+N_{0|y=0})
}{
B(\alpha+N_{x=1},\beta+N_{x=0})B(\alpha_{y|1}+N_{1|x=1},\beta_{y|1}+N_{0|x=1})B(\alpha_{y|0}+N_{1|x=0},\beta_{y|0}+N_{0|x=0})
}$}}
\end{aligned}
$$&lt;h2 id="leveraging-priors-to-choose-the-more-deterministic-model"&gt;Leveraging priors to choose the more deterministic model&lt;/h2&gt;
&lt;p&gt;By choosing appropriate hyperparameters (i.e., beta priors), give a numerical example that illustrates how to encode the heuristic that if the table $p(x|y)p(y)$ is &amp;lsquo;more deterministic&amp;rsquo; than $p(y|x)p(x)$, then we should prefer the model $M_{y\to x}$.
(By &amp;lsquo;more deterministic&amp;rsquo; we mean that, given $y$, the outcome of $x$ is nearly certain, e.g., certain to be zero.)&lt;/p&gt;
&lt;p&gt;Compute the Bayes factor for the situation&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;y=0&lt;/th&gt;
&lt;th&gt;y=1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;x=0&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;x=1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Then compare with the Bayes factor obtained in the same situation using a uniform prior for all parameters instead.&lt;/p&gt;
&lt;h3 id="solution-3"&gt;Solution&lt;/h3&gt;
&lt;p&gt;We use the same hyperparameters for both models (e.g., the same $\alpha$ and $\beta$ for both $p(y)$ and $p(x)$), for a fair, uninformed comparison.
Using $\alpha_{a|b}=\beta_{a|b}&lt;1/2$ corresponds to favoring $\theta_{a|b}$ close to 0 or 1, that is, more deterministic models.
$\alpha=\beta=\frac{1}{2}$ is the Jeffreys prior, often regarded as the natural &amp;lsquo;uninformative&amp;rsquo; prior, favoring neither deterministic nor entropic distributions.&lt;/p&gt;
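&lt;p&gt;As an aside, when counts are large the Beta-function values themselves can underflow, so the Bayes factor is safer to evaluate in the log domain. A sketch of this (an alternative to the direct computation below, and assuming for brevity a single shared hyperparameter pair $(\alpha,\beta)$ for every parameter, so the prior normalizers cancel):&lt;/p&gt;

```python
import numpy as np
from scipy.special import betaln

def log_bayes_factor(n00, n01, n10, n11, a=0.5, b=0.5):
    """log P(D | M_{y to x}) minus log P(D | M_{x to y}).

    n01 is the count of samples with x=0, y=1, and so on; every Beta prior
    uses the same (a, b), so the prior normalizers cancel in the ratio.
    """
    Ny1, Ny0 = n01 + n11, n00 + n10
    Nx1, Nx0 = n10 + n11, n00 + n01
    log_yx = (betaln(a + Ny1, b + Ny0)
              + betaln(a + n11, b + n01)   # x given y=1
              + betaln(a + n10, b + n00))  # x given y=0
    log_xy = (betaln(a + Nx1, b + Nx0)
              + betaln(a + n11, b + n10)   # y given x=1
              + betaln(a + n01, b + n00))  # y given x=0
    return log_yx - log_xy
```

&lt;p&gt;For the table above, &lt;code&gt;np.exp(log_bayes_factor(10, 10, 0, 0))&lt;/code&gt; gives the Bayes factor directly.&lt;/p&gt;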
&lt;p&gt;This image from Wikipedia illustrates the effect of different priors on the beta distribution:&lt;/p&gt;
&lt;img src="beta-priors.png" id="fig-priors" style="background:white;" alt="Figure 2: source: Wikipedia" /&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.special&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bayes_factor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xy00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xy01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xy10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xy11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta00&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xy00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy01&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy11&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Ny1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xy01&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy11&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Ny0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xy00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy10&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Nx1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xy10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy11&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Nx0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xy00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy01&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Ny1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Ny0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Nx1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Nx0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha11&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta01&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy01&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha11&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta01&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy00&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;xy00&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bayes_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alphabeta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alphabeta_cond&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bayes_factor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alphabeta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alphabeta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alphabeta_cond&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alphabeta_cond&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alphabeta_cond&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alphabeta_cond&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isclose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rtol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bayes factor = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;, neither model preferred&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bayes factor = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;, My-&amp;gt;x preferred&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bayes factor = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;, Mx-&amp;gt;y preferred&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;bayes_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;Bayes factor = 1.4689932412843798, My-&amp;gt;x preferred
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is as expected.&lt;/p&gt;
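&lt;p&gt;For reference, here is a sketch of the quantity the code computes (standard Beta&amp;ndash;Bernoulli marginal likelihoods; the prior normalizers $B(\alpha,\beta)$ cancel between the two models because we use the same hyperparameters). Writing $N_{ab}$ for the number of samples with $x=a, y=b$, and $\alpha,\beta$ ($\alpha',\beta'$) for the marginal (conditional) hyperparameters,&lt;/p&gt;
$$
\frac{p(\mathcal{D} \mid M_{y \to x})}{p(\mathcal{D} \mid M_{x \to y})}
= \frac{B(\alpha + N_{y=1}, \beta + N_{y=0})}{B(\alpha + N_{x=1}, \beta + N_{x=0})}
\cdot \frac{B(\alpha' + N_{11}, \beta' + N_{01})}{B(\alpha' + N_{11}, \beta' + N_{10})}
\cdot \frac{B(\alpha' + N_{10}, \beta' + N_{00})}{B(\alpha' + N_{01}, \beta' + N_{00})}
$$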
&lt;p&gt;We get a stronger preference for $M_{y\to x}$ if we use a uniform rather than a Jeffreys prior for the marginal hyperparameters $\alpha,\beta$, since the empirical marginal of $y$ is more entropic (uniform) than that of $x$ (there are no samples with $x=1$).
This is arguably the prior we should use to make our decision, since it expresses no preference between stochastic and deterministic marginals of $x$ and $y$.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;bayes_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;Bayes factor = 5.932237498634842, My-&amp;gt;x preferred
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can confirm that a Jeffreys prior on the conditional parameters (keeping the uniform prior on the marginals) indeed prefers neither model:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;bayes_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;Bayes factor = 1.0, neither model preferred
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we use the Jeffreys prior for both the parent and the conditional parameters, $M_{x\to y}$ is preferred, since that model better balances stochastic and deterministic component distributions on this data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;bayes_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;Bayes factor = 0.24762886543609094, Mx-&amp;gt;y preferred
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, using uniform priors as requested, $M_{x\to y}$ is preferred, as we would expect:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;bayes_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre&gt;&lt;code&gt;Bayes factor = 0.17355371900826447, Mx-&amp;gt;y preferred
&lt;/code&gt;&lt;/pre&gt;</description></item></channel></rss>
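&lt;p&gt;As a final sanity check (our own arithmetic, not part of the original solution): with uniform $\mathrm{Beta}(1,1)$ priors the Beta functions reduce to factorials, and the Bayes factor above works out to exactly $21/121 \approx 0.1736$.&lt;/p&gt;

```python
from math import factorial

def B(a, b):
    # Beta function for positive-integer arguments: B(a, b) = (a-1)!(b-1)!/(a+b-1)!
    return factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1)

# Data: 10 samples with (x=0, y=0) and 10 with (x=0, y=1); Beta(1, 1) priors throughout.
marginal = B(1 + 10, 1 + 10) / B(1 + 0, 1 + 20)   # p(y) term over p(x) term
cond_y1  = B(1 + 0, 1 + 10) / B(1 + 0, 1 + 0)     # p(x|y=1) term over p(y|x=1) term
cond_y0  = B(1 + 0, 1 + 10) / B(1 + 10, 1 + 10)   # p(x|y=0) term over p(y|x=0) term

bf = marginal * cond_y1 * cond_y0
print(bf)  # 21/121 = 0.17355371900826447
```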