Learning edge direction in a Bayesian network model

May 9, 2023 · 4 min read
blog coursework

Our interest here is to discuss a method to learn the direction of an edge in a belief network. Consider a distribution

$$ P(x,y | \theta,M_{y\to x}) = P(x|y,\theta_{x|y})P(y|\theta_y) $$

where $\theta$ are the parameters of the conditional probability tables. For a prior $p(\theta)=p(\theta_{x|y})p(\theta_y)$ and i.i.d. data $\mathcal{D}= \lbrace x^n,y^n,n=1,...,N\rbrace$, the likelihood of the data is

$$ p(\mathcal{D}|M_{y\to x}) = \int_\theta p(\theta_{x|y})p({\theta_y}) \prod_n p(x^n|y^n,\theta_{x|y})p(y^n|\theta_y) $$

For binary variables $x\in \lbrace 0,1\rbrace,y\in\lbrace 0,1\rbrace$,

$$ p(y=1|\theta_y) = \theta_y, \qquad p(x=1|y,\theta_{x|y})=\theta_{1|y} $$

and Beta distribution priors

$$ p(\theta_y) = B(\theta_y|\alpha,\beta),\qquad p(\theta_{1|y}) = B(\theta_{1|y}|\alpha_{1|y},\beta_{1|y}), \qquad p(\theta_{1|1}\theta_{1|0} = p(\theta_{1|1})p(\theta_{1|0}). $$

Expanding the formula for $P(\mathcal{D}|M_{y\to x})$

Find a formula to estimate

$$ P(\mathcal{D}|M_{y\to x}) $$

in terms of parameters of Beta priors and the actual counts in training data.

Gxxyyx-\>yx1xy1yy1-\>x1

Figure 1: Two alternative structures for a belief network

Solution

$$ \begin{aligned} P(\mathcal{D}|M_{y\to x}) &= \int_\theta p(\theta_y)p(\theta_{x|y})\prod_n p(y^n|\theta_y)p(x^n|y^n,\theta_{x|y})d\theta \\ &= \int_\theta \left( \begin{aligned} &B(\alpha_{y},\beta_{y})\theta_{y}^{\alpha-1}\left(1-\theta_{y}\right)^{\beta-1} B(\alpha_{x|1},\beta_{x|1})\theta_{x|1}^{\alpha_{x|1}-1}\left(1-\theta_{x|1}\right)^{\beta_{x|1}-1} B(\alpha_{x|0},\beta_{x|0})\theta_{x|0}^{\alpha_{x|0}-1}\left(1-\theta_{x|0}\right)^{\beta_{x|0}-1} \\ &*\theta_{y}^{N_{y=1}}\left(1-\theta_{y}\right)^{N_{y=0}} \theta_{1|1}^{N_{1|y=1}}\left(1-\theta_{1|1}\right)^{N_{0|y=1}} \theta_{1|0}^{N_{1|y=0}}\left(1-\theta_{1|0}\right)^{N_{0|y=0}} \\ \end{aligned} \right) d\theta \\ &= \int_\theta \left( \begin{aligned} &B(\alpha_{y},\beta_{y})\theta_{y}^{N_{y=1}+\alpha-1}\left(1-\theta_{y}\right)^{N_{y=0}+\beta-1} \\ &B(\alpha_{x|1},\beta_{x|1})\theta_{x|1}^{N_{1|y=1}+\alpha_{x|1}-1}\left(1-\theta_{x|1}\right)^{N_{0|y=1}+\beta_{x|1}-1} \\ &B(\alpha_{x|0},\beta_{x|0})\theta_{x|0}^{N_{1|y=0}+\alpha_{x|0}-1}\left(1-\theta_{x|0}\right)^{N_{0|y=0}+\beta_{x|0}-1} \\ \end{aligned} \right) d\theta \\ &= \left(\begin{aligned} &\int_{\theta_y}B(\alpha_{y},\beta_{y})\theta_{y}^{N_{y=1}+\alpha-1}\left(1-\theta_{y}\right)^{N_{y=0}+\beta-1}d\theta_y \\ &*\int_{\theta_{x|1}}B(\alpha_{x|1},\beta_{x|1})\theta_{x|1}^{N_{1|y=1}+\alpha_{x|1}-1}\left(1-\theta_{x|1}\right)^{N_{0|y=1}+\beta_{x|1}-1}d{\theta_{x|1}} \\ &*\int_{\theta_{x|0}}B(\alpha_{x|0},\beta_{x|0})\theta_{x|0}^{N_{1|y=0}+\alpha_{x|0}-1}\left(1-\theta_{x|0}\right)^{N_{0|y=0}+\beta_{x|0}-1}d{\theta_{x|0}} \\ \end{aligned}\right) \\ &= \textcolor{Green}{\fbox{$B(\alpha+N_{y=1},\beta+N_{y=0})B(\alpha_{x|1}+N_{1|y=1},\beta_{x|1}+N_{0|y=1})B(\alpha_{x|0}+N_{1|y=0},\beta_{x|0}+N_{0|y=0})$}} \\ \end{aligned} $$

Expanding the formula for $P(\mathcal{D}|M_{x\to y})$

Now derive a similar expression as in part 1 for the model with the edge direction reversed,

$$ P(\mathcal{D}|M_{x\to y}) $$

assuming the same values (as in part 1, except that the role of $x$ and $y$ are switched) for all hyperparameters of this reverse model.

Solution

Same as above, but switching $x$ and $y$:

$$ \begin{aligned} P(\mathcal{D}|M_{x\to y}) &= \textcolor{Green}{\fbox{$B(\alpha+N_{x=1},\beta+N_{x=0})B(\alpha_{y|1}+N_{1|x=1},\beta_{y|1}+N_{0|x=1})B(\alpha_{y|0}+N_{1|x=0},\beta_{y|0}+N_{0|x=0})$}} \\\\ \end{aligned} $$

Deriving the Bayes factor

Using the above, derive a simple expression for the Bayes’ factor, which computes the ratio of probability you found in part 1 to probability in part 2:

$$ \frac{P(\mathcal{D}|M_{y\to x})}{P(\mathcal{D}|M_{x\to y})} $$

Solution

$$ \begin{aligned} &\frac{P(\mathcal{D}|M_{y\to x})}{P(\mathcal{D}|M_{x\to y})} = \\\\ &\quad\textcolor{Green}{\fbox{$\frac{ B(\alpha+N_{y=1},\beta+N_{y=0})B(\alpha_{x|1}+N_{1|y=1},\beta_{x|1}+N_{0|y=1})B(\alpha_{x|0}+N_{1|y=0},\beta_{x|0}+N_{0|y=0}) }{ B(\alpha+N_{x=1},\beta+N_{x=0})B(\alpha_{y|1}+N_{1|x=1},\beta_{y|1}+N_{0|x=1})B(\alpha_{y|0}+N_{1|x=0},\beta_{y|0}+N_{0|x=0}) }$}} \end{aligned} $$

Leveraging priors to choose the more deterministic model

By choosing appropriate hyperparameters (i.e., beta priors), give a numerical example that illustrates how to encode the heuristic that if the table $p(x|y)p(y)$ is ‘more deterministic’ than $p(y|x)p(x)$, then we should prefer the model $M_{y\to x}$. (Note that when we say is ‘more deterministic’ that implies that given $y$, the outcome of $x$ is quite certain (e.g., to be zero).)

Compute the Bayes factor for the situation

y=0y=1
x=01010
x=100

And also compare the resulting Bayes factor in the above situation but use a uniform prior for all priors instead.

Solution

We use the same hyperparameters for both models (e.g., the same $\alpha$ and $\beta$ for both $p(y)$ and $p(x)$), for a fair, uninformed comparison. Using $\alpha_{a|b}=\beta_{a|b}<1/2$ corresponds to favoring $\theta_{a|b}$ close to 0 or 1, that is, more deterministic models. $\alpha=\beta=\frac{1}{2}$ is what is called the Jeffreys prior, and is considered the truly ‘uninformed’ prior, favoring neither deterministic or entropic distributions.

This image from Wikipedia illustrates the effect of different priors on the beta distribution:

Figure 2: source: Wikipedia
import numpy as np
from scipy.special import beta as B, gamma as G

def bayes_factor(xy00, xy01, xy10, xy11, alpha1, beta1, alpha11, beta01, alpha10, beta00):
    N = xy00 + xy01 + xy10 + xy11
    Ny1 = xy01 + xy11
    Ny0 = xy00 + xy10
    Nx1 = xy10 + xy11
    Nx0 = xy00 + xy01
    res = B(alpha1 + Ny1, beta1 + Ny0) / B(alpha1 + Nx1, beta1 + Nx0) \
        * B(alpha11 + xy11, beta01 + xy01) / B(alpha11 + xy11, beta01 + xy10) \
        * B(alpha10 + xy10, beta00 + xy00) / B(alpha10 + xy01, beta00 + xy00)
    return res

def bayes_test(alphabeta, alphabeta_cond):
    res = bayes_factor(10, 10, 0, 0, alphabeta, alphabeta, alphabeta_cond, alphabeta_cond, alphabeta_cond, alphabeta_cond)
    if np.isclose(res, 1, rtol=1e-3):
      print(f'Bayes factor = {res}, neither model preferred')
    elif res > 1:
      print(f'Bayes factor = {res}, My->x preferred')
    elif res < 1:
      print(f'Bayes factor = {res}, Mx->y preferred')

bayes_test(.5, .2)
Bayes factor = 1.4689932412843798, My->x preferred

This is as expected.

We get a stronger preference for $M_{y\to x}$ if we use a uniform rather than a Jeffreys prior for $\alpha,\beta$, since the marginal distribution of $y$ is more entropic (uniform) than that of $x$ (there are no samples with $x=1$). This is what we should use to make our decision, since it means we don’t care about whether $\theta$ produces stochastic or deterministic marginal $x$ and $y$.

bayes_test(1, .2)
Bayes factor = 5.932237498634842, My->x preferred

We can confirm that the Jeffreys prior does indeed prefer neither model:

bayes_test(1, .5)
Bayes factor = 1.0, neither model preferred

If we use the Jeffreys prior for both parent and conditional parameters, $M_{x\to y}$ is preferred since it better balances stochastic and deterministic distributions.

bayes_test(.5, .5)
Bayes factor = 0.24762886543609094, Mx->y preferred

Finally, using uniform priors as requested, $M_{x\to y}$ is preferred, as we would expect:

bayes_test(1, 1)
Bayes factor = 0.17355371900826447, Mx->y preferred
Kyle Johnsen
Authors
PhD Candidate
Biomedical Engineering

I’m a PhD student advised by Chris Rozell at Georgia Tech, developing advanced closed-loop optogenetic control techniques for neuroscience. Specifically, I am applying optimal, model-based control to precisely perturb population dynamics.

I’m fascinated by the biological computational principles and how we can exploit them for next-generation AI/ML technologies. And vice-versa: how we can apply modern ML to neurotech/neuroscience, especially for building life- and time-saving brain-computer interfaces. I am looking for an industry research scientist role post-graduation.