Parameter learning in probabilistic graphical models
Taken from coursework for ECE 7751: Graphical Models in Machine Learning, taught by Faramarz Fekri at Georgia Tech, Spring 2023
Parameter learning in Bayesian networks and Markov random fields
Cost of learning CRF parameters
Consider the process of gradient-ascent training for a conditional random field (CRF) log-linear model with $k$ features, given a data set $\mathcal{D}$ with $M$ instances. Assume for simplicity that the cost of computing a single feature over a single instance in our data set is constant, as is the cost of computing the expected value of each feature once we compute a marginal over the variables in its scope. Also assume that we can compute each required marginal in constant time after we have a calibrated clique tree. (Clique tree calibration is a smart way of reusing messages in the message passing algorithm for calculating marginals on a graphical model, but all you need to know is that once we finish the clique tree calibration, each required marginal can be computed in constant time).
Assume that we use clique tree calibration to compute the expected sufficient statistics in this model, and that the cost of running clique tree calibration is $c$. Assume that we need $r$ iterations for the gradient process to converge. (We are using a batch algorithm, so each iteration means using all the data to calculate the gradients once.) What is the cost of this procedure? (Choose one of the following answers and explain your rationale.)
Solution
A CRF is an MRF in which the variables $\{X_i\}$ are observed and the variables $\{Y_i\}$ are hidden, and it is structured to model the conditional distribution $P(Y \mid X)$. Equation (20.7) from Koller & Friedman shows us the form of the gradient of the conditional log-likelihood: $$ \newcommand{\EE}{\mathbb{E}} \newcommand{\ind}{\mathbb{1}} \newcommand{\answertext}[1]{\textcolor{Green}{\fbox{#1}}} \newcommand{\answer}[1]{\answertext{$#1$}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\comment}[1]{\textcolor{gray}{\textrm{#1}}} \newcommand{\Prec}{\Sigma^{-1}} \newcommand{\inv}[1]{\frac{1}{#1}} \newcommand{\abs}[1]{\lvert{#1}\rvert} \newcommand{\norm}[1]{\lVert{#1}\rVert} \newcommand{\lr}[1]{\left(#1\right)} \newcommand{\lrb}[1]{\left[#1\right]} \newcommand{\lrbr}[1]{\lbrace#1\rbrace} \newcommand{\Bx}[0]{\mathbf{x}} \begin{aligned} \frac{\partial}{\partial \theta_i} \ell_{Y \mid X}(\theta: \mathcal{D}) &= \sum_{m=1}^M\left(f_i(y[m], x[m])-\EE_{\theta}\left[f_i \mid x[m]\right]\right) \\ &= M\,\EE_\mathcal{D}\left[f_i(y, x)\right]-\sum_{m=1}^M\EE_{\theta}\left[f_i \mid x[m]\right] \end{aligned} $$
The left term contributes $\mathcal{O}(Mk)$ to the computation since it only needs to be evaluated once, for each of the $M$ samples and each of the $k$ features (in the basic case there is one feature per possible $(x, y)$ pair, i.e., each feature is an indicator function $f_{x^i,y^j}(x,y)=\ind(x=x^i)\ind(y=y^j)$). It does not need to be recomputed on every iteration since it does not depend on $\theta$.
The right term, however, does need to be recomputed on each of the $r$ iterations, summing over the $M$ samples each time. Each sample requires a separate calibration of the clique tree $\lr{\mathcal{O}(c)}$ and computation of the expectation of each of the $k$ features, yielding $\mathcal{O}(rM(c+k))$. It is worth noting that the clique tree calibration cost $c$ and the feature count $k$ are both smaller than in an equivalent MRF, since conditioning on the observations $X$ effectively reduces the size of the graph. Let's refer to those reduced costs as $c'$ and $k'$, respectively.
Thus, the total computational cost of the algorithm is $\answer{\mathcal{O}\lr{Mk + rM\lr{c'+k'}}}$.
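To make the counting concrete, here is a minimal sketch in Python that mirrors the structure of the batch gradient-ascent loop and simply tallies the dominant operations. The function and its arguments are our own illustration (not from any library), with `c_prime` and `k_prime` standing in for the reduced costs $c'$ and $k'$.

```python
# Operation-counting sketch for batch CRF gradient ascent. The loop structure
# mirrors the analysis above: the empirical counts are computed once, while the
# model expectations require a per-instance clique-tree calibration on every
# iteration.

def crf_training_cost(M: int, k: int, r: int, c_prime: int, k_prime: int) -> int:
    cost = M * k                  # empirical feature counts, computed once
    for _ in range(r):            # r gradient iterations
        for _ in range(M):        # each instance m, conditioned on x[m]
            cost += c_prime       # calibrate the (reduced) clique tree for this instance
            cost += k_prime       # one constant-time expectation per (reduced) feature
    return cost

# Sanity check against the closed form O(Mk + rM(c' + k')) for arbitrary sizes.
assert crf_training_cost(100, 20, 50, 30, 15) == 100 * 20 + 50 * 100 * (30 + 15)
```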
Cost of learning MRF parameters
Consider the process of gradient-ascent training for a log-linear Markov random field (MRF) model with $k$ features, given a data set $\mathcal{D}$ with $M$ instances. Assume for simplicity that the cost of computing a single feature over a single instance in our data set is constant, as is the cost of computing the expected value of each feature once we compute a marginal over the variables in its scope. Also assume that we can compute each required marginal in constant time after we have a calibrated clique tree.
Assume that we use clique tree calibration to compute the expected sufficient statistics in this model and that the cost of doing this is $c$. Also, assume that we need $r$ iterations for the gradient process to converge. What is the cost of this procedure?
Solution
From eq. (20.4) of Koller & Friedman, we have the gradient calculation
$$ \frac{\partial}{\partial \theta_i} \frac{1}{M} \ell(\theta: \mathcal{D})=\mathbb{E}_{\mathcal{D}}\left[f_i(\mathcal{X})\right]-\mathbb{E}_{\theta}\left[f_i\right]. $$

Again, the left term costs $\mathcal{O}(Mk)$, since it is computed just once by summing over samples and features.
The right term, however, costs $\mathcal{O}(r(c+k))$: it must be recomputed on each of the $r$ iterations, and each iteration requires $c$ for a single calibration of the clique tree plus $k$ for computing the expectation of each feature. Unlike the CRF case, $\mathbb{E}_\theta[f_i]$ is not conditioned on a particular data instance, so one calibration per iteration suffices rather than one per sample.
The total cost is thus $\answer{\mathcal{O}(Mk + r(c+k))}$.
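For comparison, the analogous operation-counting sketch for the MRF case (again with made-up names, in the same spirit as the CRF sketch above) shows where the per-sample factor disappears: each iteration needs only a single calibration.

```python
# Operation-counting sketch for batch MRF gradient ascent: one clique-tree
# calibration per iteration (not per instance), then one expectation per feature.

def mrf_training_cost(M: int, k: int, r: int, c: int) -> int:
    cost = M * k                  # empirical feature counts, computed once
    for _ in range(r):            # r gradient iterations
        cost += c                 # a single calibration per iteration
        cost += k                 # one constant-time expectation per feature
    return cost

# Sanity check against the closed form O(Mk + r(c + k)).
assert mrf_training_cost(100, 20, 50, 30) == 100 * 20 + 50 * (30 + 20)
```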
Parameter learning in MNs vs BNs
Compared to learning parameters in Bayesian networks, learning in Markov networks is generally
- (A) Less difficult, as we do not need to account for the directed nature of factors as we do in a Bayesian network.
- (B) More difficult, because factors in MNs need not sum to 1.
- (C) Equally difficult, though MN inference will be faster by a constant factor in computation time, as we do not need to worry about directionality.
- (D) More difficult, because we cannot use parallel optimization of subparts of our likelihood as we often can in BN learning.
Solution
$\answer{\textrm{D}}$: The fact that parameters are coupled together by the computation of the normalization constant means we cannot parallelize parameter learning as easily.
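To see the coupling concretely, compare the two log-likelihoods (writing the MN as a log-linear model with features $f_i$, as in the previous questions). The BN log-likelihood decomposes into independent local terms, one per CPD, while every MN parameter appears inside the partition function $Z(\theta)$: $$ \begin{aligned} \ell_{\mathrm{BN}}(\theta: \mathcal{D}) &= \sum_i \sum_{m=1}^M \log P\lr{x_i[m] \mid pa_{X_i}[m] : \theta_{X_i \mid \mathrm{Pa}_{X_i}}}, \\ \ell_{\mathrm{MN}}(\theta: \mathcal{D}) &= \sum_{m=1}^M \sum_i \theta_i f_i(x[m]) - M \log Z(\theta). \end{aligned} $$ Each BN term can be maximized separately (and in parallel), but the shared $\log Z(\theta)$ term couples all of the MN parameters.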
Maximum-likelihood training of $q$ minimizes KL divergence from the empirical distribution
For a distribution $p(x, c)$ and an approximation $q(x, c)$, show that when $p(x, c)$ corresponds to the empirical distribution, finding the $q(x, c)$ that minimizes the Kullback-Leibler divergence
$$ KL(p(x,c) \parallel q(x,c)) $$
corresponds to maximum-likelihood training of $q$ assuming i.i.d. data.
Solution
The empirical distribution is defined as $p(x,c)=m(x,c)/M$, where $m(x,c)$ is the number of times $(x,c)$ appears in the data and $M$ is the total number of samples. The likelihood is then $$L(\theta: D) = p(D; \theta) = \prod_m q(x[m], c[m]),$$ which is maximized by maximizing the log-likelihood: $$ \begin{aligned} \argmax{\theta} L(\theta: D) &= \argmax{\theta} \ell(\theta: D) \\ &= \argmax{\theta} \sum_m \log q(x[m], c[m]) \\ &= \argmax{\theta} \sum_{x,c} m(x,c) \log q(x,c) \\ &= \argmin{\theta} - \sum_{x,c} m(x,c) \log q(x,c) \\ &= \argmin{\theta} - \sum_{x,c} \frac{m(x,c)}{M} \log q(x,c) &\comment{dividing by $M > 0$ does not change the minimizer} \\ &= \argmin{\theta} \sum_{x,c} p(x,c) \lr{-\log q(x,c)} \\ &= \argmin{\theta} \sum_{x,c} p(x,c) \lr{\log p(x,c) - \log q(x,c)} &\comment{adding $\sum_{x,c} p(x,c)\log p(x,c)$, which doesn't depend on $\theta$} \\ &= \argmin{\theta} \sum_{x,c} p(x,c) \log \frac{p(x,c)}{q(x,c)} \\ &= \answer{\argmin{\theta} D_{KL}\lr{p(x,c) \parallel q(x,c)}} & \blacksquare \\ \end{aligned} $$
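As a quick numerical sanity check of this equivalence, note that the average negative log-likelihood under $q$ equals $H(p) + D_{KL}(p \parallel q)$, where the entropy $H(p)$ does not depend on $\theta$. The short script below (a sketch on made-up categorical data, with the pair $(x,c)$ flattened into a single index) verifies that identity numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: the joint (x, c) is flattened into a single categorical variable
# with 6 states; the samples and the model q are made up for illustration.
states = 6
data = rng.integers(0, states, size=1000)            # i.i.d. samples
counts = np.bincount(data, minlength=states)
p = counts / counts.sum()                             # empirical distribution p(x, c)

q = rng.dirichlet(np.ones(states))                    # some model distribution q(x, c)

avg_nll = -np.mean(np.log(q[data]))                   # -(1/M) * sum_m log q(x[m], c[m])
entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))        # H(p), constant in theta
kl = np.sum(p[p > 0] * np.log(p[p > 0] / q[p > 0]))   # D_KL(p || q)

# -(1/M) log-likelihood = H(p) + KL(p || q), so maximizing the likelihood
# over q's parameters is the same as minimizing the KL divergence.
assert np.isclose(avg_nll, entropy + kl)
```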