Approximate Inference: The Big Picture
- Variational Inference
- Mean field (inner approximation)
- Loopy BP (outer approximation)
- Kikuchi and variants (tighter outer approximation)
- Expectation Propagation (KL in the other direction, $ KL(P || Q) $)
- Sampling
- Monte Carlo
- Sequential Monte Carlo
- Markov Chain Monte Carlo
- Hybrid Monte Carlo
Variational Methods
“Variational”: optimization-based formulations.
- Express the quantity of interest as the solution to an optimization problem.
- Approximate the desired solution by relaxing or approximating the intractable optimization problem.
Examples
- Courant-Fischer for eigenvalues
$$
\lambda_{\max} (A) = \max_{\lVert x \rVert_2 = 1} x^T A x
$$
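A quick numpy check of this characterization (the random symmetric matrix below is an illustrative choice, not from the notes): every unit vector's Rayleigh quotient $ x^T A x $ lower-bounds $ \lambda_{\max} $, with equality at the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = (B + B.T) / 2                                # random symmetric matrix

eigvals, eigvecs = np.linalg.eigh(A)             # eigenvalues in ascending order
lam_max, v_max = eigvals[-1], eigvecs[:, -1]

x = rng.normal(size=5)
x /= np.linalg.norm(x)                           # random unit vector

print(x @ A @ x <= lam_max + 1e-12)              # Rayleigh quotient lower-bounds lambda_max
print(np.isclose(v_max @ A @ v_max, lam_max))    # bound attained by the top eigenvector
```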
Variational Inference: High-level Idea
- Inference: answer queries about $ P $
- Direct inference on $ P $ is often intractable.
- Indirect approach:
- Project $ P $ to a tractable family of distributions $ Q $.
- Perform inference on $ Q $.
$ \Rightarrow $ requires a measure of distance!
- A convenient choice: $ KL(Q || P) $
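A toy sketch of "project $ P $ onto a tractable family $ Q $" (the distribution $ P $, the product family, and the grid search below are illustrative choices, not from the notes): $ P $ is a correlated distribution over two binary variables, $ Q $ ranges over product distributions $ q_A(A) q_B(B) $, and we pick the member minimizing $ KL(Q || P) $.

```python
import numpy as np

P = np.array([[0.40, 0.10],
              [0.10, 0.40]])                     # correlated P(A, B)

def kl(Q, P):
    """KL(Q || P) for distributions given as arrays of matching shape."""
    return np.sum(Q * (np.log(Q) - np.log(P)))

# Grid search over product distributions Q(A, B) = q_A(A) * q_B(B).
grid = np.linspace(0.01, 0.99, 99)
best = min((kl(np.outer([1 - a, a], [1 - b, b]), P), a, b)
           for a in grid for b in grid)
print('best KL(Q||P) = %.4f at q_A(1) = %.2f, q_B(1) = %.2f' % best)
```

The projection cannot be exact here ($ P $ is not a product distribution), so the minimal $ KL(Q || P) $ stays strictly positive.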
Exact Inference Revisited
Casting exact inference as an optimization problem.
Given factorized distribution
$$
P_{\Phi} (\mathcal{X}) = \frac{1}{Z} \prod_{\phi \in \Phi} \phi(U_\phi)
$$
- $ U_\phi = Scope[\phi] \subseteq \mathcal{X} $: the scope of each factor.
Goal: Answering queries about $ P_{\Phi} $
- Marginal probabilities of variables,
- partition function $ Z $.
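For intuition, a brute-force numpy sketch of both query types on a tiny chain $ A - B - C $ with factors $ \phi_1(A, B) $ and $ \phi_2(B, C) $ (the factor values are arbitrary illustrative choices):

```python
import numpy as np

phi1 = np.array([[1.0, 2.0], [3.0, 1.0]])        # phi_1(A, B)
phi2 = np.array([[2.0, 1.0], [1.0, 4.0]])        # phi_2(B, C)

P_tilde = np.einsum('ab,bc->abc', phi1, phi2)    # unnormalized product of factors
Z = P_tilde.sum()                                # partition function
P = P_tilde / Z                                  # P_Phi(A, B, C)

print('Z =', Z)
print('P(B) =', P.sum(axis=(0, 2)))              # marginal probability of B
```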
Recall:
- Clique tree BP results in a calibrated cluster tree $ \Rightarrow $ all pairs of adjacent cliques are calibrated (by definition).
- Calibrated set of beliefs $ Q $ for the cluster tree represents a distribution $ P_\Phi (\mathcal{X}) $.
ℹ️
Exact inference can be viewed as searching for a calibrated
distribution $ Q $ that matches $ P_\Phi $, i.e., one that minimizes the relative entropy $ \mathbb{D}(Q || P_\Phi) $.
CTree-Optimize-KL
Given a set of beliefs
$$
Q = \{ \beta_i : i \in V_{\mathcal{T}} \} \cup \{ \mu_{i,j} : (i-j) \in E_{\mathcal{T}} \}
$$
- $ \mathcal{T} $: clique tree structure for $ P_\Phi $.
- $ \beta_i $: beliefs over cluster $ C_i $.
- $ \mu_{i, j} $: beliefs over sepset $ S_{i, j} $.
The set of beliefs in $ \mathcal{T} $ defines a distribution $ Q $ via the calibration requirement:
$$
\boxed{ Q(\mathcal{X}) = \frac{\prod_{i \in V_{\mathcal{T}}} \beta_i (C_i)}{\prod_{(i - j) \in E_{\mathcal{T}}} \mu_{i, j} (S_{i, j})} }
$$
The calibration requirement ensures $ Q $ satisfies the marginal consistency constraints: for each $ (i - j) \in E_{\mathcal{T}} $, the belief on $ S_{i, j} $ is the marginal of $ \beta_i $ (and of $ \beta_j $).
$$
\mu_{i, j} (S_{i, j}) = \sum_{C_i - S_{i, j}} \beta_i (C_i) = \sum_{C_j - S_{i, j}} \beta_j (C_j)
$$
From Theorem 10.4, the calibrated beliefs are marginals of $ Q $:
$$
\beta_i[c_i] = Q(c_i), \qquad \mu_{i, j}[s_{i, j}] = Q(s_{i, j})
$$
CTree-Optimize-KL
Find $ Q = \{ \beta_i : i \in V_{\mathcal{T}} \} \cup \{ \mu_{i,j} : (i-j) \in E_{\mathcal{T}} \} $
Maximizing $ - \mathbb{D} (Q || P_\Phi) $
subject to
$$
\mu_{i, j} (s_{i, j}) = \sum_{C_i - S_{i, j}} \beta_i (c_i) \quad \forall (i - j) \in E_{\mathcal{T}}, \; \forall s_{i, j} \in Val(S_{i, j})
$$
$$
\sum_{c_i} \beta_i (c_i) = 1 \quad \forall i \in V_{\mathcal{T}}
$$
- If $ \mathcal{T} $ is a proper cluster tree for $ \Phi $, there is a set of beliefs $ Q $ that induces the distribution $ Q = P_\Phi $ (take $ \beta_i = P_\Phi(C_i) $ and $ \mu_{i, j} = P_\Phi(S_{i, j}) $).
- This solution achieves a relative entropy of 0, the lowest value possible, and hence the highest possible value of the objective $ - \mathbb{D}(Q || P_\Phi) $; it is therefore the global optimum of this optimization.
Theorem 11.1
If $ \mathcal{T} $ is an I-map of $ P_\Phi $, then there is a unique solution to CTree-Optimize-KL.
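A hedged numerical sketch of CTree-Optimize-KL on the same toy chain (the scipy/SLSQP setup and helper names such as `kl_qp` are illustrative choices, not part of the notes): minimizing $ \mathbb{D}(Q || P_\Phi) $ subject to the marginal consistency and normalization constraints drives the relative entropy to (approximately) 0 and recovers the exact clique marginals, as Theorem 11.1 predicts.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
phi1 = rng.uniform(0.5, 2.0, size=(2, 2))        # phi_1(A, B)
phi2 = rng.uniform(0.5, 2.0, size=(2, 2))        # phi_2(B, C)
joint = np.einsum('ab,bc->abc', phi1, phi2)
P = joint / joint.sum()                          # exact P_Phi(A, B, C)

def unpack(x):
    beta1 = x[:4].reshape(2, 2)                  # beliefs over clique C1 = {A, B}
    beta2 = x[4:8].reshape(2, 2)                 # beliefs over clique C2 = {B, C}
    mu = x[8:]                                   # beliefs over sepset S12 = {B}
    return beta1, beta2, mu

def kl_qp(x):
    # D(Q || P_Phi) with Q(a, b, c) = beta1(a, b) * beta2(b, c) / mu(b)
    beta1, beta2, mu = unpack(x)
    Q = beta1[:, :, None] * beta2[None, :, :] / mu[None, :, None]
    return np.sum(Q * (np.log(Q) - np.log(P)))

constraints = [
    # marginal consistency: sum_A beta1 = mu and sum_C beta2 = mu
    {'type': 'eq', 'fun': lambda x: unpack(x)[0].sum(axis=0) - unpack(x)[2]},
    {'type': 'eq', 'fun': lambda x: unpack(x)[1].sum(axis=1) - unpack(x)[2]},
    # normalization of the clique beliefs
    {'type': 'eq', 'fun': lambda x: unpack(x)[0].sum() - 1.0},
    {'type': 'eq', 'fun': lambda x: unpack(x)[1].sum() - 1.0},
]

x0 = np.concatenate([np.full(8, 0.25), np.full(2, 0.5)])   # feasible uniform start
res = minimize(kl_qp, x0, method='SLSQP',
               bounds=[(1e-6, 1.0)] * 10, constraints=constraints,
               options={'ftol': 1e-10, 'maxiter': 500})

beta1, beta2, mu = unpack(res.x)
print('D(Q||P) at the optimum:', res.fun)                       # close to 0
print('beta_1 recovers P(A,B):', np.allclose(beta1, P.sum(axis=2), atol=1e-3))
```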
Exact Inference as Optimization
- In Chapter 10, Belief Propagation results in a calibrated clique tree $ \rightarrow Q = P $.
- In this chapter, we are no longer in a clique tree but rather in a general cluster graph, which can have loops.
- We cannot use the clique tree message passing algorithm $ \rightarrow $ no guarantee of a calibrated tree representing $ P $: in general $ Q \not= P $.
Goal:
- Introduce a set of approximate beliefs $ Q = \{ \beta_i(C_i), \mu_{ij} (S_{ij}) \} $.
- Try to make $ Q $ as close as possible to $ P $,
- by optimizing the energy functional.
The Energy Functional
functional?
Functions
- Input: variables
- Output: a value
- Full and partial derivatives $ \frac{df}{dx} $
- E.g. Maximize likelihood $ p(x \mid \theta) $ w.r.t. parameters $ \theta $
Functional
- Input: functions
- Output: a value
- Functional derivatives $ \frac{\delta F}{\delta f} $
- E.g., Maximize the entropy $ H[p(x)] $ w.r.t. $ p(x) $
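A minimal sketch of such a functional optimization (the scipy setup below is an illustrative choice, not from the notes): maximize the entropy $ H[p] $ of a discrete distribution with respect to $ p $ itself, subject only to normalization; the maximizer is the uniform distribution.

```python
import numpy as np
from scipy.optimize import minimize

def neg_entropy(p):
    return np.sum(p * np.log(p))                 # -H[p]

p0 = np.array([0.1, 0.2, 0.3, 0.4])              # non-uniform starting distribution
res = minimize(neg_entropy, p0, method='SLSQP',
               bounds=[(1e-9, 1.0)] * 4,
               constraints=[{'type': 'eq', 'fun': lambda p: p.sum() - 1.0}])

print(res.x)          # ~ [0.25, 0.25, 0.25, 0.25]
print(-res.fun)       # ~ ln 4, the maximum entropy over 4 outcomes
```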
Theorem 11.2
$$
\boxed{ \mathbb{D} (Q || P_\Phi) = \ln Z - F[\tilde{P}_\Phi, Q] }
$$
where $ F[\tilde{P}_\Phi, Q] $ is the energy functional
$$
\begin{align*}
F[\tilde{P}_\Phi, Q] &= E_Q[\ln \tilde{P}_\Phi(\mathcal{X})] + H_Q(\mathcal{X}) \\
&= \sum_{\phi \in \Phi} E_Q [\ln \phi] + H_Q(\mathcal{X})
\end{align*}
$$
- $ \sum_{\phi \in \Phi} E_Q [\ln \phi] $: energy term
- Each $ \phi $ appears as a separate term.
- If the factors $ \phi $ that comprise $ \Phi $ have small scopes, each expectation involves relatively few variables.
- Depends on $ Q $.
- If inference in $ Q $ is "easy" $ \rightarrow $ evaluating such expectations is easy.
- $ H_Q(\mathcal{X}) $: entropy term
- Depends on $ Q $.
- It involves the entropy of an entire joint distribution; in general it cannot be evaluated or optimized tractably.
Proof
$$
\mathbb{D} (Q || P_\Phi) = E_Q[\ln Q (\mathcal{X})] - E_Q[\ln P_\Phi (\mathcal{X})]
$$
- $ H_Q (\mathcal{X}) = - E_Q[\ln Q (\mathcal{X})] $.
- $ P_\Phi (\mathcal{X}) = \frac{1}{Z} \prod_{\phi \in \Phi} \phi \Rightarrow \ln P_\Phi = \sum_{\phi \in \Phi} \ln \phi - \ln Z $.
Thus,
$$
\begin{align*}
\mathbb{D} (Q || P_\Phi) &= - H_Q (\mathcal{X}) - E_Q \left[ \sum_{\phi \in \Phi} \ln \phi \right] + E_Q[\ln Z] \\
&= -F[\tilde{P}_\Phi, Q] + \ln Z
\end{align*}
$$
where the last step uses $ E_Q[\ln Z] = \ln Z $, since $ Z $ is a constant.
ℹ️
Minimizing the relative entropy $ \mathbb{D} (Q || P_\Phi) $ is equivalent to maximizing the energy functional $ F[\tilde{P}_\Phi, Q] $.
Since $ \mathbb{D} (Q || P_\Phi) \geq 0 $,
$$
\ln Z \geq F[\tilde{P}_\Phi, Q]
$$
- $ F $ is a lower bound on $ \ln Z $.
- Recall: in directed models with evidence, $ Z = P(E) $, which is often the hardest quantity in inference to compute.
- If $ \mathbb{D} (Q || P_\Phi) $ is small $ \rightarrow $ we get a good lower-bound approximation to $ \ln Z $.
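A small numpy check of Theorem 11.2 and of this bound (the toy factors and the arbitrary choice of $ Q $ are illustrative, not from the notes): for any distribution $ Q $ over the same variables, $ \mathbb{D}(Q || P_\Phi) = \ln Z - F[\tilde{P}_\Phi, Q] $, and therefore $ \ln Z \geq F[\tilde{P}_\Phi, Q] $.

```python
import numpy as np

rng = np.random.default_rng(1)
phi1 = rng.uniform(0.5, 2.0, size=(2, 2))        # phi_1(A, B)
phi2 = rng.uniform(0.5, 2.0, size=(2, 2))        # phi_2(B, C)
P_tilde = np.einsum('ab,bc->abc', phi1, phi2)    # unnormalized \tilde{P}_Phi
Z = P_tilde.sum()
P = P_tilde / Z                                  # P_Phi

Q = rng.uniform(size=(2, 2, 2))
Q /= Q.sum()                                     # an arbitrary distribution Q(A, B, C)

energy = np.sum(Q * np.log(P_tilde))             # sum_phi E_Q[ln phi]
entropy = -np.sum(Q * np.log(Q))                 # H_Q(X)
F = energy + entropy                             # energy functional F[P_tilde, Q]
D = np.sum(Q * (np.log(Q) - np.log(P)))          # D(Q || P_Phi)

print(np.isclose(D, np.log(Z) - F))              # Theorem 11.2
print(np.log(Z) >= F)                            # F lower-bounds ln Z
```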