Functional Bias and Tangent-Space Geometry in Variational Inference

This paper develops a geometric framework showing that the leading-order bias of a posterior functional in variational inference is determined by the component of that functional orthogonal to the variational tangent space. In particular, it explains why structured mean-field approximations systematically distort cross-block dependencies: the interaction directions are simply missing from the tangent space.

Sean Plummer

Published Wed, 11 Ma

Here is an explanation of the paper "Functional Bias and Tangent-Space Geometry in Variational Inference" using simple language, analogies, and metaphors.

The Big Picture: The "Good Enough" Map

Imagine you are trying to navigate a complex, mountainous terrain (the True Reality or the Posterior Distribution). You need to know specific details: How high is the peak? How steep is the slope? Is there a hidden valley?

However, the terrain is too complex to map perfectly. It's too much data to process. So, you decide to use a simplified map (the Variational Approximation). This map is drawn on a flat piece of paper or a simple grid. It's easy to read and fast to use, but it can't capture every twist and turn of the real mountains.

The Problem: Because your map is simplified, it will be wrong about certain things. The paper asks: Which things will the map get right, and which things will it get wrong, and why?

The Core Idea: The "Shape" of Your Map

The author, Sean Plummer, uses a geometric idea called Tangent Space to explain this.

Think of your simplified map as a specific shape.

  • If your map is a flat sheet of paper, it can only represent flat things perfectly.
  • If your map is a grid of separate squares (this is called "Mean-Field"), it can only represent things that happen independently in each square. It cannot represent things where one square affects another.

The paper introduces a rule: The "Bias" (the error) depends on whether the thing you are measuring fits inside the shape of your map.

The Two Types of Errors

  1. The "Fits Perfectly" Group (Second-Order Bias):
    If the thing you want to measure is something your map shape can naturally describe, the error is tiny: it shows up only at second order, so it shrinks quickly as the approximation improves. It's like measuring the width of a square on a square grid; the map gets it almost exactly right.

    • Example: If you want to know the average height of the mountains in just the North block, and your map treats the North block independently, you get a great answer.
  2. The "Doesn't Fit" Group (First-Order Bias):
    If the thing you want to measure involves connections between different parts of the map, your simplified shape fails: the error shows up at first order, so it is large, systematic, and slow to vanish.

    • Example: If you want to know how the weather in the North block affects the weather in the South block, a map that treats them as separate squares will tell you they have no relationship at all. It will say, "They are independent," even if they are actually storming together. This is a huge, predictable mistake.
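The two error types above can be seen in a minimal numeric sketch. Suppose (purely for illustration) that the "terrain" is a bivariate Gaussian with two blocks correlated at 0.8. The fully factorized Gaussian that minimizes KL(q || p) has a known closed form: it matches each block's mean exactly, takes its variances from the diagonal of the precision matrix, and forces the cross-block covariance to zero.

```python
import numpy as np

# Illustrative 2-D Gaussian "terrain": two blocks (North, South), correlation 0.8.
mu = np.array([1.0, -2.0])
rho = 0.8
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

# Closed-form mean-field (fully factorized Gaussian) fit minimizing KL(q || p):
# means are exact; variances come from the diagonal of the precision matrix.
Lam = np.linalg.inv(Sigma)       # precision matrix
q_mean = mu.copy()               # block means: recovered exactly
q_var = 1.0 / np.diag(Lam)       # per-block variances: shrunk (1 - rho^2 = 0.36)
q_cov = 0.0                      # cross-block covariance: forced to zero

print("true means      :", mu)
print("mean-field means:", q_mean)      # within-block functional: no bias
print("true variance   :", Sigma[0, 0])
print("mean-field var  :", q_var[0])    # underestimated
print("true covariance :", Sigma[0, 1])
print("mean-field cov  :", q_cov)       # cross-block functional: large bias
```

The single-block functional (the mean) is exact, while the cross-block functional (the covariance) is replaced by zero no matter how strong the true dependence is, exactly the "doesn't fit" failure mode.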

The "Mean-Field" Analogy: The Silo Effect

The paper focuses heavily on a popular method called Structured Mean-Field.

Imagine a company with different departments (Marketing, Engineering, Sales).

  • The Real World: These departments talk to each other constantly. Marketing changes affect Engineering, which affects Sales.
  • The Mean-Field Map: This method forces the company to act like a set of Silos. It assumes Marketing doesn't know what Engineering is doing, and Engineering doesn't know about Sales.

The Result:

  • If you ask, "How much money does Marketing make?" (a single block), the Silo map gives a good answer.
  • If you ask, "How does a change in Marketing affect Sales?" (a connection between blocks), the Silo map gives a terrible answer. It assumes the connection is zero, even if it's huge.

The paper proves mathematically that this Silo approach will systematically underestimate the connections between different parts of the system.

The "Tangent Space" Metaphor: The Dance Floor

Imagine the "True Reality" is a complex dance floor where everyone is moving in a giant, intricate pattern.

  • The Variational Family is a group of dancers who are only allowed to move in specific, simple ways (e.g., only moving forward/backward or only left/right, but never diagonally together).
  • The Tangent Space is the set of all the moves this group can do.
  • The Bias is what happens when the real dance requires a move the group can't do (like a diagonal spin).

The paper says:

  • If the real dance move is within the group's allowed moves (the Tangent Space), the group can mimic it almost perfectly; the leftover error is only second-order.
  • If the real dance move is outside their allowed moves (in the Orthogonal Complement), the group cannot capture it, and the error is first-order: large, systematic, and slow to vanish.
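The dance-floor picture can be made concrete with a small Monte Carlo sketch (the distributions and basis here are my own illustrative choices, not the paper's construction). At a mean-field base point q(x, y) = q1(x) q2(y), the "allowed moves" are additively separable functions u(x) + v(y); projecting a functional onto a few such directions in L2(q) and measuring the residual shows whether it lies in the tangent space or in the orthogonal complement.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)   # samples from q1: standard normal block
y = rng.standard_normal(n)   # samples from q2: standard normal block

def orthogonal_residual(f_vals):
    """Project f onto span{1, x, x^2, y, y^2} (separable moves) and
    return the mean-square size of the leftover, orthogonal part."""
    basis = np.column_stack([np.ones(n), x, x**2, y, y**2])
    coef, *_ = np.linalg.lstsq(basis, f_vals, rcond=None)
    return np.mean((f_vals - basis @ coef) ** 2)

sep = orthogonal_residual(x + y**2)   # separable move: residual near 0
mix = orthogonal_residual(x * y)      # diagonal spin: residual near 1

print("f = x + y^2 (allowed move)  :", sep)
print("f = x * y   (diagonal spin) :", mix)
```

The separable functional is reproduced almost exactly, while the interaction functional x·y is essentially entirely orthogonal to the separable directions: no combination of allowed moves can imitate it.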

What This Means for Real Life

Why should you care?

  1. Don't trust the "Connections": If you use these simplified AI models to predict how different variables interact (like "How does interest rate affect stock prices?"), be very careful. The model is likely to tell you there is no relationship, even if there is one.
  2. Trust the "Averages": If you just want to know the average value of a single variable (like "What is the average temperature?"), these models are usually very accurate.
  3. Better Maps exist: The paper suggests that if you want to capture those tricky "connections," you need to change the shape of your map (the Variational Family) so it includes those connections in its "Tangent Space."
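Point 3 can also be checked numerically. In the same illustrative setup as above (my own toy basis, not the paper's), adding a coupling direction to the map puts the interaction functional inside the span of allowed moves, and its orthogonal component, the source of the first-order bias, disappears.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.standard_normal(n)   # block 1 samples
y = rng.standard_normal(n)   # block 2 samples
f = x * y                    # the cross-block functional

def orthogonal_residual(columns):
    """Mean-square size of f's component orthogonal to span(columns)."""
    B = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(B, f, rcond=None)
    return np.mean((f - B @ coef) ** 2)

silo_basis = [np.ones(n), x, x**2, y, y**2]   # separable (mean-field) moves only
rich_basis = silo_basis + [x * y]             # map enlarged with a coupling term

silo_res = orthogonal_residual(silo_basis)    # near 1: f is invisible to the silo map
rich_res = orthogonal_residual(rich_basis)    # near 0: f now fits inside the map

print("orthogonal part, silo map   :", silo_res)
print("orthogonal part, richer map :", rich_res)
```

One extra direction in the family is enough, for this particular functional, to move it from the orthogonal complement into the tangent space.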

Summary in One Sentence

This paper explains that simplified AI models are great at predicting individual parts of a system, but they systematically fail to predict how those parts influence each other, because the "shape" of the model simply doesn't have room to hold those connections.