Imagine you have an incredibly talented artist who has spent years learning to paint beautiful, realistic landscapes. This artist is your Diffusion Model. They know how to paint a forest, a mountain, or a city perfectly because they've seen millions of them.
Now, imagine you want this artist to paint a specific scene: "A forest, but with a giant, glowing blue moon in the sky."
This is where Test-Time Guidance comes in. It's like giving the artist a set of instructions while they paint. You say, "Keep the forest, but push the colors toward that blue moon idea."
The Problem: The "Good Enough" Artist
The paper argues that the current methods for giving these instructions are flawed. They are like a boss who says, "Just guess what the blue moon looks like based on the average forest you've seen," or "Just make the blue moon brighter and brighter until it hurts your eyes."
These methods work okay. They get you a picture that looks like a forest with a moon. But if you ask the artist, "What are the odds of this specific moon being here?" or "What are all the other possible versions of this scene?", the artist gives you the wrong answer. The math is "miscalibrated."
The Analogy:
Think of it like trying to find a lost hiker in a foggy forest.
- The Truth (Bayesian Posterior): You want to know the entire map of where the hiker could possibly be, with probabilities for every spot.
- Old Methods: They just point to the single spot that looks most likely and say, "They are definitely here." If you ask, "What if they are 10 feet to the left?" the old method says, "No, impossible," even though they might actually be there. They are biased; they are too confident in the wrong answer.
The Discovery: Why Old Methods Fail
The authors dug into the math and found two main reasons why the old methods fail:
- The "Average" Trap: Instead of checking every possible version of the forest to see which ones have the blue moon, the old methods just look at the "average" forest and check that one. It's like trying to guess the weather for a whole week by only looking at the temperature at noon on Tuesday. You miss the rain, the wind, and the fog.
- The "Volume Knob" Trap: To make the moon brighter, people just turn up a "guidance volume knob." But mathematically, turning up the volume on the instructions doesn't just make the moon brighter; it distorts the whole picture in a weird way that breaks the math. It's like turning up the bass on a speaker until the music sounds like a completely different song.
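The "average trap" is really Jensen's gap: checking a condition at the average sample is not the same as averaging the condition over all samples. Here is a toy sketch of the difference (the Gaussian setup and the likelihood function are illustrative assumptions, not the paper's actual model):

```python
import random
import math

random.seed(0)

def likelihood(x):
    # Toy "does this version have the blue moon?" score:
    # high when x is near the observation y = 2.
    return math.exp(-(x - 2.0) ** 2)

# 100,000 imagined "versions of the forest" from a standard normal prior.
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Old way: evaluate the likelihood at the single *average* sample.
mean_x = sum(samples) / len(samples)
shortcut = likelihood(mean_x)

# New way: average the likelihood over *all* samples.
consistent = sum(likelihood(x) for x in samples) / len(samples)

print(f"likelihood at the mean sample: {shortcut:.4f}")
print(f"mean of the likelihoods:       {consistent:.4f}")
```

The two numbers differ by several times in this toy case: the shortcut only sees the "average forest" (x near 0), which almost never has the blue moon, while the full average notices the minority of samples that do.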
The Solution: Calibrated Bayesian Guidance (CBG)
The authors propose a new way to guide the artist, called Calibrated Bayesian Guidance (CBG).
How it works (The Creative Analogy):
Instead of asking the artist to guess the moon based on one average forest, CBG says:
"Artist, imagine 1,000 different versions of this forest right now. For each one, check: 'Does this version have a blue moon?' If yes, give it a high score. If no, give it a low score. Then, average all those scores together to decide the next brushstroke."
This is the Consistent Estimator.
- Old Way: Look at one guess, make a decision. (Fast, but wrong).
- New Way (CBG): Take a sample of 1,000 guesses, weigh them all, and make a decision. (Slower, but mathematically consistent: with enough guesses, it converges to the true answer.)
The paper proves that if you keep doing this (increasing your "compute budget" or the number of guesses), you eventually get the true, perfect map of where the hiker could be. You get the real probability distribution, not just a guess.
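The "imagine many versions, score each, then average" recipe is, in effect, self-normalized importance sampling. A minimal sketch under toy assumptions (a 1-D Gaussian stand-in for the diffusion prior and a made-up likelihood, not the paper's implementation):

```python
import random
import math

random.seed(1)

def prior_sample():
    # Stand-in for one "imagined forest" drawn from the diffusion model.
    return random.gauss(0.0, 1.0)

def likelihood(x, y=2.0):
    # Score: how well this imagined version matches the observation.
    return math.exp(-(x - y) ** 2)

def posterior_mean(n_samples):
    xs = [prior_sample() for _ in range(n_samples)]
    ws = [likelihood(x) for x in xs]
    total = sum(ws)
    # Weigh every guess by its score, then average: the consistent estimator.
    return sum(w * x for w, x in zip(ws, xs)) / total

# More samples (a bigger "compute budget") means a better estimate;
# in this toy setup the exact posterior mean works out to 4/3.
for n in (10, 1_000, 100_000):
    print(n, round(posterior_mean(n), 3))
```

Note the trade-off the paper describes: each estimate is cheap to state but expensive to sharpen, since the error shrinks only as the number of guesses grows.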
Two Versions of the New Method
The authors offer two tools for this:
- Gradient-Based: Like a GPS that calculates the exact slope of the hill to guide you. It's precise but requires a lot of computing power to calculate the slope.
- Gradient-Free: Like a hiker who just throws 1,000 pebbles in different directions to see which way the wind blows. It doesn't need complex math, just a lot of samples. This is surprisingly effective and easier to use.
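One way to picture the two tools side by side, in a deliberately simplified 1-D setting (the toy log-likelihood, step sizes, and candidate counts below are illustrative assumptions, not the paper's algorithm):

```python
import random
import math

random.seed(2)

def log_likelihood(x, y=2.0):
    # Toy log-score for how well x matches the observation y.
    return -(x - y) ** 2

def gradient_step(x, step=0.1, y=2.0):
    # Gradient-based guidance: follow the exact slope of the log-likelihood.
    grad = -2.0 * (x - y)  # derivative of the toy log-likelihood
    return x + step * grad

def gradient_free_step(x, step=0.1, n=1000):
    # Gradient-free guidance: throw "pebbles" (random candidates), then
    # move toward their likelihood-weighted average; no derivatives needed.
    candidates = [x + random.gauss(0.0, 1.0) for _ in range(n)]
    ws = [math.exp(log_likelihood(c)) for c in candidates]
    target = sum(w * c for w, c in zip(ws, candidates)) / sum(ws)
    return x + step * (target - x)

x_grad = x_free = -1.0
for _ in range(50):
    x_grad = gradient_step(x_grad)
    x_free = gradient_free_step(x_free)
print(round(x_grad, 2), round(x_free, 2))
```

Both walkers end up near the observation (x = 2 here): one by computing the slope exactly, the other by sampling its way there. The gradient-free version costs many evaluations per step but never needs the function to be differentiable.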
Why Does This Matter?
For making pretty pictures (like art or memes), the old "good enough" methods are fine. You just want a cool picture.
But for Science, it matters a huge amount.
- Black Hole Imaging: The paper tested this on reconstructing images of black holes. In science, you don't just want a pretty picture; you need to know the uncertainty. "How sure are we that this ring is real? Is it a glitch?"
- Medical Imaging: "Is this a tumor, or just a shadow?"
- Climate Modeling: "What is the real range of possible temperatures?"
If your method is "miscalibrated," you might be 99% sure of a wrong answer, which is dangerous in science. The new method (CBG) ensures that when you say "I am 90% sure," you actually are 90% sure.
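"Calibrated" has a precise, testable meaning: a 90% credible interval should contain the truth about 90% of the time. Here is a self-contained coverage check in a toy conjugate-Gaussian setting (all numbers are illustrative, unrelated to the paper's experiments):

```python
import random
import math

random.seed(3)

def posterior(y, prior_var=1.0, noise_var=0.25):
    # Exact Gaussian posterior for x given the observation y = x + noise,
    # with a standard normal prior on x.
    precision = 1.0 / prior_var + 1.0 / noise_var
    mean = (y / noise_var) / precision
    return mean, math.sqrt(1.0 / precision)

hits = 0
trials = 20_000
for _ in range(trials):
    x_true = random.gauss(0.0, 1.0)          # the hidden truth
    y = x_true + random.gauss(0.0, 0.5)      # what we actually observe
    mean, sd = posterior(y)
    # 90% credible interval: mean +/- 1.645 standard deviations.
    if abs(x_true - mean) <= 1.645 * sd:
        hits += 1

print(f"claimed 90%, observed {hits / trials:.1%}")
```

A calibrated method passes this kind of test; a miscalibrated one might claim 90% but cover the truth only half the time, which is exactly the failure mode that matters for black holes and tumors.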
Summary
- The Problem: Current AI tools for solving puzzles (inverse problems) are fast but mathematical "liars." They give confident but wrong answers about probabilities.
- The Cause: They use shortcuts (averages and volume knobs) that break the math.
- The Fix: A new method called Calibrated Bayesian Guidance that takes many samples to get the true answer.
- The Result: It's slower, but it gives you the honest, scientifically accurate truth, especially for critical tasks like imaging black holes or diagnosing diseases. It trades speed for truth.