Sample-efficient evidence estimation of score-based priors for model selection

This paper proposes DiME, a sample-efficient estimator that leverages intermediate samples from the reverse diffusion sampling process to accurately compute the model evidence of diffusion priors. This enables effective model selection and prior-misfit diagnosis in ill-posed imaging inverse problems, without requiring extensive evaluations of the prior density.

Frederic Wang, Katherine L. Bouman

Published 2026-02-25

Imagine you are a detective trying to reconstruct a crime scene from a blurry, distorted photo. You have the blurry photo (the measurement), but you need to figure out what the original, clear scene looked like (the image).

In the world of science and engineering, this is called an inverse problem. The problem is that the blurry photo could have come from many different clear scenes. To pick the right one, you need a "rulebook" or a "gut feeling" about what a crime scene usually looks like. In math, this rulebook is called a Prior.

The Big Problem: Choosing the Right Rulebook

Usually, scientists just pick a rulebook they think is good. But what if they pick the wrong one?

  • If you use a rulebook that says "crime scenes are always in forests," but the crime happened in a city, your detective work will be biased and wrong.
  • If you use a rulebook that says "crime scenes are always in cities," but it happened in a forest, you'll be wrong again.

The ideal solution is to ask: "Which rulebook is most likely to have produced this specific blurry photo?" In math, this is called calculating the Model Evidence.
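In Bayesian terms, the evidence is p(measurement | rulebook): you average the likelihood of the measurement over every scene the rulebook considers plausible. For Gaussian priors with Gaussian noise this average has a closed form, which lets us sketch the idea in a few lines. The "forest" and "city" priors and all the numbers below are made up for illustration; this is the general Bayesian recipe, not DiME itself:

```python
import numpy as np

def log_evidence(y, prior_mean, prior_var, noise_var):
    """Log evidence p(y | M) for a 1-D Gaussian prior with additive
    Gaussian noise: y = x + n, x ~ N(mu, s2), n ~ N(0, t2).
    Marginalizing out x gives p(y) = N(y; mu, s2 + t2) in closed form."""
    var = prior_var + noise_var
    return -0.5 * (np.log(2 * np.pi * var) + (y - prior_mean) ** 2 / var)

# Two hypothetical "rulebooks" (priors) scored on the same noisy measurement.
y = 4.2                       # the blurry "photo"
noise_var = 0.5
models = {
    "forest": (0.0, 1.0),     # (prior mean, prior variance)
    "city":   (5.0, 1.0),
}
scores = {name: log_evidence(y, mu, v, noise_var)
          for name, (mu, v) in models.items()}
best = max(scores, key=scores.get)
print(best)  # → city: the rulebook most likely to have produced y
```

Picking the rulebook with the highest evidence is exactly the "which rulebook most likely produced this photo?" question, just in a setting simple enough to solve exactly.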

The Old Way: The Impossible Math

For a long time, calculating this "Model Evidence" was like trying to count every single grain of sand on a beach to find one specific grain. It required so much computing power that it was impossible for the most advanced AI models (called Diffusion Models) that scientists use today.

Diffusion models are like a master painter who takes a canvas covered in static noise and slowly, step by step, turns it into a clear picture. They are amazing at filling in the blanks. But because they are so complex, we couldn't easily ask them, "How likely is it that one of the pictures you would paint explains this specific blurry photo?"
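To make the "master painter" concrete, here is a toy sketch of score-based sampling in one dimension. We assume a standard normal target whose score (the gradient of the log density) is known exactly, so `score(x) = -x`; a real diffusion model replaces this function with a trained neural network and a noise schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    # Score of a standard normal target, known in closed form here.
    # A real diffusion model learns this function with a neural network.
    return -x

# Start from heavy noise and "paint" toward the target distribution with
# unadjusted Langevin dynamics. The intermediate values of x are the
# "footprints" the sampler leaves along the way.
x = rng.normal(loc=0.0, scale=5.0, size=10_000)
step = 0.05
footprints = []
for _ in range(500):
    x = x + step * score(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)
    footprints.append(x.copy())

print(round(float(x.std()), 1))  # ≈ 1.0: samples settle into the target
```

The key point for this paper is the `footprints` list: the sampler produces intermediate samples for free while it works, and DiME puts them to use.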

The New Solution: DiME (Diffusion Model Evidence)

The authors of this paper, Frederic Wang and Katherine Bouman, invented a new tool called DiME.

Here is how DiME works, using a simple analogy:

The Analogy: The Hiking Trail

Imagine the Diffusion Model is a hiker walking down a mountain trail from the peak (pure noise) to the valley (the clear image).

  1. The Old Way: To know how likely the hiker was to end up at a specific spot, you had to stop them at every single step, measure the wind, the slope, and the mud, and do a massive calculation. It took forever and often gave wrong answers.
  2. The DiME Way: DiME is like a smart observer who just watches the hiker's path.
    • As the hiker walks down, they naturally leave a trail of footprints (intermediate samples).
    • DiME doesn't need to stop the hiker or do complex math. It just looks at the distance between the hiker's path and the "expected" path of a random walker.
    • If the hiker's path stays very close to the expected path, the rulebook (Prior) is a good fit.
    • If the hiker has to take a weird, winding detour to get to the spot, the rulebook is a bad fit.

By simply measuring the "detours" the hiker takes, DiME can calculate the Model Evidence with just 20 samples, whereas older methods needed thousands.
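For contrast, here is the brute-force baseline that needs those thousands of samples: a naive Monte Carlo estimate of the evidence that averages the likelihood over draws from the prior. This is a generic textbook estimator, not DiME's method; in a toy 1-D Gaussian setting we can check it against the closed-form answer:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_evidence_mc(y, prior_samples, noise_var):
    """Naive Monte Carlo evidence estimate:
    log p(y) ~= log mean_i p(y | x_i), with x_i drawn from the prior.
    Accurate only with many samples, which is the inefficiency that
    sample-efficient estimators like DiME aim to avoid."""
    ll = -0.5 * (np.log(2 * np.pi * noise_var)
                 + (y - prior_samples) ** 2 / noise_var)
    m = ll.max()
    return m + np.log(np.mean(np.exp(ll - m)))  # stable log-mean-exp

# Toy check against the closed form: x ~ N(0, 1), y = x + n, n ~ N(0, 0.5)
y, noise_var = 1.3, 0.5
samples = rng.normal(size=100_000)        # brute force: 100k prior samples
est = log_evidence_mc(y, samples, noise_var)
exact = -0.5 * (np.log(2 * np.pi * 1.5) + y ** 2 / 1.5)
print(round(float(est), 2), round(float(exact), 2))
```

Even in one dimension, the brute-force estimate needs tens of thousands of samples to pin the answer down; in image space the cost explodes, which is why reusing the sampler's own footprints is such a win.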

What Did They Prove?

The authors tested DiME in three ways:

  1. The Math Test: They used a simple, known math problem where the answer was already written down. DiME got the answer almost perfectly, beating all the old, heavy-duty methods.
  2. The "Guess the Digit" Test: They showed the AI a blurry, noisy picture of a handwritten number (like a '6' or a '9'). They had 10 different rulebooks (one for each digit 0-9).
    • Old methods often guessed the wrong digit because they got confused by the noise.
    • DiME correctly identified the digit every single time, even when the image was very blurry.
  3. The Black Hole Test (The Real Deal): This is the coolest part. They used DiME on real data from the Event Horizon Telescope, which took the first picture of a black hole (M87*).
    • They had different rulebooks: one based on black hole physics, one based on general space photos, one based on human faces, and one based on handwritten digits.
    • DiME's Verdict: It correctly said, "The rulebook based on Black Hole Physics is the only one that makes sense for this photo." It even told them that the photo of the black hole fits perfectly within the laws of physics they used to create the rulebook.

Why Does This Matter?

Before DiME, scientists using these powerful AI models had to guess which "rulebook" to use. If they guessed wrong, their scientific conclusions could be biased or wrong.

DiME gives scientists a "truth meter."

  • It allows them to select the best AI model for a specific job.
  • It allows them to validate if their physical theories (like how black holes work) actually match reality.
  • It does all this quickly and efficiently, using the "footprints" the AI leaves behind while it works.

In short, DiME turns a black box into a transparent one, letting us trust the AI's answers in critical scientific discoveries like imaging black holes.
