Pointwise Metrics Mislead: An Evaluation Protocol for… — Plain-Language Explanation

Original authors: Mads H. Baattrup, Jörn Bach, Laurids Jeppe, Finn Labe, Alexander Grohsjean, Christian Schwanenberger, Peer Stelldinger

Published 2026-05-25



CC BY 4.0


Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Problem: The "Average" Trap

Imagine you are trying to guess the location of a hidden treasure. You have a map, but the map is a bit blurry. Sometimes, the treasure is definitely in the North cave, and sometimes it's definitely in the South cave. It is never in the middle.

In the world of science (like particle physics or medical imaging), scientists often use computers to solve these "guessing games." For a long time, they have judged how good a computer is by asking one simple question: "How close is your guess to the real answer?"

If the computer guesses "North" and the treasure is "North," it gets a high score. If it guesses "South" and the treasure is "North," it gets a low score.

The paper argues that this way of judging is broken when there are two possible answers (North and South).

If a computer is forced to give just one number as its answer to minimize its "error score," it will cheat. Instead of saying "It's either North or South," it will guess "Middle."

Why? Because mathematically, the "Middle" is the average of North and South. The distance from Middle to North is the same as from Middle to South, so across many treasure hunts the "Middle" guess has the lowest average squared error.
The Problem: The treasure is never in the Middle. The computer is giving a mathematically "perfect" average answer that is physically impossible.

The Consequence: A Blurry, Distorted Picture

The paper shows that when scientists use these "average" scores (called RMSE or MAE) to pick the best computer models, they accidentally pick models that flatten the truth.

Imagine you are trying to recreate a mountain range from blurry photos.

The Truth: Two sharp, distinct peaks (North and South).
The "Average" Model: It draws one single, wide, flat hill in the middle.

If you look at the "flat hill," it might look closer to the photos than the sharp peaks do, so the computer gets a better score. But if you use that flat hill to build a ski resort, you will be in big trouble because there are no actual peaks to ski on.

In science, these "peaks" and "tails" of the data contain the most important secrets (like the mass of a new particle). By forcing the computer to give a single "average" answer, we are accidentally smearing out the most important details, making our scientific measurements wrong.

The Solution: A New Three-Step Test

The authors propose a new way to test these computers, like a driving test with three different parts instead of just one.

1. The "Full Map" Test (CRPS)
Instead of asking for just one guess, we ask the computer to draw the whole map of possibilities.

Analogy: Instead of asking "Is the treasure North or South?", we ask, "Draw the probability map."
A good model will draw two distinct blobs (one for North, one for South). A bad model will draw one big blob in the middle. This test rewards models that admit, "I don't know exactly which one it is, but I know it's one of these two."

2. The "Crowd" Test (Spectrum Fidelity)
We look at the results of 10,000 guesses all together.

Analogy: If you ask 1,000 people to guess where the treasure is, and 500 say North and 500 say South, you get a perfect picture of the two caves. If the "average" model is used, everyone says "Middle," and you get a picture of a single, fake cave.
This test checks if the collection of guesses looks like the real world, not just if individual guesses are close.

3. The "Confidence" Test (Calibration)
We check if the computer is honest about how sure it is.

Analogy: If a weather app says there is a 90% chance of rain, it should rain 90% of the time. If it says 90% but it only rains 50% of the time, the app is lying about its confidence.
This test ensures the computer isn't just guessing wildly but is actually confident in the right places.

What They Found

The authors tested this new method on two things:

A fake math problem where they knew the exact answer.
A real physics problem involving top quarks (tiny particles) where two neutrinos (ghost particles) escape detection, making the math very tricky.

The Shocking Result:
The models that looked like the "winners" under the old "Average" test (the ones that gave the single, flat, middle answer) were actually the worst at preserving the true shape of the data.

The models that gave the "messy" two-blob answers (the ones that looked worse under the old test) were actually the best at telling the truth.

The Takeaway

The paper concludes that how you measure success determines what you find.

If you only measure "how close is the guess to the truth," you will build models that erase the interesting, complex parts of reality. To get the right scientific answer, you have to stop asking for a single number and start asking for the full story of possibilities.

In short: Don't just ask, "How close were you?" Ask, "Did you tell the whole story?"

Problem Statement

In scientific reconstruction (e.g., particle physics, medical imaging, geophysics), evaluation is currently dominated by pointwise metrics such as root-mean-squared error (RMSE), mean absolute error (MAE), and per-event resolution. These metrics rest on the implicit assumption that lower error means better reconstruction.

The authors argue that this assumption fails structurally for under-constrained inverse problems where the conditional posterior $p(z|x)$ is multimodal. In such scenarios, the optimal predictor under MSE is the conditional expectation $E[z|x]$. For multimodal posteriors, this expectation often falls between modes, in regions of vanishing probability density. Consequently, models trained to minimize pointwise errors produce predictions that are individually "unphysical" and, when aggregated, systematically compress the marginal spectrum of the latent variable $z$. This compression distorts the tails, modes, and shapes of distributions, which are precisely the features that downstream scientific measurements rely on.
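To make the conditional-mean argument concrete, here is a minimal sketch in Python (NumPy assumed). The two-mode target at -1 and +1 and the grid of candidate constant predictions are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal target: z is -1 or +1 with equal probability.
z = rng.choice([-1.0, 1.0], size=10_000)

# Candidate constant predictions, covering both modes and the gap between them.
candidates = np.linspace(-1.5, 1.5, 301)

# Mean squared error of each constant prediction against all targets.
mse = ((z[None, :] - candidates[:, None]) ** 2).mean(axis=1)

best = candidates[np.argmin(mse)]
near_best = (np.abs(z - best) < 0.5).mean()

print(f"MSE-optimal constant: {best:.2f}")
print(f"Sample mean of z:     {z.mean():.2f}")
print(f"Fraction of targets within 0.5 of the optimum: {near_best:.2f}")
```

The MSE-optimal constant is the grid point closest to the sample mean, which sits between the two modes where no target value ever occurs.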

Theoretical Foundation

The paper establishes a theoretical argument based on the Law of Total Variance:
$\text{Var}[z] = E[\text{Var}[z|x]] + \text{Var}[E[z|x]]$
The authors show that for any point estimator $f_\theta(x)$ converging to the conditional mean $E[z|x]$, the variance of the predictions, $\text{Var}[E[z|x]]$, is less than or equal to the true marginal variance $\text{Var}[z]$, because the term $E[\text{Var}[z|x]]$ is non-negative. Equality holds only if the posterior has zero width.

Implication: Point estimators inherently produce a marginal spectrum that is narrower than the truth. This is a bias, not a variance term, meaning it does not diminish with larger dataset sizes.
Consequence: Evaluating models solely by pointwise metrics actively rewards the suppression of posterior structure and penalizes models that preserve it, leading to biased scientific conclusions.
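
A short worked example may help here. It assumes, purely for illustration, a noise-free observation $x = z^2$ and a prior on $z$ that is symmetric about zero, so the posterior puts equal weight on $+\sqrt{x}$ and $-\sqrt{x}$. These assumptions are a simplification and are not taken from the paper. Under them:
$E[z|x] = \tfrac{1}{2}\sqrt{x} + \tfrac{1}{2}(-\sqrt{x}) = 0$
$\text{Var}[z|x] = E[z^2|x] - (E[z|x])^2 = x$
$\text{Var}[E[z|x]] = 0, \qquad E[\text{Var}[z|x]] = E[x] = E[z^2] = \text{Var}[z]$
The last equality uses $E[z] = 0$, which follows from the symmetry of the prior. In this limiting case the entire marginal variance sits in the term $E[\text{Var}[z|x]]$, so a predictor that outputs the conditional mean reproduces none of the true spread of $z$.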

Methodology: A Three-Part Evaluation Protocol

To address these failure modes, the authors propose a three-metric protocol where each metric targets a specific deficiency missed by the others:

Per-Event Distributional Accuracy (CRPS):
- Uses the Continuous Ranked Probability Score (CRPS), a strictly proper scoring rule.
- Unlike RMSE/MAE, the expected CRPS is minimized only when the predictive distribution matches the true posterior. It penalizes "posterior collapse" (predicting a single point in a multimodal space) rather than rewarding it.
- It reduces to MAE for point estimators, allowing fair comparison between generative and regression models.
Population-Level Spectrum Fidelity:
- Evaluates the marginal distribution $p(z)$ across the entire dataset, which is the quantity of interest for downstream physics.
- Uses a binned $\chi^2$ statistic comparing the histogram of predicted values against the true values.
- This metric detects the systematic compression of spectral features (tails and modes) that pointwise metrics miss.
Uncertainty Trustworthiness (Calibration):
- Assesses whether the width of the predicted posterior is trustworthy using conformal prediction to generate coverage curves.
- A perfectly calibrated model produces a coverage curve tracking the diagonal (empirical coverage equals nominal confidence level).
- This distinguishes between models that are merely sharp (narrow) and those that are both sharp and calibrated.
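
As a rough illustration of the CRPS item above, the sketch below uses the standard sample-based estimate CRPS ≈ E|X - y| - 0.5 E|X - X'|, where X and X' are draws from the predictive distribution and y is the truth. The function name `crps_from_samples`, the two-mode toy posterior, and the sample size are assumptions made for this example, and the paper's own estimator may differ.

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    term_obs = np.abs(samples - y).mean()
    # Mean absolute difference over all sample pairs (including i == j).
    term_spread = np.abs(samples[:, None] - samples[None, :]).mean()
    return term_obs - 0.5 * term_spread

rng = np.random.default_rng(0)
y_true = 1.0  # hypothetical truth, sitting on the +1 mode

# A point estimate at the conditional mean, treated as a one-sample "distribution".
point = np.zeros(1)
# A bimodal predictive distribution with narrow modes at -1 and +1.
bimodal = rng.choice([-1.0, 1.0], size=1000) + 0.05 * rng.normal(size=1000)

crps_point = crps_from_samples(point, y_true)
crps_bimodal = crps_from_samples(bimodal, y_true)

print(f"CRPS, point estimate at 0: {crps_point:.3f}")
print(f"CRPS, bimodal samples:     {crps_bimodal:.3f}")
```

For the single-point prediction the spread term vanishes and the score equals the absolute error |0 - 1|, which is the MAE reduction mentioned in the list. The bimodal prediction scores lower (better) even though half of its samples sit on the wrong mode.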

Key Contributions

Theoretical Proof: Demonstrated that any point estimator minimizing MSE or MAE produces a marginal spectrum strictly narrower than the truth whenever the posterior has nonzero variance, regardless of architecture or dataset size.
Evaluation Protocol: Introduced a unified protocol (CRPS, Spectrum Fidelity, Calibration) applicable across regression, mixture, and generative model families.
Empirical Validation: Showed that model rankings reverse between pointwise and distributional metrics on both synthetic and real-world benchmarks.

Experimental Results

Benchmark I: Synthetic Inverse Problem

Setup: A controlled problem with an analytically tractable bimodal posterior ($x = z^2 + \epsilon$).
Findings:
- A standard Regression MLP achieved the lowest RMSE but collapsed the marginal spectrum to a spike at zero (the conditional mean), failing to represent the bimodal truth.
- Generative models (Normalizing Flows, Mixture Density Networks) had higher RMSE but achieved near-perfect CRPS and spectrum fidelity ($\chi^2_{spec}$ close to degrees of freedom).
- Averaging the posterior samples of the Normalizing Flow recovered the Regression's poor RMSE and spectral distortion, confirming the Regression is simply the conditional mean of the Flow.
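
The qualitative pattern in these findings can be reproduced with a small simulation. The sketch below is not the paper's code: the standard normal prior on z, the noise level, the binning, and the two-sample form of the binned statistic, sum over bins of (n_pred - n_true)^2 / (n_pred + n_true), are all assumptions, and `spectrum_chi2` is a hypothetical helper name. The "regression" and "generative" models are idealized stand-ins rather than trained networks.

```python
import numpy as np

def spectrum_chi2(pred, truth, bins):
    """Two-sample binned chi-square between equal-sized samples."""
    n_pred, _ = np.histogram(pred, bins=bins)
    n_true, _ = np.histogram(truth, bins=bins)
    total = n_pred + n_true
    mask = total > 0  # skip bins that are empty in both histograms
    chi2 = ((n_pred[mask] - n_true[mask]) ** 2 / total[mask]).sum()
    return chi2, int(mask.sum())

rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=n)                  # assumed symmetric prior on z
x = z**2 + 1e-3 * rng.normal(size=n)    # assumed small noise level

# Stand-in for a regression at the conditional mean: always predicts 0.
z_point = np.zeros(n)
# Stand-in for a generative model: one draw per event from the two-branch
# posterior, +sqrt(x) or -sqrt(x) with equal probability (noise ignored).
root = np.sqrt(np.clip(x, 0.0, None))
z_sample = rng.choice([-1.0, 1.0], size=n) * root

bins = np.linspace(-4.1, 4.1, 42)  # 41 bins of width 0.2, one centered on 0
chi2_point, bins_point = spectrum_chi2(z_point, z, bins)
chi2_sample, bins_sample = spectrum_chi2(z_sample, z, bins)
rmse_point = np.sqrt(np.mean((z_point - z) ** 2))
rmse_sample = np.sqrt(np.mean((z_sample - z) ** 2))

print(f"point:  RMSE {rmse_point:.2f}  chi2 {chi2_point:.0f} over {bins_point} bins")
print(f"sample: RMSE {rmse_sample:.2f}  chi2 {chi2_sample:.0f} over {bins_sample} bins")
```

In this toy setting the constant predictor has the lower RMSE but a far larger spectrum statistic, while single posterior draws have the higher RMSE and a histogram that follows the true one.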

Benchmark II: Particle Physics (Top-Quark Reconstruction)

Setup: Reconstructing top-quark pairs from dileptonic decays (a many-to-one inverse problem with combinatorial ambiguity and missing neutrinos).
Findings:
- Pointwise Metrics: A Transformer trained with pure MSE achieved the best RMSE. A Transformer with marginal MMD (Maximum Mean Discrepancy) regularization performed slightly worse.
- Distributional Metrics: The ranking flipped. A Discrete Normalizing Flow dominated on CRPS and spectrum fidelity. The Transformers, even with MMD regularization, failed to correct per-event multimodality, resulting in massive $\chi^2_{spec}$ values (orders of magnitude worse than flows).
- Calibration: While CRPS and spectrum fidelity distinguished the flows from transformers, calibration distinguished between the two flow architectures. The Discrete Flow (exact likelihood) was well-calibrated, whereas the Continuous Flow (approximate ODE-based likelihood) systematically undercovered, a distinction invisible to CRPS alone.
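
To show what undercoverage looks like on a coverage curve, here is a simplified sketch. Note the swap: the paper builds its curves with conformal prediction, whereas this example only checks how often the truth falls inside central intervals computed from posterior samples. The function name `coverage_curve`, the Gaussian toy posteriors, and the 40% width reduction are illustrative assumptions.

```python
import numpy as np

def coverage_curve(samples, truth, levels):
    """Empirical coverage of central intervals built from posterior samples.

    samples has shape (n_events, n_samples); truth has shape (n_events,).
    """
    covered = []
    for level in levels:
        lo = np.quantile(samples, 0.5 - level / 2, axis=1)
        hi = np.quantile(samples, 0.5 + level / 2, axis=1)
        covered.append(np.mean((truth >= lo) & (truth <= hi)))
    return np.array(covered)

rng = np.random.default_rng(0)
n_events, n_samples = 2000, 500
truth = rng.normal(size=n_events)  # hypothetical per-event truth, N(0, 1)

# Hypothetical models: one with the correct width, one 40% too narrow.
calibrated = rng.normal(scale=1.0, size=(n_events, n_samples))
too_narrow = rng.normal(scale=0.6, size=(n_events, n_samples))

levels = np.linspace(0.1, 0.9, 9)
cov_calibrated = coverage_curve(calibrated, truth, levels)
cov_narrow = coverage_curve(too_narrow, truth, levels)

for lvl, a, b in zip(levels, cov_calibrated, cov_narrow):
    print(f"nominal {lvl:.1f}  calibrated {a:.2f}  too narrow {b:.2f}")
```

The correctly sized posterior tracks the diagonal, while the overly narrow one falls below it at every nominal level, which is the signature of systematic undercoverage.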

Significance and Claims

The paper claims that the evaluation protocol, not the model, determines the scientific conclusion. By relying on pointwise metrics, the scientific community has been inadvertently favoring models whose reconstructed spectra cannot support downstream measurements.

Structural Misalignment: The authors assert that pointwise metrics are structurally misaligned with the goals of scientific reconstruction in multimodal settings.
Necessity of the Protocol: The proposed three-metric protocol is necessary to expose distinctions between architectures that appear identical under standard metrics (e.g., distinguishing between exact and approximate likelihood flows via calibration).
Domain Agnosticism: The findings apply to any inverse problem with non-negligible posterior variance (e.g., phase retrieval, cosmological inference), not just the specific benchmarks tested.

The authors conclude that careful evaluation using this protocol makes the bias of pointwise-only evaluation visible, giving practitioners a basis for comparison that scientific conclusions can rest on. They note that the absolute performance values are specific to their experimental setup; the ranking flip itself is the robust, generalizable result.

Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems