Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems

This paper argues that standard pointwise metrics like RMSE and MAE structurally fail to evaluate multimodal inverse problems by systematically biasing reconstructions toward narrower distributions, and proposes a three-part evaluation protocol based on distributional accuracy, spectrum fidelity, and uncertainty calibration to ensure scientifically valid conclusions.

Original authors: Mads H. Baattrup, Jörn Bach, Laurids Jeppe, Finn Labe, Alexander Grohsjean, Christian Schwanenberger, Peer Stelldinger

Published 2026-05-25

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Problem: The "Average" Trap

Imagine you are trying to guess the location of a hidden treasure. You have a map, but the map is a bit blurry. Sometimes, the treasure is definitely in the North cave, and sometimes it's definitely in the South cave. It is never in the middle.

In the world of science (like particle physics or medical imaging), scientists often use computers to solve these "guessing games." For a long time, they have judged how good a computer is by asking one simple question: "How close is your guess to the real answer?"

If the computer guesses "North" and the treasure is "North," it gets a high score. If it guesses "South" and the treasure is "North," it gets a low score.

The paper argues that this way of judging is broken when there is more than one possible answer (here, North and South).

If a computer is forced to give just one number as its answer to minimize its "error score," it will cheat. Instead of saying "It's either North or South," it will guess "Middle."

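  • Why? Because mathematically, the "Middle" is the average of North and South. Guessing the wrong cave is one big miss, and squared-error scores like RMSE punish one big miss more than two medium ones. The "Middle" guess is always a medium miss, so it has the lowest average squared error. Under absolute error (MAE) with two equally likely caves, the "Middle" merely ties with either cave, so it is still never penalized for being impossible.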
  • The Problem: The treasure is never in the Middle. The computer is giving a mathematically "perfect" average answer that is physically impossible.
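The "Middle" trap can be checked with a few lines of arithmetic. The sketch below is a toy illustration, not the paper's setup: it assumes the two caves sit at -1 and +1 with equal probability, and the helper names (rmse, mae, candidates) are made up for this example.

```python
import numpy as np

# Toy assumption: the treasure is at -1 (South) or +1 (North), 50/50,
# and never anywhere in between.
rng = np.random.default_rng(0)
truth = rng.choice([-1.0, 1.0], size=10_000)

def rmse(guess, truth):
    # Root-mean-square error of a single constant guess.
    return float(np.sqrt(np.mean((guess - truth) ** 2)))

def mae(guess, truth):
    # Mean absolute error of a single constant guess.
    return float(np.mean(np.abs(guess - truth)))

# Try every constant guess from -1.5 to 1.5 and keep the one RMSE likes best.
candidates = np.linspace(-1.5, 1.5, 31)
rmse_scores = np.array([rmse(g, truth) for g in candidates])
best_guess = float(candidates[np.argmin(rmse_scores)])

print("RMSE-optimal constant guess:", round(best_guess, 3))
print("RMSE  middle vs North cave:", rmse(0.0, truth), rmse(1.0, truth))
print("MAE   middle vs North cave:", mae(0.0, truth), mae(1.0, truth))
```

RMSE prefers the impossible middle guess, while MAE is roughly indifferent between the middle and a cave. Neither score rewards a model for saying "one of these two."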

The Consequence: A Blurry, Distorted Picture

The paper shows that when scientists use these "average" scores (called RMSE, the root-mean-square error, or MAE, the mean absolute error) to pick the best computer models, they accidentally pick models that flatten the truth.

Imagine you are trying to recreate a mountain range from blurry photos.

  • The Truth: Two sharp, distinct peaks (North and South).
  • The "Average" Model: It draws one single, wide, flat hill in the middle.

If you look at the "flat hill," it might look closer to the photos than the sharp peaks do, so the computer gets a better score. But if you use that flat hill to build a ski resort, you will be in big trouble because there are no actual peaks to ski on.

In science, these "peaks" and "tails" of the data contain the most important secrets (like the mass of a new particle). By forcing the computer to give a single "average" answer, we are accidentally smearing out the most important details, making our scientific measurements wrong.
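The loss of "tails" can be seen even in the simplest case. The sketch below assumes a made-up Gaussian toy problem (a true value plus equally large measurement blur) and uses a straight-line least-squares fit as a stand-in for an RMSE-trained model. None of this comes from the paper; it only illustrates how an RMSE-optimal reconstruction ends up narrower than the truth.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
truth = rng.normal(0.0, 1.0, n)              # quantity we want to reconstruct
observed = truth + rng.normal(0.0, 1.0, n)   # blurry measurement of it

# Least-squares straight-line fit from observation to truth. For this
# Gaussian toy problem it approximates the RMSE-optimal point estimate.
slope, intercept = np.polyfit(observed, truth, 1)
reco = slope * observed + intercept

# Compare the width and the tails of the reconstructed and true distributions.
width_ratio = float(reco.std() / truth.std())
tail_truth = float(np.mean(np.abs(truth) > 2.0))
tail_reco = float(np.mean(np.abs(reco) > 2.0))

print("reconstructed width / true width:", round(width_ratio, 3))
print("fraction beyond |2|: truth", tail_truth, "reconstruction", tail_reco)
```

The reconstructed values are pulled toward the center, so rare events far from the average are under-represented in the result.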

The Solution: A New Three-Step Test

The authors propose a new way to test these computers, like a driving test with three different parts instead of just one.

1. The "Full Map" Test (CRPS, the Continuous Ranked Probability Score)
Instead of asking for just one guess, we ask the computer to draw the whole map of possibilities.

  • Analogy: Instead of asking "Is the treasure North or South?", we ask, "Draw the probability map."
  • A good model will draw two distinct blobs (one for North, one for South). A bad model will draw one big blob in the middle. This test rewards models that admit, "I don't know exactly which one it is, but I know it's one of these two."
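One common way to compute a CRPS from a set of model samples is the estimate E|X - y| - 0.5 E|X - X'|, where X and X' are independent draws from the model and y is the true answer (lower is better). The sketch below uses that estimator on the two-cave toy problem. Whether the paper uses this exact estimator is an assumption, and the function name crps_from_samples is hypothetical.

```python
import numpy as np

def crps_from_samples(samples, y):
    # Sample-based CRPS estimate: average miss distance minus half the
    # average distance between pairs of the model's own samples.
    samples = np.asarray(samples, dtype=float)
    miss = np.mean(np.abs(samples - y))
    spread = np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(miss - 0.5 * spread)

rng = np.random.default_rng(1)
n = 500
# "Good" map: two narrow blobs, one at each cave (toy positions -1 and +1).
two_blobs = np.concatenate([rng.normal(-1.0, 0.1, n // 2),
                            rng.normal(1.0, 0.1, n // 2)])
# "Bad" map: one narrow blob in the middle.
one_blob = rng.normal(0.0, 0.1, n)

# Average the score over both possible true locations.
caves = [-1.0, 1.0]
score_two_blobs = float(np.mean([crps_from_samples(two_blobs, y) for y in caves]))
score_one_blob = float(np.mean([crps_from_samples(one_blob, y) for y in caves]))

print("CRPS two blobs:", round(score_two_blobs, 3))
print("CRPS one blob :", round(score_one_blob, 3))
```

For a model that outputs a single point, the spread term vanishes and the score reduces to the absolute error, which is why CRPS can compare point guesses and full maps on the same scale.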

2. The "Crowd" Test (Spectrum Fidelity)
We look at the results of thousands of guesses all together.

  • Analogy: If you ask 1,000 people to guess where the treasure is, and 500 say North and 500 say South, you get a perfect picture of the two caves. If the "average" model is used, everyone says "Middle," and you get a picture of a single, fake cave.
  • This test checks if the collection of guesses looks like the real world, not just if individual guesses are close.
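The tension between "close individual guesses" and "a realistic crowd" can be sketched as follows. The binned distance used here (total variation between histograms) is only a stand-in chosen for this example; this summary does not say which spectrum metric the paper uses. The cave positions and model names are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_events = 10_000

def two_caves(size):
    # Hypothetical truth: a cave at -1 or +1, with a little spread around each.
    return rng.choice([-1.0, 1.0], size=size) + rng.normal(0.0, 0.1, size)

truth = two_caves(n_events)
middle_guesses = np.zeros(n_events)     # "average" model: always the middle
sampled_guesses = two_caves(n_events)   # sampling model: one draw per event

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def histogram_distance(a, b, bins):
    # Total variation distance between binned spectra:
    # 0 means identical histograms, 1 means no overlap at all.
    pa = np.histogram(a, bins=bins)[0] / len(a)
    pb = np.histogram(b, bins=bins)[0] / len(b)
    return float(0.5 * np.abs(pa - pb).sum())

bins = np.linspace(-2.0, 2.0, 41)
rmse_middle = rmse(middle_guesses, truth)
rmse_sampled = rmse(sampled_guesses, truth)
dist_middle = histogram_distance(middle_guesses, truth, bins)
dist_sampled = histogram_distance(sampled_guesses, truth, bins)

print("RMSE               middle:", rmse_middle, "sampled:", rmse_sampled)
print("histogram distance middle:", dist_middle, "sampled:", dist_sampled)
```

In this toy setup the middle model wins on RMSE but its spectrum has almost no overlap with the truth, while the sampling model loses on RMSE and reproduces the two-cave picture.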

3. The "Confidence" Test (Calibration)
We check if the computer is honest about how sure it is.

  • Analogy: If a weather app says there is a 90% chance of rain, it should rain 90% of the time. If it says 90% but it only rains 50% of the time, the app is lying about its confidence.
  • This test checks that the computer's stated confidence matches how often it is actually right, so it is neither overconfident nor needlessly vague.
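A simple version of the weather-app check is to count how often the truth lands inside a model's stated 90% interval. The sketch below assumes a single-peaked toy problem and central intervals for simplicity. For two-peaked answers a different kind of region may be more appropriate, and the paper's actual calibration procedure is not specified in this summary. The function name empirical_coverage is hypothetical.

```python
import numpy as np

def empirical_coverage(samples, truths, level):
    # Fraction of events whose true value falls inside the central
    # `level` interval of that event's predicted samples.
    lo = np.quantile(samples, (1.0 - level) / 2.0, axis=1)
    hi = np.quantile(samples, 1.0 - (1.0 - level) / 2.0, axis=1)
    return float(np.mean((truths >= lo) & (truths <= hi)))

rng = np.random.default_rng(3)
n_events, n_samples = 2000, 500
truths = rng.normal(0.0, 1.0, n_events)

# Honest model: its predicted spread matches the real spread.
honest = rng.normal(0.0, 1.0, (n_events, n_samples))
# Overconfident model: it claims half the real spread.
overconfident = rng.normal(0.0, 0.5, (n_events, n_samples))

cov_honest = empirical_coverage(honest, truths, 0.9)
cov_overconfident = empirical_coverage(overconfident, truths, 0.9)

print("stated 90%, honest model covers:       ", cov_honest)
print("stated 90%, overconfident model covers:", cov_overconfident)
```

A calibrated model's coverage sits close to the stated level, while an overconfident model covers the truth noticeably less often than it claims.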

What They Found

The authors tested this new method on two things:

  1. A fake math problem where they knew the exact answer.
  2. A real physics problem involving top quarks (tiny particles) where two neutrinos (ghost particles) escape detection, making the math very tricky.

The Shocking Result:
The models that looked like the "winners" under the old "Average" test (the ones that gave the single, flat, middle answer) were actually the worst at preserving the true shape of the data.

The models that gave the "messy" two-blob answers (the ones that looked worse under the old test) were actually the best at telling the truth.

The Takeaway

The paper concludes that how you measure success determines what you find.

If you only measure "how close is the guess to the truth," you will build models that erase the interesting, complex parts of reality. To get the right scientific answer, you have to stop asking for a single number and start asking for the full story of possibilities.

In short: Don't just ask, "How close were you?" Ask, "Did you tell the whole story?"
