Uncertainty in Physics and AI: Taxonomy, Quantification, and Validation

This paper presents a unified taxonomy and structured overview of uncertainty quantification in machine learning for physics, clarifying interpretations across statistical frameworks and outlining principled validation tools to ensure reliable probabilistic statements for scientific discovery.

Original authors: Manuel Haußmann, Ramon Winterhalder, Maria Ubiali

Published 2026-05-12

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: Why Guessing Isn't Enough

Imagine you are a physicist trying to discover a new particle, or a doctor using an AI to diagnose a disease. In both cases, getting the right answer is important, but knowing how sure you are about that answer is even more critical.

If an AI says, "There is a 99% chance this is a tumor," but it's actually just a shadow, that's dangerous. If a physicist says, "We found a new particle," but their math doesn't account for the "fuzziness" of their data, they might be wrong.

This paper is a guidebook for scientists and AI researchers. It argues that we need a common language to talk about uncertainty (the "fuzziness" or "doubt" in predictions) and strict rules to check if that uncertainty is being reported honestly.


1. The Dictionary of Doubt (Taxonomy)

The paper starts by pointing out that physicists and AI experts often use different words for the same things, which causes confusion. They propose a clear "dictionary" with two main axes to sort out uncertainty:

Axis A: Where does the doubt come from? (Source)

  • Statistical Uncertainty (The "Noise"): Imagine you are trying to guess the average height of people in a room by measuring just three people. Your guess might be off simply because you didn't measure enough people. This is statistical uncertainty. If you measure 1,000 people, this doubt shrinks dramatically, roughly in proportion to one over the square root of the sample size.
  • Systematic Uncertainty (The "Broken Ruler"): Imagine you measure 1,000 people, but your ruler is actually 1 inch too short. No matter how many people you measure, your answer will be off by the same amount. This is systematic uncertainty. It comes from flawed tools or wrong assumptions, not from a lack of data.
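
The difference between the two sources can be sketched with a small simulation of the height example. This is an illustration only, not code from the paper: the function name `measure_mean`, the true mean of 170 cm, the 10 cm spread, and the 2.54 cm ruler offset are all assumed values chosen for the demo.

```python
import random

TRUE_MEAN_CM = 170.0  # assumed "true" average height for the demo


def measure_mean(n_people, spread_cm=10.0, ruler_offset_cm=0.0, seed=0):
    """Average of n simulated height readings, optionally with a miscalibrated ruler."""
    rng = random.Random(seed)
    readings = [
        rng.gauss(TRUE_MEAN_CM, spread_cm) + ruler_offset_cm
        for _ in range(n_people)
    ]
    return sum(readings) / len(readings)


for n_people in (3, 1000, 100000):
    good = measure_mean(n_people)
    biased = measure_mean(n_people, ruler_offset_cm=2.54)
    # The statistical error shrinks as n grows; the systematic offset does not.
    print(
        f"n={n_people:>6}  "
        f"error with good ruler: {abs(good - TRUE_MEAN_CM):.3f} cm  "
        f"error with broken ruler: {abs(biased - TRUE_MEAN_CM):.3f} cm"
    )
```

With the good ruler the error should fall toward zero as more people are measured, while the broken ruler leaves an error that settles near the size of the offset.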

Axis B: Can we fix it? (Nature)

  • Aleatoric Uncertainty (The "Roll of the Dice"): This is randomness that is built into the process itself. Think of flipping a fair coin. No matter how many flips you have already recorded, you can't predict the next one. This is irreducible: more data won't fix it.
  • Epistemic Uncertainty (The "Missing Puzzle Piece"): This is doubt caused by a lack of knowledge. Imagine you are trying to solve a puzzle but you are missing half the pieces. If you get more pieces (more data) or a better picture of what the puzzle looks like (better theory), this doubt shrinks. This is reducible.
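
One common way to separate the two in practice is the law of total variance applied to an ensemble: the average of the members' noise estimates stands in for the aleatoric part, and the disagreement between members stands in for the epistemic part. The sketch below uses a bootstrap ensemble as a stand-in for "several trained models". This is an assumed convention for illustration, not necessarily the decomposition used in the paper, and `decompose` is a hypothetical helper.

```python
import random


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def decompose(n_points, noise_sd=1.0, n_members=100, seed=0):
    """Split uncertainty about a noisy quantity into aleatoric and epistemic parts.

    Each ensemble member is fit (here: a mean and a variance) on a bootstrap
    resample of the same data set.
    """
    rng = random.Random(seed)
    data = [rng.gauss(0.0, noise_sd) for _ in range(n_points)]
    member_means, member_vars = [], []
    for _ in range(n_members):
        resample = rng.choices(data, k=n_points)
        member_means.append(mean(resample))
        member_vars.append(variance(resample))
    aleatoric = mean(member_vars)       # average noise level seen by the members
    epistemic = variance(member_means)  # disagreement between the members
    return aleatoric, epistemic


for n_points in (20, 2000):
    aleatoric, epistemic = decompose(n_points)
    print(f"n={n_points:>4}  aleatoric={aleatoric:.3f}  epistemic={epistemic:.5f}")
```

With more data the epistemic term should shrink sharply, while the aleatoric term stays close to the true noise variance (1.0 under the assumed `noise_sd`).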

The Paper's Key Insight: These categories overlap. For example, a "broken ruler" (Systematic) could be a "missing puzzle piece" (Epistemic) if we just don't know the ruler is broken yet. The paper provides a chart to help sort these out so scientists don't mix them up.


2. Two Ways of Thinking (Frequentist vs. Bayesian)

The paper explains that there are two main schools of thought on how to handle these doubts:

  • The Frequentist (The "Long-Run Gambler"): This approach asks: "If I repeated this experiment many times, how often would my procedure give the right answer?" The focus is on coverage. When frequentists quote a 95% confidence interval, they mean that in about 95 out of 100 repeated experiments, the interval they compute will contain the true answer.
  • The Bayesian (The "Belief Updater"): This approach asks: "Given what I knew before and what I just saw, how likely is my answer?" They start with a "prior belief" (a guess based on past experience) and update it with new data to create a "posterior" (the new, updated belief).
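
To make the contrast concrete, here is a minimal sketch for the simplest textbook case: estimating a mean from measurements with known Gaussian noise. The function names, the three data points, and the prior are assumptions made for the illustration and do not come from the paper.

```python
import math


def frequentist_interval(data, sigma, z=1.96):
    """95% confidence interval for the mean, assuming known noise level sigma."""
    center = sum(data) / len(data)
    half_width = z * sigma / math.sqrt(len(data))
    return center - half_width, center + half_width


def bayesian_update(data, sigma, prior_mean, prior_sd):
    """Posterior mean and standard deviation for a Gaussian prior and Gaussian noise."""
    n = len(data)
    sample_mean = sum(data) / n
    prior_precision = 1.0 / prior_sd**2
    data_precision = n / sigma**2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * sample_mean)
    return post_mean, math.sqrt(post_var)


data = [1.0, 2.0, 3.0]  # hypothetical measurements
low, high = frequentist_interval(data, sigma=1.0)
print(f"frequentist 95% interval: [{low:.3f}, {high:.3f}]")

post_mean, post_sd = bayesian_update(data, sigma=1.0, prior_mean=0.0, prior_sd=1.0)
print(f"posterior with informative prior: mean={post_mean:.3f}, sd={post_sd:.3f}")

wide_mean, wide_sd = bayesian_update(data, sigma=1.0, prior_mean=0.0, prior_sd=1e6)
print(f"posterior with very wide prior:   mean={wide_mean:.3f}, sd={wide_sd:.3f}")
```

The frequentist interval depends only on the data and the repeated-experiment guarantee. The Bayesian answer is pulled toward the prior when the prior is informative, and approaches the sample mean when the prior is very wide.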

The paper notes that particle physics usually favors the frequentist approach, while cosmology often prefers the Bayesian one. Both are valid, but they speak different languages.


3. The Stress Test (Validation)

The most important part of the paper is about validation. Just because an AI says it is 95% confident doesn't mean it is actually right 95% of the time. The paper suggests three ways to "stress test" these AI predictions:

  • Coverage Tests (The "Safety Net"): If an AI draws a safety net (a prediction interval) saying it will catch the true answer 95% of the time, you check the net. If you drop 100 balls and the net only catches 80, the AI is lying (it's overconfident). If it catches 99, it's being too cautious.
  • Bias Tests (The "Center of Gravity"): Is the AI's best guess consistently shifted to the left or right? Imagine a dartboard. If the AI's darts are all clustered tightly but 2 inches to the left of the bullseye, it has bias. It's precise, but not accurate.
  • Scoring Rules (The "Report Card"): Instead of just checking if the AI was right or wrong, this gives the AI a score based on how well its entire probability map matches reality. It rewards the AI for being honest about its uncertainty. If the AI says "I'm 50/50" and it's actually 50/50, it gets a good score. If it says "I'm 100% sure" and it's wrong, it gets a terrible score.
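
The three checks above can be sketched on simulated data. In this illustration the "AI" outputs a Gaussian prediction (a mean and a standard deviation) for each point, and the score is the Gaussian negative log-likelihood, one standard example of a proper scoring rule (lower is better). The three models, the function names, and the data are hypothetical; the paper may use different metrics.

```python
import math
import random


def coverage(y_true, mu, sd, z=1.96):
    """Fraction of true values inside the nominal 95% interval mu +/- z*sd."""
    hits = [abs(y - m) <= z * s for y, m, s in zip(y_true, mu, sd)]
    return sum(hits) / len(hits)


def mean_bias(y_true, mu):
    """Average signed offset of the predicted mean from the truth."""
    return sum(m - y for y, m in zip(y_true, mu)) / len(y_true)


def gaussian_nll(y_true, mu, sd):
    """Average negative log-likelihood of the truth under each Gaussian prediction."""
    total = 0.0
    for y, m, s in zip(y_true, mu, sd):
        total += 0.5 * math.log(2 * math.pi * s * s) + (y - m) ** 2 / (2 * s * s)
    return total / len(y_true)


rng = random.Random(1)
n = 5000
y_true = [rng.gauss(0.0, 1.0) for _ in range(n)]  # truth: mean 0, spread 1

# Three hypothetical models, each given as (predicted means, predicted sds).
models = {
    "calibrated": ([0.0] * n, [1.0] * n),
    "overconfident": ([0.0] * n, [0.5] * n),  # intervals too narrow
    "shifted": ([0.5] * n, [1.0] * n),        # best guess offset by 0.5
}

for name, (mu, sd) in models.items():
    print(
        f"{name:>13}: coverage={coverage(y_true, mu, sd):.3f}  "
        f"bias={mean_bias(y_true, mu):+.3f}  "
        f"nll={gaussian_nll(y_true, mu, sd):.3f}"
    )
```

The overconfident model should fall well short of 95% coverage, the shifted model should show a bias near its offset, and both should score worse than the calibrated model.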

4. The "Toy" Examples (What Happens in Practice?)

The authors tested these ideas on simple math problems (regression and classification) to see how different AI methods behave.

  • The "Safe Zone" (Interpolation): When the AI is asked to predict something similar to what it has seen before (like predicting the weather in July based on July data), almost all methods work well. They all give similar answers and similar confidence levels.
  • The "Danger Zone" (Extrapolation): When the AI is asked to predict something it has never seen (like predicting the weather in July based only on January data), things get messy.
    • The Lesson: In the danger zone, the AI's confidence isn't based on data anymore; it's based on assumptions.
    • The Analogy: Imagine a map of a city. If you ask the AI to tell you the street name of a house you've never seen, but it's on a road you know, it can guess. But if you ask it to tell you the street name of a house in a completely different country, it has to guess based on what it thinks cities look like.
    • The Result: The paper found that in these "unknown" zones, different AI methods give wildly different answers and confidence levels. None of them were perfectly reliable. The uncertainty they reported was mostly a reflection of their internal "personality" (their mathematical assumptions) rather than actual knowledge.
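
A tiny experiment in the same spirit, though not the paper's actual setup: two deliberately simple models with different built-in assumptions are trained on the same noisy samples of a sine curve between 0 and 1, then queried inside and far outside that range. The models (a straight-line fit and a nearest-neighbor lookup), the target function, and the query points are all assumptions chosen for the illustration.

```python
import math
import random


def true_function(x):
    return math.sin(x)


def fit_line(xs, ys):
    """Least-squares straight line: assumes the trend continues forever."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_nearest(xs, ys):
    """Nearest-neighbor lookup: assumes the answer matches the closest known point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


rng = random.Random(0)
train_x = [i / 49 for i in range(50)]  # training inputs cover [0, 1] only
train_y = [true_function(x) + rng.gauss(0.0, 0.02) for x in train_x]

line = fit_line(train_x, train_y)
nearest = fit_nearest(train_x, train_y)


def disagreement(x):
    """How far apart the two models' predictions are at x."""
    return abs(line(x) - nearest(x))


for x in (0.5, 3.0):  # 0.5 is interpolation, 3.0 is extrapolation
    print(
        f"x={x}: truth={true_function(x):.3f}  line={line(x):.3f}  "
        f"nearest={nearest(x):.3f}  disagreement={disagreement(x):.3f}"
    )
```

Inside the training range the two models should nearly agree with each other and with the truth. Far outside it, each simply follows its own assumption, so they disagree strongly and both miss the true value.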

Summary

This paper is a call for clarity and honesty in science.

  1. Stop mixing up words: Be clear about where your doubt comes from (statistical noise or a systematic flaw) and whether it can be reduced (epistemic) or not (aleatoric).
  2. Check your work: Don't just trust the AI's number. Use "coverage tests" and "bias tests" to see if the AI is actually telling the truth about its confidence.
  3. Beware of the unknown: When AI is asked to guess about things it hasn't seen, its reported confidence reflects its built-in assumptions, not the data. Scientists need to treat these "extrapolation" results with extreme caution.

The ultimate goal is to ensure that when AI helps make scientific discoveries, we know exactly how much we can trust the result.
