Amplitude Uncertainties Everywhere All at Once

This paper proposes and evaluates methods for generating ultra-fast, precise amplitude surrogates for LHC event generation by investigating noise reduction in network ensembles, establishing evidential regression as a sampling-free uncertainty quantification tool, and demonstrating that learned uncertainties effectively identify numerical noise and data gaps in amplitude regression.

Original authors: Henning Bahl, Nina Elmer, Tilman Plehn, Ramon Winterhalder

Published 2026-03-16

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to predict the weather. You have a supercomputer that can calculate the temperature, wind speed, and rain probability for every spot on Earth. But, like any computer, it sometimes makes mistakes, especially in tricky areas like mountain ranges or storm fronts.

In the world of particle physics, scientists face a similar problem. They use complex math to predict how particles collide and scatter off each other (like tiny, invisible billiard balls). These calculations are computationally expensive, and they have to be repeated for enormous numbers of simulated collisions. To speed things up for the Large Hadron Collider (LHC), scientists are training AI "surrogates": smart shortcuts that learn to predict the outcome of these calculations almost instantly.

But here's the catch: The AI needs to know when it doesn't know. If the AI predicts a particle collision with 99.9% confidence but is actually wrong, the error can slip unnoticed into the simulation and distort the physics results. This paper is about teaching these AI surrogates to say, "I'm pretty sure," or "I'm totally guessing," and to be honest about why.

Here is a breakdown of the paper's main ideas using simple analogies:

1. The Problem: The "Confident Fool"

Imagine a student taking a math test.

The Old Way: The student gives an answer. Sometimes it's right, sometimes wrong. The teacher doesn't know if the student is guessing or if they actually know the answer.
The Goal: We want the student to give an answer and a confidence score. If they are 100% confident but wrong, that's a disaster. We need the AI to be "well-calibrated," meaning if it says "90% sure," it should be right 90% of the time.

2. The Three Methods Tested

The authors tested three different ways to teach the AI to measure its own uncertainty. Think of these as three different study groups:

A. Repulsive Ensembles (The "Debate Club")

How it works: Instead of one AI, you train 100 slightly different AIs. You force them to be different from each other (like telling 100 students to write essays on the same topic but forbidding them from copying each other).
The Logic: If all 100 AIs agree, you are confident. If they all give different answers, you know the answer is tricky, and your uncertainty is high.
The Paper's Discovery:
- The Good: This method is great at spotting "noise" (random errors in the data).
- The Bad: If the AI has a fundamental flaw (like a bad teacher), all 100 students might make the same mistake. The group thinks they are confident because they all agree, but they are all wrong. The paper found a way to fix this by teaching the group to admit, "Hey, we might all be biased," and adjusting their confidence accordingly.

B. Evidential Regression (The "Single Expert with a Diary")

How it works: Instead of 100 AIs, you have just one super-smart AI. But this AI doesn't just output a number; it outputs a "diary entry" about how much evidence it has seen.
The Logic: It's like a weather forecaster who says, "I predict rain, and I have seen 500 days of rain data to back this up." If they have seen very little data, they admit they are unsure.
The Paper's Discovery: This is much faster than the "Debate Club" because you only run one AI. It works surprisingly well, almost as well as the 100-AI group, but it sometimes struggles to draw sharp lines around "tricky zones" (like sudden changes in particle behavior).

C. Bayesian Neural Networks (The "Gambler's Intuition")

How it works: This is a classic method where the AI treats its internal settings (the network weights) as probability distributions rather than fixed numbers, so every prediction comes with a spread of plausible answers.
The Paper's Discovery: It performed very well, acting as a solid benchmark. It was good at spotting when data was missing.

3. The "Tricky Zones" (Where the AI gets confused)

The authors tested these methods in three specific "nightmare scenarios" to see if the AI could handle them:

Scenario 1: The "Fuzzy Box" (Flat Noise)
- Analogy: Imagine a region on a map where the GPS signal is slightly staticky.
- Result: All three methods realized, "Hey, this area is fuzzy," and raised their uncertainty alarms. They did a great job.
Scenario 2: The "Spiky Peak" (Peaked Noise)
- Analogy: Imagine a mountain peak where the GPS signal gets terrible only right at the very top, but is perfect everywhere else.
- Result: The "Debate Club" (Ensembles) and the "Gambler" (Bayesian) were the best at spotting this sharp spike in confusion. The "Single Expert" (Evidential) was okay but missed the sharpest edges.
Scenario 3: The "Missing Map" (Data Gaps)
- Analogy: Imagine a blank spot on the map where no data was ever collected.
- Result: This is the hardest test. The AI has to guess what's in the blank spot.
- The Surprise: The AI managed to guess the answer quite well because the "terrain" (the physics) was smooth and flat in that area. However, the AI correctly shouted, "I'm guessing here! My uncertainty is huge!" This is exactly what we want. It didn't pretend to know the answer; it admitted it was in the dark.

4. The Big Takeaway

The paper concludes that there is no single "perfect" method, but we now have a better toolkit:

If you have time and computing power: Use the "Debate Club" (Repulsive Ensembles). It's the most reliable at spotting when the AI is confused or biased.
If you need speed: Use the "Single Expert" (Evidential Regression). It's fast and usually accurate, though it needs a little tuning to handle sharp edges.
The Golden Rule: The most important thing is that these AI surrogates can now tell us when they are unsure. This allows physicists to trust the AI for routine calculations but know exactly when to double-check the math manually.

In short: The authors taught the AI to stop pretending it knows everything. By giving the AI an "honesty meter," they are making the future of particle physics faster, safer, and more reliable.

1. Problem Statement

In high-energy physics, specifically for the High-Luminosity Large Hadron Collider (HL-LHC), generating precise Monte Carlo event simulations is computationally expensive. Machine learning (ML) surrogates are increasingly used to approximate complex scattering amplitudes. However, for these surrogates to be reliable, they must not only predict the mean amplitude with high precision but also provide calibrated local uncertainty estimates.

The paper addresses three critical challenges:

Calibration of Systematic Uncertainties: Existing methods, particularly Repulsive Ensembles (REs), often fail to correctly calibrate systematic uncertainties in specific regions of phase space due to model biases.
Efficiency vs. Accuracy: Traditional Bayesian Neural Networks (BNNs) and Ensembles are computationally expensive. The authors investigate Evidential Regression (ER) as a sampling-free alternative.
Localized Data Deficiencies: Real-world amplitude calculations often suffer from numerical noise or gaps (e.g., near physical thresholds). The study tests whether ML models can identify and quantify these localized disturbances.

2. Methodology

The authors benchmark three probabilistic approaches for amplitude regression on the partonic process $gg \to \gamma\gamma g$:

A. Repulsive Ensembles (RE)

Mechanism: Trains an ensemble of neural networks where a repulsive kernel term in the loss function prevents members from collapsing to the same minimum, thereby encoding statistical uncertainty via the spread of predictions.
Bias Analysis: The authors investigate whether the ensemble mean reduces bias. They find that while ensembles reduce statistical noise, they do not eliminate systematic biases caused by limited model expressivity (e.g., insufficient network depth).
Systematic Uncertainty Correction: They propose a novel two-step or combined training method where a separate network learns the systematic uncertainty ($\sigma_{syst}$) specifically for the ensemble mean, rather than averaging individual member uncertainties. This corrects miscalibration caused by the mismatch between ensemble variance and ensemble mean accuracy.
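
The bias argument above can be illustrated with a toy simulation. The sketch below assumes each ensemble member predicts the truth plus a shared bias plus independent Gaussian noise; the function names and the values of `bias` and `noise` are made up for illustration and are not taken from the paper. Averaging shrinks the noise as $1/\sqrt{N_{ens}}$ but leaves the shared bias untouched.

```python
import math
import random


def ensemble_rms_error(n_members, bias=0.01, noise=0.1, n_points=2000, seed=0):
    """RMS error of the ensemble mean in a toy model (hypothetical numbers).

    Each member predicts truth + bias + independent Gaussian noise,
    so the truth can be set to zero without loss of generality.
    """
    rng = random.Random(seed)
    sq_err = 0.0
    for _ in range(n_points):
        members = [bias + rng.gauss(0.0, noise) for _ in range(n_members)]
        mean = sum(members) / n_members
        sq_err += mean ** 2
    return math.sqrt(sq_err / n_points)


def expected_rms_error(n_members, bias=0.01, noise=0.1):
    """Analytic expectation: the noise averages down, the bias does not."""
    return math.sqrt(bias ** 2 + noise ** 2 / n_members)


for n in (1, 4, 16, 64, 256):
    print(n, round(ensemble_rms_error(n), 4), round(expected_rms_error(n), 4))
```

In this toy setup the error of the ensemble mean keeps falling with the number of members until it approaches the assumed bias, which is the "bias floor" behaviour described in the text.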

B. Evidential Regression (ER)

Mechanism: Instead of sampling weights, ER places a prior (Normal-Inverse-Gamma distribution) over the likelihood parameters (mean and variance). The network outputs evidential parameters ($\gamma, v, \alpha, \beta$) which are analytically integrated to yield a Student-t predictive distribution.
Advantage: This is a sampling-free method, making it computationally efficient.
Degeneracy Handling: The authors compare two strategies to break the degeneracy in the four output parameters:
1. Adding a regularization loss ($L_R$) penalizing high evidence for incorrect predictions.
2. Constraining the relationship between parameters (e.g., $2\alpha = v$).
  They find the constraint method yields slightly better results.
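
For orientation, the sketch below uses the standard Normal-Inverse-Gamma results from the evidential regression literature to turn the four outputs into a prediction and two variance components. It is a minimal illustration, not the authors' implementation: the function name and the example numbers are hypothetical, and the paper's exact parametrization may differ.

```python
import math


def nig_summary(gamma, v, alpha, beta):
    """Moments of a Normal-Inverse-Gamma(gamma, v, alpha, beta) prior.

    Standard results (valid for alpha > 1):
      prediction           E[mu]      = gamma
      data-noise variance  E[sigma^2] = beta / (alpha - 1)
      model variance       Var[mu]    = beta / (v * (alpha - 1))
    The predictive distribution is a Student-t with 2*alpha degrees of
    freedom, location gamma, and scale^2 = beta * (1 + v) / (v * alpha).
    """
    if alpha <= 1 or v <= 0 or beta <= 0:
        raise ValueError("need alpha > 1, v > 0, beta > 0")
    return {
        "mean": gamma,
        "data_var": beta / (alpha - 1),
        "model_var": beta / (v * (alpha - 1)),
        "student_t_dof": 2 * alpha,
        "student_t_scale": math.sqrt(beta * (1 + v) / (v * alpha)),
    }


# Hypothetical network outputs for one phase-space point.
print(nig_summary(gamma=1.0, v=4.0, alpha=3.0, beta=0.5))
```

The variance of the Student-t distribution equals the sum of the two components, so no sampling is needed to obtain the total uncertainty. Under a constraint such as $2\alpha = v$, only three of the four outputs remain independent.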

C. Bayesian Neural Networks (BNN)

BNNs serve as a benchmark (based on previous work) against which RE and ER are compared, particularly in the localized noise scenarios.

D. Localized Learning Challenges

The authors simulate realistic data issues by introducing:

Flat-box smearing: Gaussian noise applied to a specific invariant mass range.
Peaked smearing: Noise that increases sharply near a threshold.
Threshold gaps: Complete removal of training data in a specific phase-space region.
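
A toy version of these three modifications is sketched below. Everything in it is an illustrative assumption rather than the paper's setup: the data are `(mass, amplitude)` pairs, the noise is taken to be multiplicative, the window edges, threshold position, and noise widths are invented, and the peaked profile uses a hypothetical Lorentzian-like shape.

```python
import random


def flat_box_smear(data, lo, hi, rel_sigma, rng):
    """Apply Gaussian noise of fixed relative width to points with lo <= mass <= hi."""
    out = []
    for m, a in data:
        if lo <= m <= hi:
            a = a * (1.0 + rng.gauss(0.0, rel_sigma))
        out.append((m, a))
    return out


def peaked_smear(data, threshold, width, max_rel_sigma, rng):
    """Apply Gaussian noise whose width is largest at the threshold and falls off away from it."""
    out = []
    for m, a in data:
        rel_sigma = max_rel_sigma / (1.0 + ((m - threshold) / width) ** 2)
        out.append((m, a * (1.0 + rng.gauss(0.0, rel_sigma))))
    return out


def cut_gap(data, lo, hi):
    """Remove all training points with lo <= mass <= hi."""
    return [(m, a) for m, a in data if not (lo <= m <= hi)]


rng = random.Random(1)
clean = [(float(m), 1.0) for m in range(100, 400)]  # flat toy amplitude
boxed = flat_box_smear(clean, 200.0, 250.0, 0.05, rng)
peaked = peaked_smear(clean, 300.0, 5.0, 0.05, rng)
gapped = cut_gap(clean, 200.0, 250.0)
print(len(clean), len(gapped))
```

A surrogate trained on `boxed` or `peaked` should report a larger uncertainty where the noise was injected, and one trained on `gapped` should report a larger uncertainty inside the removed window.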

3. Key Contributions

Diagnosis of Ensemble Bias: The paper demonstrates that Repulsive Ensembles inherit the systematic bias of individual network members. If the network architecture lacks expressivity, the ensemble mean remains biased, and simple averaging of uncertainties leads to miscalibration.
Global Systematic Uncertainty Learning: A new method is proposed to learn $\sigma_{syst}$ directly for the ensemble mean. This addresses the miscalibration observed in REs, which matters most for large ensemble sizes where the bias becomes the dominant error source.
Evidential Regression Validation: ER is established as an efficient, sampling-free alternative to ensembles. The study shows that constraining the evidential parameters ($2\alpha = v$) works slightly better than a regularization loss for this specific physics task.
Localization Capabilities: The study shows that all three methods (RE, ER, BNN) can identify and quantify localized numerical noise, while their response to data gaps differs:
- RE and BNN show superior ability to disentangle smooth underlying amplitudes from localized noise via interpolation.
- ER performs well but struggles slightly with sharp edges in noise profiles compared to RE.
- In data gap scenarios, RE and BNN correctly predict increased statistical uncertainty in the missing region, whereas ER tends to produce flat uncertainty estimates.
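
One way to see why the systematic uncertainty should be fitted to the ensemble mean is through the Gaussian negative log-likelihood. The sketch below assumes a Gaussian loss with a single constant $\sigma$ for simplicity; the function names and residual values are invented. For such a loss the optimal constant $\sigma$ equals the RMS residual of whatever prediction it is attached to, so a width tuned to individual members will generally not match the residuals of their mean.

```python
import math


def gaussian_nll(residuals, sigma):
    """Mean Gaussian negative log-likelihood for a constant sigma (constant terms dropped)."""
    return sum(r * r / (2.0 * sigma ** 2) + math.log(sigma) for r in residuals) / len(residuals)


def best_constant_sigma(residuals):
    """The sigma that minimises gaussian_nll: the RMS of the residuals."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


# Invented residuals (prediction minus truth) for two members and their mean.
member_a = [0.30, -0.10, 0.25, -0.05]
member_b = [-0.20, 0.20, -0.15, 0.15]
ens_mean = [(a + b) / 2 for a, b in zip(member_a, member_b)]

sigma_members = (best_constant_sigma(member_a) + best_constant_sigma(member_b)) / 2
sigma_mean = best_constant_sigma(ens_mean)
print(round(sigma_members, 3), round(sigma_mean, 3))
```

In this toy case the averaged member widths and the width fitted to the ensemble mean differ clearly, which is the kind of mismatch the global $\sigma_{syst}$ training is meant to remove.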

4. Key Results

Uncertainty Calibration:
- Standard REs miscalibrate systematic uncertainties when model bias is present. The proposed "global $\sigma_{syst}$" training fixes this, yielding well-calibrated pulls ($t_{syst} \sim \mathcal{N}(0,1)$) for small ensembles.
- For large ensembles ($N_{ens} > 100$), residual non-Gaussian biases cause peaks in the pull distribution at $\pm 1$, indicating the Gaussian likelihood assumption breaks down for the bias component.
Accuracy:
- All methods achieve mean relative accuracy of $\langle |\Delta| \rangle \sim 3 \times 10^{-5}$ on clean data.
- Ensemble accuracy scales as $1/\sqrt{N_{ens}}$ until hitting a "bias floor" determined by network expressivity.
Localized Noise (Smearing):
- Flat Box: All methods identify the smeared region. RE follows the expected noise profile most closely; ER and BNN provide better-calibrated pulls.
- Peaked Threshold: RE and BNN successfully capture the sharp increase in noise near the threshold. ER struggles to capture the peak magnitude accurately.
Data Gaps:
- Networks maintain reasonable accuracy in gap regions due to interpolation capabilities (the amplitude is flat in the gap).
- RE and BNN both predict increased statistical uncertainty within the gap.
- For the BNN this comes with a global increase in uncertainty, while RE keeps the increase localized.
- ER fails to capture the localized nature of the gap uncertainty, predicting a flat uncertainty profile.
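
The calibration statements above rest on pull distributions. A minimal sketch follows, assuming the pull is defined as (prediction minus truth) divided by the predicted uncertainty, and using synthetic numbers rather than any results from the paper; the function names are hypothetical.

```python
import random
import statistics


def pulls(predictions, truths, sigmas):
    """Pull t = (prediction - truth) / sigma for each point."""
    return [(p - y) / s for p, y, s in zip(predictions, truths, sigmas)]


def pull_summary(t):
    """Mean, standard deviation, and fraction with |t| < 1 (about 0.683 if Gaussian)."""
    within_one = sum(1 for x in t if abs(x) < 1.0) / len(t)
    return statistics.fmean(t), statistics.pstdev(t), within_one


rng = random.Random(2)
truths = [1.0] * 5000
sigmas = [0.01] * 5000
# Calibrated toy case: the scatter of the predictions matches the quoted sigma.
preds = [y + rng.gauss(0.0, s) for y, s in zip(truths, sigmas)]

print(pull_summary(pulls(preds, truths, sigmas)))
# Halving the quoted sigma makes the pulls twice as wide, signalling overconfidence.
print(pull_summary(pulls(preds, truths, [s / 2 for s in sigmas])))
```

A well-calibrated uncertainty gives pulls with mean near zero and width near one; a width clearly above one indicates underestimated uncertainties, and a width below one indicates overly conservative ones.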

5. Significance

This work provides a critical "toolkit" for the next generation of particle physics simulations:

Reliability: It clarifies that simply using ensembles is insufficient for uncertainty quantification if the underlying model is biased; specific training strategies for systematic uncertainty are required.
Efficiency: It validates Evidential Regression as a viable, fast alternative to ensembles for many scenarios, though it highlights limitations in handling sharp, localized data gaps.
Robustness: The methods are shown to flag the types of numerical instabilities (threshold smearing, gaps) common in loop-induced amplitude calculations, with ER the least sensitive to data gaps.
Future Impact: These insights are essential for developing "surrogate" models that can replace traditional, slow calculations in High-Luminosity LHC event generators, ensuring that physics analyses remain statistically sound and systematic uncertainties are under control.