Under-coverage in high-statistics counting experiments… — Plain-Language Explanation

Original authors: Cristina-Andreea Alexe, Joshua Bendavid, Lorenzo Bianchini, Davide Bruschini

Published 2026-02-09

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to solve a mystery: How many times did a specific event happen? (For example, how many times was a rare particle created in a giant collider?)

To solve this, you have two tools:

Real Evidence: A huge pile of data collected from the actual experiment (the "Data").
Theoretical Map: A computer simulation that predicts what the data should look like if your theory is correct (the "Monte Carlo" or MC).

Usually, scientists assume that if they have a lot of data and a lot of simulation, the standard approximations in their math will hold. They use a standard "ruler" (called the Profile-Likelihood Ratio) to draw a confidence interval: a range built so that, if the whole experiment were repeated many times, about 68% of such ranges would contain the true answer.

The Paper's Big Discovery:
The authors of this paper found that even when you have massive amounts of data and simulation, this standard "ruler" can be broken. It gives you a range that is too narrow, which makes you feel more confident than you should be. In statistics, this is called under-coverage. It's like a weather forecaster who announces a 99% chance of sunshine every day, yet it rains on far more than 1% of those days.
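To make "coverage" concrete, here is a small simulation, assuming an idealised measurement whose estimate scatters as a Gaussian around the true value. Quoting the correct width gives intervals that contain the truth about 68.3% of the time; quoting a width that is 20% too small (an arbitrary illustrative choice, not a number from the paper) under-covers. The function name `coverage` is hypothetical.

```python
import math
import random


def coverage(quoted_sigma, true_sigma=1.0, true_mu=0.0, n_toys=100_000, seed=0):
    """Fraction of pseudo-experiments whose interval mu_hat +/- quoted_sigma contains true_mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_toys):
        mu_hat = rng.gauss(true_mu, true_sigma)  # one experiment's estimate
        if mu_hat - quoted_sigma <= true_mu <= mu_hat + quoted_sigma:
            hits += 1
    return hits / n_toys


honest = coverage(quoted_sigma=1.0)         # correct width
overconfident = coverage(quoted_sigma=0.8)  # width 20% too small
# Exact values for comparison: P(|Z| <= q) = erf(q / sqrt(2)).
print(honest, math.erf(1.0 / math.sqrt(2.0)))
print(overconfident, math.erf(0.8 / math.sqrt(2.0)))
```

The honest interval lands near 0.683, while the narrower one lands near 0.58: it still "looks" like a 68% interval but is wrong more often than claimed.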

Here is the breakdown of why this happens, using simple analogies:

1. The "Fuzzy Map" Problem

Imagine your "Theoretical Map" (the simulation) isn't a perfect, high-definition photo. Because computers can't run infinite simulations, the map is made of a finite number of pixels. These pixels have a little bit of "static" or "noise" (statistical fluctuations).

The Old Assumption: Scientists thought, "If we have enough real data, the noise in our map doesn't matter."
The Reality: The paper shows that the noise in the map interacts with the noise in the real data in a tricky way. It's like trying to measure the length of a table using a slightly wobbly ruler whose markings were printed a little unevenly. Even if you measure the table a million times, the ruler's own error never averages away, so your final measurement will still be off.
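The "wobbly ruler" point can be illustrated with a single-bin toy, assuming large counts so that Poisson fluctuations are approximated by Gaussians. One finite MC sample fixes the template once; repeating the data measurement many times then averages away the data noise but not the template's own fluctuation. The function name and all numbers are hypothetical, and this shows only the simplest version of the effect, not the paper's full analysis.

```python
import math
import random


def repeat_with_fixed_template(mu_true=1.0, s_true=10_000.0, k=1.0,
                               n_repeats=20_000, seed=3):
    """Return (template estimate, average fitted mu over many data repeats)."""
    rng = random.Random(seed)
    # One MC sample with k times the expected signal yield, rescaled to data size.
    n_mc = k * s_true
    s_hat = rng.gauss(n_mc, math.sqrt(n_mc)) / k
    total = 0.0
    for _ in range(n_repeats):
        # Fresh data each time; Gaussian approximation to a Poisson count.
        y = rng.gauss(mu_true * s_true, math.sqrt(mu_true * s_true))
        total += y / s_hat  # one-bin fit: observed count / predicted count per unit mu
    return s_hat, total / n_repeats


s_hat, mu_avg = repeat_with_fixed_template()
# The average converges to mu_true * s_true / s_hat, not to mu_true itself.
print(s_hat, mu_avg, 1.0 * 10_000.0 / s_hat)
```

The leftover offset is set by the single MC sample (of order $1/\sqrt{k\,s}$, about 1% here), no matter how many times the data are repeated.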

2. The "Tightrope" Analogy

The paper uses a toy model to explain this. Imagine you are trying to balance two weights on a tightrope:

Weight A: The Signal (the rare particle you want to find).
Weight B: The Background (common noise that looks like the signal).

These two weights are highly correlated. If you move one, the other has to move to keep the balance. The math gets very sensitive here.

Because the "Map" (simulation) has noise, the scientists' calculation of how sensitive the balance is becomes artificially sharp. The math thinks, "Oh, I know exactly where the balance point is!" but it's actually just an illusion caused by the noise in the map. This makes the calculated "confidence interval" (the safety zone) shrink too much.
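The sensitivity described here can be quantified with a two-parameter Hessian, assuming for simplicity unit curvature in both parameters and an off-diagonal term whose magnitude equals the correlation coefficient (hypothetical numbers, not the paper's toy model). The profiled uncertainty on the signal strength is $1/\sqrt{1-\rho^2}$ times the uncertainty obtained with the background held fixed, so near $\rho = 1$ a small change in the apparent correlation changes the interval a lot. The helper names are hypothetical.

```python
import math


def profiled_sigma(h_mm, h_mt, h_tt):
    """sqrt((H^-1)_mumu) for a 2x2 Hessian [[h_mm, h_mt], [h_mt, h_tt]] of -ln L."""
    det = h_mm * h_tt - h_mt ** 2
    return math.sqrt(h_tt / det)


def widening_factor(rho, h_mm=1.0, h_tt=1.0):
    """Ratio of profiled to fixed-theta uncertainty on mu; equals 1/sqrt(1 - rho^2)."""
    h_mt = rho * math.sqrt(h_mm * h_tt)  # off-diagonal curvature set by the correlation
    sigma_fixed = 1.0 / math.sqrt(h_mm)  # uncertainty if theta were known exactly
    return profiled_sigma(h_mm, h_mt, h_tt) / sigma_fixed


for rho in (0.0, 0.9, 0.98, 0.99):
    print(rho, round(widening_factor(rho), 3))
```

If noise in the map made a true correlation of 0.99 look like 0.98, the quoted interval would shrink by roughly 30% (from about 7.1 to about 5.0 times the fixed-background width), which is the kind of artificial sharpening the analogy describes.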

3. Why "More Data" Doesn't Always Fix It

You might think, "If I just get more simulation data, the map becomes perfect, and the problem goes away."

The Paper says: Yes, eventually, if you have enormous amounts of simulation data (much more than the real data; about 40 times more in the authors' toy model), the problem disappears.
The Catch: In real-world physics (like at the Large Hadron Collider), getting that much simulation data is often too expensive or takes too long. So, scientists are stuck with "fuzzy maps."

4. The "Broken Ruler" Tests

The authors tested many different ways to fix the math:

Standard Methods: Failed (too narrow).
Complex "Feldman-Cousins" Methods: These are more rigorous statistical tools that don't rely on the "perfect ruler" assumption. The authors tried them, but they also failed to give the correct coverage when the simulation had noise. The noise in the map messed up even these advanced tools.

5. The Proposed "Heuristic" Solution

Since the perfect mathematical solution is too hard to calculate for real-world problems, the authors propose a practical hack (a heuristic).

Think of it like this:

Calculate the uncertainty using the standard "wobbly ruler" (which is too small).
Calculate what the uncertainty would be if the map were perfect (using a specific formula).
Mix them together using a specific recipe (Equation 26 in the paper).

This "mixed" uncertainty is wider and more honest. It acts as a safety net: in the authors' tests, intervals quoted as 68% contain the true value much closer to 68% of the time, even with a noisy simulation.

Summary

The Problem: In high-statistics physics experiments, using finite computer simulations to model the data causes standard statistical methods to be overconfident. They claim to know the answer better than they actually do.
The Cause: The "noise" in the computer simulation interacts with the data in a way that tricks the math into thinking the answer is more precise than it is.
The Solution: Don't trust the standard math blindly. Use a new, practical formula that combines different types of uncertainty estimates to widen the safety zone and get the coverage right.

The paper essentially warns physicists: "Just because you have a lot of data doesn't mean you are in the asymptotic regime where the standard approximations become reliable. If your computer simulations are finite, your confidence intervals are likely too tight, and you need to adjust for it."

Technical Summary: Under-coverage in High-Statistics Counting Experiments with Finite MC Samples

Problem Statement
This paper addresses the problem of setting confidence intervals (CIs) for a parameter of interest (POI) in high-statistics, binned counting experiments where the physics model is derived from finite-size Monte Carlo (MC) simulated samples. While standard statistical inference in particle physics often relies on the asymptotic properties of maximum-likelihood estimators (MLEs)—specifically Wilks' theorem for the profile-likelihood ratio (PLR) and the Hessian matrix for uncertainties—this work investigates whether these approximations hold when MC samples are finite, even when both data and simulation event counts are large.

The core issue identified is systematic under-coverage: confidence intervals constructed using standard asymptotic methods (e.g., Hessian uncertainties or PLR with Wilks' theorem) fail to contain the true parameter value at the claimed confidence level (e.g., 68.3%). This occurs in the presence of nuisance parameters (NPs) modeling systematic uncertainties and finite MC statistics, a scenario common in precision measurements such as the W boson mass determination at the LHC.
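For reference, the two standard asymptotic constructions can be written out. The definitions below are the textbook ones (POI $\mu$, NPs $\theta$, likelihood $L$, with $\hat{\hat{\theta}}(\mu)$ the conditional MLE of the NPs at fixed $\mu$ and $\alpha = (\mu, \theta)$), not formulas quoted from the paper:

$$
t_\mu = -2\ln\frac{L\big(\mu,\hat{\hat{\theta}}(\mu)\big)}{L\big(\hat{\mu},\hat{\theta}\big)},
\qquad
\mathrm{CI}_{\mathrm{PLR}} = \{\,\mu : t_\mu \le 1\,\}
$$

$$
\sigma_H^2 = \big[H^{-1}\big]_{\mu\mu},
\qquad
H_{ab} = -\left.\frac{\partial^2 \ln L}{\partial \alpha_a\,\partial \alpha_b}\right|_{\alpha=(\hat{\mu},\hat{\theta})},
\qquad
\mathrm{CI}_{H} = [\,\hat{\mu}-\sigma_H,\ \hat{\mu}+\sigma_H\,]
$$

Wilks' theorem states that, asymptotically, $t_\mu$ evaluated at the true $\mu$ follows a $\chi^2$ distribution with one degree of freedom; since $P(\chi^2_1 \le 1) = \operatorname{erf}(1/\sqrt{2}) \approx 0.683$, the threshold 1 yields the nominal 68.3% level. Under-coverage means the true $\mu$ falls inside these intervals less often than that.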

Methodology
The authors employ a two-pronged approach: a detailed numerical study using a "paradigmatic toy model" and a general analytical derivation.

Toy Model:
- A hypothetical experiment is constructed with $n$ histogram bins, large event counts per bin ($y_i \gg 1$), and a model describing signal and background processes.
- The model parameters include a POI ($\mu$) and a nuisance parameter ($\theta$).
- Crucially, the expected event counts are not known analytically but are predicted by MC samples of finite size ($t_{ji}$), introducing statistical fluctuations.
- The study compares various CI-setting methods:
  - Asymptotic methods: Hessian uncertainty and PLR based on the Barlow-Beeston (BB) likelihood (full and "lite" versions).
  - Non-asymptotic methods: Profiled Feldman-Cousins (FC), Simplified FC, Cousins-Highlands (CH), and Bartlett-corrected PLR.
- Coverage is evaluated by generating $10^4$ pseudo-experiments and checking the fraction where the true parameter lies within the calculated interval.
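With a finite number of pseudo-experiments, the measured coverage itself has a statistical uncertainty. Assuming each toy independently either covers or not, the standard binomial error gives a quick sense of how precisely $10^4$ toys pin down a 68.3% coverage; the helper name is hypothetical.

```python
import math


def coverage_uncertainty(coverage, n_toys):
    """Binomial standard error on a coverage fraction estimated from n_toys pseudo-experiments."""
    return math.sqrt(coverage * (1.0 - coverage) / n_toys)


nominal = math.erf(1.0 / math.sqrt(2.0))  # 0.6827, the 1-sigma (68.3%) level
print(round(coverage_uncertainty(nominal, 10_000), 4))
```

The result is about 0.005 (half a percentage point), so coverage deficits of a few percent are clearly resolved with $10^4$ toys.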
General Analytical Framework:
- The authors derive the behavior of the profile-likelihood ratio in the Gaussian approximation for large event counts.
- They treat the statistical fluctuations of MC templates as perturbations to the Jacobian matrix of the model function with respect to the POI and nuisance parameters.
- Using a perturbative expansion, they analyze the bias introduced into the quadratic form $S$ (which relates to the inverse variance of the estimator) by the finite size of MC samples.
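As a sketch of the Gaussian-approximation setup (the standard linearised least-squares form; the paper's exact notation and normalisation may differ), let $r$ be the vector of whitened residuals, $b$ the whitened Jacobian column with respect to $\mu$, and $A$ the matrix of whitened Jacobian columns with respect to the NPs $\theta$. Profiling $\theta$ leaves a quadratic form in $\mu$ whose coefficient is $S$:

$$
\chi^2(\mu,\theta) = \big\lVert r - b\,\mu - A\,\theta \big\rVert^2,
\qquad
\min_{\theta}\chi^2(\mu,\theta) = \chi^2_{\min} + S\,(\mu-\hat{\mu})^2
$$

$$
S = b^{\mathsf T}\Big[I - A\,(A^{\mathsf T}A)^{-1}A^{\mathsf T}\Big]\,b = \big(b^{\mathsf T}b\big)\big(1-\rho_\mu^2\big),
\qquad
\sigma_\mu = S^{-1/2}
$$

The second form follows from the definition of the global correlation coefficient, $\rho_\mu^2 = 1 - \big(V_{\mu\mu}\,[V^{-1}]_{\mu\mu}\big)^{-1}$ with $V$ the covariance matrix, which here gives $V_{\mu\mu} = 1/S$ and $[V^{-1}]_{\mu\mu} = b^{\mathsf T}b$. When $\rho_\mu$ is close to 1, $S$ is a small difference of two large terms, so a modest positive shift in $\hat{S}$ caused by noise in $b$ and $A$ is a large relative change in $\sigma_\mu$.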

Key Results

Breakdown of Asymptotics: Even with large event counts per bin ($y_i \sim 10^4$) and MC samples comparable to or larger than the data, standard asymptotic methods (Hessian and PLR) exhibit significant under-coverage. The Barlow-Beeston "lite" approximation, which treats MC uncertainty as a simple rescaling of data variance, fails to restore correct coverage.
Failure of Non-Asymptotic Alternatives: Methods that do not rely on Wilks' theorem, such as the Profiled Feldman-Cousins approach, also suffer from under-coverage. The authors attribute this to the difficulty of handling nuisance parameters (specifically those related to MC fluctuations) in the construction of the acceptance region.
Source of Bias: The analytical study reveals that statistical fluctuations in the MC templates induce a positive bias in the estimated inverse variance ($\hat{S}$), i.e. an underestimated uncertainty on the POI.
- This bias arises from fluctuations in the Jacobian matrix components ($A$ and $b$).
- The bias is particularly severe when the POI is highly correlated with nuisance parameters (high global correlation coefficient $\rho_\mu$).
- The bias term is not simply proportional to $1/k$ (where $k$ is the MC-to-data ratio), explaining why simple rescaling methods (like BB-lite) are insufficient.
Recovery Conditions: Correct coverage is only restored in the limit where the MC statistical power is extremely large relative to the data (e.g., $k \approx 40$ in the toy model) or when the number of bins is significantly reduced.
Heuristic Solution: The authors propose a heuristic confidence interval (Eq. 25) that combines the Hessian uncertainty from the full Barlow-Beeston likelihood with the asymptotic uncertainty from infinite MC statistics. This heuristic interval demonstrates coverage properties much closer to the ideal Feldman-Cousins construction across various model configurations.
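The positive bias in $\hat{S}$ can be reproduced qualitatively in a few lines. The sketch below is an illustration under simplifying assumptions, not the paper's derivation: it uses a linearised, whitened model with a single nuisance parameter, where the inverse variance of $\mu$ after profiling is $S = b^{\mathsf T}b - (a^{\mathsf T}b)^2/(a^{\mathsf T}a)$; it builds hypothetical unit-norm Jacobian columns `b` (for $\mu$) and `a` (for $\theta$) with correlation `rho`, and adds independent Gaussian noise of width `sigma` to every component to mimic finite MC templates. The helper names (`s_profiled`, `make_columns`, `mean_s_hat`) and all numbers are hypothetical. Because only an overall noise width is varied, the sketch says nothing about how the bias scales with the MC-to-data ratio $k$ in the paper's full treatment.

```python
import math
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def s_profiled(b, a):
    """Inverse variance of mu after profiling one NP: S = b.b - (a.b)^2 / (a.a)."""
    return dot(b, b) - dot(a, b) ** 2 / dot(a, a)


def make_columns(n_bins, rho, rng):
    """Hypothetical unit-norm Jacobian columns b (POI) and a (NP) with a.b = rho."""
    u = [rng.gauss(0.0, 1.0) for _ in range(n_bins)]
    norm_u = math.sqrt(dot(u, u))
    u = [x / norm_u for x in u]
    # Gram-Schmidt: make w orthogonal to u, then normalise it.
    w = [rng.gauss(0.0, 1.0) for _ in range(n_bins)]
    proj = dot(w, u)
    w = [x - proj * y for x, y in zip(w, u)]
    norm_w = math.sqrt(dot(w, w))
    w = [x / norm_w for x in w]
    a = [rho * x + math.sqrt(1.0 - rho ** 2) * y for x, y in zip(u, w)]
    return u, a


def mean_s_hat(n_bins, rho, sigma, n_rep=2000, seed=1):
    """Return (true S, average S-hat over n_rep noisy copies of the columns)."""
    rng = random.Random(seed)
    b, a = make_columns(n_bins, rho, rng)
    total = 0.0
    for _ in range(n_rep):
        # Each replica mimics one finite MC sample: both columns fluctuate.
        b_hat = [x + rng.gauss(0.0, sigma) for x in b]
        a_hat = [x + rng.gauss(0.0, sigma) for x in a]
        total += s_profiled(b_hat, a_hat)
    return s_profiled(b, a), total / n_rep


if __name__ == "__main__":
    for n_bins in (10, 100):
        s_true, s_avg = mean_s_hat(n_bins, rho=0.95, sigma=0.02)
        print(f"n_bins={n_bins}: S={s_true:.4f}, <S_hat>={s_avg:.4f}, "
              f"ratio={s_avg / s_true:.2f}")
```

In this toy the average $\hat{S}$ exceeds the true $S = 1 - \rho^2$, and the excess grows with the number of bins; because the true $S$ is small when `rho` is close to 1, the relative bias (and hence the shrinkage of the quoted uncertainty) is largest in the highly correlated case, matching the qualitative pattern described above.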

Significance and Claims
The paper claims that the validity of asymptotic approximations (Wilks' theorem) in binned profile-likelihood analyses cannot be assumed solely based on the absolute number of events in data or simulation bins.

Systematic Under-coverage: The authors demonstrate that finite MC statistics introduce a systematic bias that leads to under-coverage, a problem that persists even in high-statistics regimes relevant to current LHC analyses.
Limitations of Standard Corrections: Popular approximations like the Barlow-Beeston "lite" method are shown to be insufficient for correcting this under-coverage because the bias mechanism is more complex than a simple variance rescaling.
Practical Tests: The paper proposes practical tests for experimentalists:
1. Scaling Test: Estimating the asymptotic uncertainty $\bar{\sigma}_H$ by analyzing the scaling of the Hessian uncertainty with the MC sample size (Eq. 48). A significant difference between the finite-sample uncertainty and the extrapolated infinite-sample uncertainty signals the presence of spurious constraints.
2. Lite vs. Full Comparison: Comparing the uncertainty from the BB-lite method against the analytical prediction for the full BB method (Eq. 50) to verify if the lite approximation is adequate.

The authors conclude that while the full Barlow-Beeston method is the theoretically correct approach for finite MC samples, its implementation is often computationally challenging. Therefore, researchers must carefully verify the asymptotic regime of their analyses, particularly when nuisance parameters are profiled, as the "large statistics" assumption may be violated by the interplay between data and finite MC fluctuations.

Under-coverage in high-statistics counting experiments with finite MC samples