Conformal calibration and look-elsewhere effect in… — Plain-Language Explanation

Imagine you are a detective trying to find a specific type of counterfeit coin hidden inside a massive bag of genuine ones. You have a new, high-tech "anomaly detector" (a machine learning model) that gives every coin a "weirdness score." The higher the score, the more likely it is a fake.

The problem is that the detector's scores come with no scale. It gives you a score like "17.5," but that number means nothing on its own. Is 17.5 rare? Is it common? Without a ruler to measure it against, you can't tell if you've found a fake or just a normal coin that happened to look a bit odd.

Furthermore, because the detector scans thousands of coins, it's bound to find a few that look "weird" just by pure luck. If you don't account for how many times you looked, you might think you found a fake when you actually just got lucky.

This paper proposes a new "calibration layer" to fix these problems. Here is how it works, using simple analogies:

1. The Broken Ruler (The Calibration Problem)

Imagine your detector is a scale that tells you how heavy a coin is, but the scale is broken. It says a normal coin weighs 17.5 grams. You don't know if that's heavy or light because you haven't weighed a bunch of known normal coins first to set the baseline.

The authors use a statistical tool called Conformal Prediction to build a new ruler. They take a pile of coins they know are normal (the "calibration set") and see how the detector scores them. Then, they map the detector's raw scores to a p-value.

The Analogy: Instead of saying "This coin is 17.5 weird," the new ruler says, "Only 1% of normal coins look this weird." Now you have a clear, honest number.

2. The "Look-Elsewhere" Trap

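If you scan a whole bag of coins, you will eventually find one that looks slightly unusual just by chance. For example, if 1% of normal coins look "weird," a bag of 1,000 genuine coins will contain about 10 weird-looking ones. So if you scan 1,000 coins, finding one "weird" one isn't a big deal. But if you only looked at one coin, it would be huge news.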

The paper combines its new ruler with a method called the Gross–Vitells correction.

The Analogy: This is like a judge who knows you flipped a coin 1,000 times. If you say, "I got heads 10 times in a row!" the judge doesn't just look at that streak; they look at the whole 1,000 flips. They calculate the odds of getting such a streak anywhere in those 1,000 flips. This prevents you from crying "Fake Coin!" just because you got lucky.

3. The "Sculpting" Scam (The Exchangeability Failure)

This is the paper's biggest discovery. In particle physics, scientists often use "sidebands" (areas next to the target area) to guess what the background looks like. They assume the background in the sidebands is the same as the background in the target area.

The authors found that in many machine learning models, this assumption is false. The model learns to use features that are secretly linked to the location.

The Analogy: Imagine you are looking for a fake coin in a specific jar. To calibrate your detector, you look at coins in a jar next to it. But your detector has learned that "coins in the left jar are usually heavier" and "coins in the right jar are usually lighter." Even if all coins are real, your detector will think the coins in the right jar are "weird" just because they are in the right jar.
The Result: Without fixing this, the detector creates a "ghost signal." In the paper's test, this "ghost" looked like a 46-sigma discovery, far beyond the 5 sigma that particle physicists conventionally require to claim a discovery. It was a complete illusion caused by the detector's bias.

4. The Fix: The "Weighted" Correction

The authors fix this by applying a weight to the calibration.

The Analogy: They realize the "left jar" and "right jar" coins are slightly different. So, when they use the left jar to calibrate the right jar, they give the left-jar coins a "discount" or "adjustment" so they match the right jar's profile.
The Outcome: When they apply this weight, the fake 46-sigma signal disappears completely. It drops to 0.2 sigma, which is just normal background noise. The detector stops lying.

5. The "Fail-Safe" Feature

One of the best things about this method is that it is honest even when things go wrong.

The Analogy: If your calibration coins are secretly contaminated with a few fakes, a standard detector might start screaming "Fake!" with no warning that anything is off. But this new method has a self-check. If the calibration is bad, the "ruler" will look crooked (the p-values won't be uniform). It will say, "Hey, my ruler is broken," rather than giving you a false discovery.

Summary of Results

The authors tested this on a public benchmark dataset designed for LHC (Large Hadron Collider) searches, the LHC Olympics 2020 R&D dataset:

Standard Methods: When they used standard techniques on this data, the detector invented fake signals of 10-sigma or 5-sigma in areas where no signal existed. It was hallucinating discoveries.
The New Method: When they added their calibration layer, those fake signals vanished. The detector correctly reported "No signal found" (a null result).
Real Signals: When they did put a real signal in, the method could still find it (if the signal was strong enough), proving it didn't just "turn off" the detector; it just stopped it from lying.

The Bottom Line:
This paper doesn't invent a new particle detector. Instead, it invents a truth-telling layer that sits on top of any detector. It ensures that when a detector says "We found something," it actually means "We found something," and not just "We got lucky" or "Our math was biased." It turns a raw, confusing score into a defensible, auditable scientific statement.

Technical Summary: Conformal Calibration and Look-Elsewhere Effect in Anomaly Detection for New-Physics Searches

Problem Statement
Machine-learned anomaly detection (AD) has become a primary strategy for searching for physics beyond the Standard Model. However, the statistical interpretation of AD scores has lagged behind their development. A raw anomaly score lacks calibrated meaning; a value does not inherently convey the probability of a background fluctuation. Furthermore, flexible models scanning multiple regions, observables, and latent directions suffer from an acute "look-elsewhere effect" (multiplicity), inflating false discovery rates. Existing experimental workflows rely on asymptotic profile-likelihood formulae and trials factors (e.g., Gross–Vitells theory) that assume a correctly modeled background. These methods are blind to background mismodeling, a failure mode to which AD is particularly prone. When training and evaluation data are shared or when features correlate with the resonant variable (e.g., invariant mass), standard pipelines produce miscalibrated $p$-values, potentially manufacturing false discoveries.

Methodology
The authors propose a calibration layer built on conformal prediction that transforms any anomaly score into a defensible significance with distribution-free, finite-sample guarantees. The methodology proceeds through several key stages:

Split Conformal Calibration: The authors define a one-sided conformal $p$-value, $\hat{p}(s)$, for a test score $s$ based on a calibration set of $n$ background-only scores. This maps raw scores to $p$-values such that, under exchangeability, the $p$-values are super-uniform ($P(\hat{p} \le \alpha) \le \alpha$). This provides a finite-sample guarantee independent of the score distribution's shape.
Addressing Exchangeability Failures: Resonant searches often violate the exchangeability assumption because the background score distribution in the signal region (SR) differs from the sidebands (SB) due to correlations between jet substructure features and the resonant variable (mass).
- Weighted Conformal Prediction: To correct for this covariate shift, the authors employ a weighted conformal $p$-value using a likelihood ratio $w(x) = dQ/dP$ (where $Q$ is the SR distribution and $P$ is the SB distribution). This weight is estimated label-free from the data.
- Mondrian Calibration: For heterogeneity where the background varies across bins of the resonant variable, the authors suggest Mondrian (group-conditional) calibration, which calibrates separately within each bin to ensure local validity.
Robustness to Contamination: The framework addresses signal leakage into control regions. Theorem 5 establishes that if the signal contaminating the calibration set is stochastically larger in score than the background (signal events tend to score higher than background events), the procedure remains valid and becomes conservative, failing safe rather than producing false alarms.
Look-Elsewhere Correction: The local conformal $p$-values are aggregated into a count field $Z(m)$ across scanning windows. The authors apply the Gross–Vitells up-crossing theory to this field to compute a global significance. While the local $p$-values have finite-sample guarantees, the global step is treated as an asymptotic bound, validated against background-only pseudoexperiments.
False Discovery Rate (FDR) Control: For multi-region shortlists, the Benjamini–Hochberg procedure is integrated to control the FDR, leveraging the positive dependence of conformal $p$-values derived from a shared calibration set.
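
The split and weighted conformal steps above can be illustrated with a minimal Python sketch. It uses a commonly quoted form of the one-sided (weighted) conformal $p$-value; the summary does not reproduce the authors' exact definition, so the function name `conformal_p_value`, its parameters, and the toy scores and weights are all assumptions made for illustration. Setting every weight to 1 recovers the unweighted split-conformal case.

```python
from typing import Optional, Sequence


def conformal_p_value(
    test_score: float,
    calib_scores: Sequence[float],
    calib_weights: Optional[Sequence[float]] = None,
    test_weight: float = 1.0,
) -> float:
    """One-sided conformal p-value for one test score (hypothetical helper).

    Larger scores are treated as more anomalous. Without weights this is
    (1 + #{calibration scores >= test score}) / (n + 1). With weights
    w(x) = dQ/dP, each calibration event counts with its weight and the
    test event adds its own weight to numerator and denominator.
    """
    if calib_weights is None:
        calib_weights = [1.0] * len(calib_scores)
    if len(calib_weights) != len(calib_scores):
        raise ValueError("need exactly one weight per calibration score")

    # Total weight of calibration events at least as anomalous as the test event.
    tail = sum(w for s, w in zip(calib_scores, calib_weights) if s >= test_score)
    total = sum(calib_weights)
    return (tail + test_weight) / (total + test_weight)


# Toy sideband (calibration) scores; not taken from the paper.
calib = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.7, 0.6, 0.8]

# Unweighted: only one calibration score is >= 0.85, so p = (1 + 1) / (9 + 1).
print(conformal_p_value(0.85, calib))

# Weighted: toy likelihood-ratio weights that up-weight high-score sideband
# events, as if such events were more typical of the signal region.
weights = [0.5, 0.5, 0.5, 2.0, 0.5, 1.0, 1.0, 1.0, 2.0]
print(conformal_p_value(0.85, calib, weights, test_weight=2.0))  # 4/11
```

In this toy, the weighted $p$-value is larger than the unweighted one because the sideband events that resemble the signal region carry more weight, which is the direction of correction needed when the unweighted calibration is anti-conservative.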

Key Contributions

A Calibration Layer: The paper introduces a modular layer that can be applied to any existing anomaly detector without retraining the detector itself. It converts uncalibrated scores into valid local $p$-values.
Diagnosis and Correction of Exchangeability: The method provides a diagnostic tool (checking for uniformity of background $p$-values) to detect exchangeability failures caused by feature-mass correlations. It offers a label-free weighted correction to restore validity.
Finite-Sample Guarantees: Unlike asymptotic methods, the conformal layer offers rigorous finite-sample validity that is robust to background mismodeling, provided the assumptions (exchangeability or correctable covariate shift) are met.
Integration with Trials Factors: The work bridges the gap between conformal prediction and high-energy physics (HEP) discovery statistics by combining finite-sample local calibration with the Gross–Vitells global significance framework.
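
The uniformity diagnostic listed above can be sketched as a simple tail-rate check: on held-out background events, compare the fraction of $p$-values at or below $\alpha$ with $\alpha$ itself. The sketch below is one plausible way to do this, not the authors' procedure; the helper name `tail_rate_check`, the binomial "pull", and the toy samples are assumptions.

```python
import math
import random
from typing import Sequence, Tuple


def tail_rate_check(p_values: Sequence[float], alpha: float = 0.05) -> Tuple[float, float]:
    """Compare the observed fraction of p-values <= alpha with the nominal alpha.

    Returns (observed_rate, pull), where pull is the deviation from alpha in
    units of the binomial standard error expected for uniform p-values.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    observed_rate = sum(p <= alpha for p in p_values) / n
    std_err = math.sqrt(alpha * (1.0 - alpha) / n)
    return observed_rate, (observed_rate - alpha) / std_err


rng = random.Random(0)  # fixed seed so the toy is reproducible

# Toy stand-ins for background p-values: one calibrated, one anti-conservative.
uniform_p = [rng.random() for _ in range(20_000)]
skewed_p = [rng.random() ** 1.5 for _ in range(20_000)]  # piles up near zero

for name, p_vals in [("uniform", uniform_p), ("skewed", skewed_p)]:
    rate, pull = tail_rate_check(p_vals)
    print(f"{name}: P(p <= 0.05) = {rate:.3f}, pull = {pull:+.1f}")
```

A pull of a few units or less is compatible with calibrated background $p$-values, while a large positive pull signals the anti-conservative behaviour that exchangeability failures produce.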

Results
The methodology was tested on the LHC Olympics 2020 R&D dataset (QCD dijet background with an injected $Z' \to XX$ resonance).

Detection of Miscalibration: On real data, a standard sideband-calibrated classifier exhibited a significant exchangeability failure. The background $p$-values were anti-conservative, with $P(\hat{p} \le 0.05) \approx 0.087$ instead of the nominal 0.05.
Correction of False Excesses:
- A naive counting of events with $p \le 0.05$ in the signal region yielded a spurious $\sim 46\sigma$ excess.
- Applying the label-free weighted correction restored the background rate to nominal, reducing the significance to an honest null ($Z \approx 0.2$).
- In a blind wide-mass scan (retraining the detector in each window), standard asymptotic and unweighted conformal procedures fabricated $\gtrsim 10\sigma$ excesses in signal-free windows. The weighted conformal layer produced no false alarms, with global significances consistent with the null.
Validation of Global Significance: The global false-positive rate of the weighted conformal procedure was verified on background-only pseudoexperiments, showing empirical control near the nominal level.
Signal Recovery: In a positive control study with stronger signal injections ($S/B \approx 1.3\%$) and minimal sideband contamination, the weighted chain recovered a $\sim 7.4\sigma$ global significance, demonstrating that the method corrects systematic biases without suppressing genuine signals.
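
To see why a background rate of 0.087 instead of 0.05 can turn into a very large spurious significance, the sketch below applies a Gaussian approximation to a simple counting test. The two rates come from the results above; the sample sizes, the helper name `counting_significance`, and the choice of approximation are assumptions for illustration, and this is not a reconstruction of the paper's own calculation.

```python
import math


def counting_significance(n_events: int, observed_rate: float, alpha: float = 0.05) -> float:
    """Gaussian-approximation significance of an excess of events with p <= alpha.

    Under a calibrated background the count of such events is
    Binomial(n_events, alpha), with mean alpha * n_events.
    """
    expected = alpha * n_events
    observed = observed_rate * n_events
    sigma = math.sqrt(n_events * alpha * (1.0 - alpha))
    return (observed - expected) / sigma


# Hypothetical sample sizes; the paper's actual event counts are not given here.
for n_events in (1_000, 10_000, 100_000):
    z = counting_significance(n_events, observed_rate=0.087)
    print(f"N = {n_events:>7}: Z = {z:.1f}")
```

Because the apparent significance grows like $\sqrt{N}$ for a fixed rate mismatch, a miscalibration that looks modest per event becomes a tens-of-sigma artifact once the signal region holds many events.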

Significance and Claims
The paper claims to provide an auditable, detector-agnostic path from an uncalibrated anomaly score to a trials-factor-aware global significance.

The primary value is not a new detector, but a calibration and significance layer that makes assumptions explicit and checkable.
It exposes "silent" failures (like background sculpting) that standard asymptotic pipelines miss, converting them into visible non-uniformities or correcting them via weighting.
The authors emphasize that while the local $p$-values have finite-sample guarantees, the global significance relies on asymptotic assumptions (Gross–Vitells) which are empirically validated in their study.
The work highlights that the "look-elsewhere effect" in AD is exacerbated by the multiplicity of regions and the correlation between features and the resonant variable, and that conformal prediction offers a rigorous framework to address these specific failure modes.
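
For orientation, the textbook Gross–Vitells bound for a test statistic $q(m)$ that follows a $\chi^2$ distribution with $s$ degrees of freedom at each fixed $m$ is given below. This summary does not reproduce the form the authors use for the count field $Z(m)$, so the symbols $q$, $c$, $c_0$, $s$, and $N$ are assumptions borrowed from the standard presentation rather than the paper's notation.

$$
P\left(\max_m q(m) > c\right) \le P\left(\chi^2_s > c\right) + \langle N(c) \rangle,
\qquad
\langle N(c) \rangle = \langle N(c_0) \rangle \left(\frac{c}{c_0}\right)^{(s-1)/2} e^{-(c - c_0)/2}
$$

Here $\langle N(c) \rangle$ is the expected number of up-crossings of the level $c$, extrapolated from the up-crossing count measured at a low reference level $c_0$ in a small number of background-only pseudoexperiments. The bound is asymptotic, which is consistent with the global step being validated empirically rather than carrying a finite-sample guarantee.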

The paper concludes that while the method does not solve all background systematics (e.g., unknown unparameterized mismodeling), it significantly improves the reliability of AD searches by ensuring that reported significances are not artifacts of calibration failures. Future work is identified as integrating nuisance parameters (detector systematics) into the conformal framework and comparing this approach directly with mass-decorrelated detectors.
