Model-Agnostic Signal Discovery with Machine Learning:… — Plain-Language Explanation

The Big Picture: Finding a Needle in a Haystack Without Knowing What the Needle Looks Like

Imagine you are a detective looking for a new type of criminal in a massive city.

The Old Way (Model-Dependent): You have a specific suspect in mind. You know they wear a red hat and drive a blue car. You set up roadblocks specifically to catch people with red hats and blue cars. This is very efficient if your suspect is exactly who you think they are. But if the criminal wears a green hat and drives a truck, you will miss them completely.
The New Way (Model-Agnostic): You don't know what the criminal looks like. Instead, you hire a super-smart AI to scan the entire city and flag anything that looks "weird" or "out of place" compared to the normal crowd. This AI doesn't care about red hats or blue cars; it just looks for patterns that don't fit the background noise.

This paper is a guidebook for physicists (specifically those at the Large Hadron Collider) on how to use these "weirdness detectors" (Machine Learning) to find new physics without needing a specific theory to guide them.

The Core Problem: The "Background" Noise

In physics experiments, most data is just "background noise": ordinary events we already understand (like standard particle collisions). Occasionally, a "signal" (a new particle or phenomenon) appears.

The Challenge: The signal is often very faint, hidden inside the noise.
The Limitation: If you only look for specific signals you already predicted, you might miss something totally unexpected.
The Solution: Use AI to learn what "normal" looks like, and then flag anything that breaks the rules of normality.

The Three Main Tools (The "Detectives")

The paper categorizes the new AI methods into three main strategies:

1. The "Two-Sample Test" (The Side-by-Side Comparison)

Analogy: Imagine you have two jars of marbles.

Jar A: Contains marbles from a factory you trust (the "Reference" or "Background").
Jar B: Contains marbles from a new, unknown source (the "Data").
The Method: You use an AI to compare the two jars. It doesn't need to know what a new marble looks like. It just asks: "Are these two jars made of the same stuff?" If the AI finds a significant difference, it sounds the alarm.
The Paper's Example (NPLM): This is like a "Goodness-of-Fit" test. The AI learns to spot the difference between the known background and the new data. It's powerful because it's very flexible, but it requires a very high-quality "Jar A" (an accurate simulation of the background, with its uncertainties accounted for).
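
The jar comparison can be sketched in a few lines of Python. This is a generic classifier-based two-sample test, not the NPLM method itself: a small logistic regression (standing in for a neural network) tries to tell the jars apart, and a permutation test asks how often randomly reshuffled jars look just as different. The jar contents, sample sizes, and function names are illustrative assumptions.

```python
import numpy as np

def fit_logistic(X, y, steps=500, lr=0.1):
    """Logistic regression by gradient descent (a simple stand-in for a neural network)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))       # predicted probability of "Jar B"
        w -= lr * Xb.T @ (p - y) / len(y)       # gradient step on the log-loss
    return w

def held_out_accuracy(jar_a, jar_b, rng):
    """Train on half of the pooled marbles, report accuracy on the other half."""
    X = np.vstack([jar_a, jar_b])
    y = np.concatenate([np.zeros(len(jar_a)), np.ones(len(jar_b))])
    order = rng.permutation(len(X))
    X, y = X[order], y[order]
    half = len(X) // 2
    w = fit_logistic(X[:half], y[:half])
    X_test = np.hstack([X[half:], np.ones((len(X) - half, 1))])
    return np.mean((X_test @ w > 0) == (y[half:] == 1))

def two_sample_test(jar_a, jar_b, n_perm=100, seed=0):
    """Permutation p-value: how often do shuffled jars look as different as the real ones?"""
    rng = np.random.default_rng(seed)
    observed = held_out_accuracy(jar_a, jar_b, rng)
    pooled = np.vstack([jar_a, jar_b])
    null = np.empty(n_perm)
    for i in range(n_perm):
        mixed = pooled[rng.permutation(len(pooled))]
        null[i] = held_out_accuracy(mixed[:len(jar_a)], mixed[len(jar_a):], rng)
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value

rng = np.random.default_rng(1)
jar_a = rng.normal(size=(500, 2))                          # trusted reference (toy data)
jar_b = rng.normal(size=(500, 2)) + np.array([1.0, 0.0])   # same shape, shifted in one feature
accuracy, p_value = two_sample_test(jar_a, jar_b)
print(accuracy, p_value)
```

If the two jars were drawn from the same source, the classifier would do no better than a coin flip and the p-value would be unremarkable. A held-out accuracy well above 50% together with a small p-value is the "alarm".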

2. Outlier Detection (The "Odd One Out" Game)

Analogy: Imagine a crowded party where everyone is wearing a tuxedo.

The Method: You train an AI on photos of people in tuxedos. Then, you show it a new photo. If the photo shows someone in a clown suit, the AI says, "That doesn't look like a tuxedo!"
How it works: The AI learns the "shape" of normal data. If a data point is hard to compress or reconstruct (like trying to squeeze a square peg into a round hole), it gets a high "anomaly score."
The Catch: The paper warns that this depends heavily on how you describe the data. If you change the way you measure things (like switching from inches to centimeters), the AI might think a "normal" person is weird just because of the math, not because they are actually weird.
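
A toy calculation makes the catch concrete. It assumes a background that is perfectly uniform in some variable x, so that a density-based score finds nothing unusual, and then describes the same events with y = x squared instead. The variable names and the choice of transformation are illustrative assumptions.

```python
import numpy as np

def score_in_x(x):
    """Anomaly score = -log density. For x uniform on (0, 1), p_X(x) = 1 everywhere."""
    return -np.log(np.ones_like(x))

def score_in_y(y):
    """Same events described by y = x**2: p_Y(y) = p_X(x) * |dx/dy| = 1 / (2 * sqrt(y))."""
    return -np.log(1.0 / (2.0 * np.sqrt(y)))

x = np.array([0.1, 0.5, 0.9])   # three ordinary background events
y = x ** 2                      # the same three events in the new coordinate

print(score_in_x(x))            # identical scores: nobody stands out
print(score_in_y(y))            # scores now differ: the last event looks "weirdest"
```

Nothing about the events changed between the two printouts. Only the description did, which is why an outlier score on its own cannot be read as evidence of new physics.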

3. Weak Supervision (The "Teacher Without a Textbook")

Analogy: Imagine you want to find counterfeit bills, but you don't have any real counterfeit bills to show your AI. You only have a pile of mixed money.

The Trick: You take two piles of mixed money. You know for a fact that Pile 1 has a slightly higher chance of having a fake bill than Pile 2 (maybe Pile 1 came from a shady vending machine).
The Method: You ask the AI to tell Pile 1 apart from Pile 2. Since the only real difference is the amount of fake bills, the AI is forced to learn what a fake bill looks like to solve the puzzle.
The Paper's Example (Dijet Resonances): In particle physics, they look for a specific "mass" window where a new particle might hide. They train the AI to distinguish the "signal window" from the "side windows" (background). If the AI gets good at this, it has learned to spot the new particle without ever seeing a labeled example of it.
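
The two-pile trick can be tried on toy numbers. In this sketch a simple histogram ratio stands in for the trained classifier, the background is a bell curve around 0, and the hypothetical signal is a narrow bump near 3. The 10% signal fraction, the bin edges, and all function names are assumptions chosen for illustration. The hidden truth labels are used only at the end to check the result, never for training.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixed_sample(n, signal_fraction):
    """Draw n events of one feature; also return hidden truth labels (for checking only)."""
    is_signal = rng.random(n) < signal_fraction
    x = np.where(is_signal, rng.normal(3.0, 0.5, n), rng.normal(0.0, 1.0, n))
    return x, is_signal

x1, truth1 = mixed_sample(20000, 0.10)   # Pile 1: "signal window", slightly enriched
x2, _ = mixed_sample(20000, 0.00)        # Pile 2: "side windows", background only here

# Histogram ratio Pile 1 / Pile 2 as a crude stand-in for a classifier output.
bins = np.linspace(-5.0, 6.0, 45)
h1, _ = np.histogram(x1, bins=bins)
h2, _ = np.histogram(x2, bins=bins)
ratio = (h1 + 1) / (h2 + 1)              # +1 avoids division by zero in empty bins

def score(x):
    """Look up the Pile 1 / Pile 2 ratio in the bin that contains x."""
    idx = np.clip(np.digitize(x, bins) - 1, 0, len(ratio) - 1)
    return ratio[idx]

s = score(x1)
mean_signal_score = s[truth1].mean()
mean_background_score = s[~truth1].mean()
print(mean_signal_score, mean_background_score)
```

The score was built without ever being told which events are signal, yet the true signal events end up with much higher scores than the background events.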

The Pitfalls and How to Avoid Them

The paper spends a lot of time warning us about traps, much like a safety manual for a new machine.

The "Mass Sculpting" Trap:
- The Problem: Sometimes the AI flags events for the wrong reason. If its "weirdness" score is tied to an event's mass, then keeping only the weirdest events reshapes ("sculpts") the mass distribution of perfectly ordinary background, which can create a fake bump where no new particle exists.
- The Fix: You have to "decorrelate" the AI. You force it to ignore certain features (like the mass) while it learns, so it only looks for the shape of the anomaly, not just the weight.
The "Overfitting" Trap:
- The Problem: If you train the AI on the same data you are trying to test, it might just memorize the noise and think it found a signal.
- The Fix: Use "Cross-Validation." Split your data into pieces. Train the AI on Piece A, test it on Piece B. Then switch. This ensures the AI is actually learning patterns, not memorizing the dataset.
The "False Alarm" Problem:
- The Problem: Because these methods look at everything, they might find a "weird" pattern that is just a random fluke (statistical noise).
- The Fix: The paper emphasizes rigorous validation. You must test the AI on "fake data" (simulations) where you know there is no signal. If the AI still screams "Signal!", your method is broken.
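
The cross-validation fix can be sketched as follows. Every event receives a score from a model that never saw that event during training. The `fit` and `score` callables, the toy model, and the choice of five folds are illustrative assumptions, not the procedure of any specific analysis.

```python
import numpy as np

def out_of_fold_scores(X, fit, score, k=5, seed=0):
    """Score each event with a model trained on the other k-1 folds."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(X)) % k       # randomly assign each event to one of k folds
    out = np.empty(len(X))
    for i in range(k):
        held_out = fold == i
        model = fit(X[~held_out])            # train without the held-out fold
        out[held_out] = score(model, X[held_out])
    return out

def fit_toy(X):
    """Toy 'model': per-feature mean and spread of the training events."""
    return X.mean(axis=0), X.std(axis=0)

def score_toy(model, X):
    """Toy anomaly score: largest deviation from the mean, in units of the spread."""
    mean, std = model
    return np.abs((X - mean) / std).max(axis=1)

X = np.random.default_rng(1).normal(size=(1000, 3))
scores = out_of_fold_scores(X, fit_toy, score_toy)
print(scores[:5])
```

Because no model is evaluated on its own training events, a high score cannot come from the model having memorized that particular event.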

What Happens If You Find Something?

If the AI finds a "weird" event, what do you do next?

Don't celebrate yet. You have to figure out why it was weird. Was it a new particle, or was it a glitch in the detector?
Interpretation: The paper suggests using tools to see which features the AI was looking at. Did it flag the event because of its speed? Its shape? This helps physicists understand the nature of the anomaly.
Follow-up: Once you know what the anomaly looks like, you can run a traditional, highly specific search (the "Old Way") to confirm it.
- Crucial Note: You cannot use the same data to both find the anomaly and confirm it. That would be like a detective arresting a suspect based on a hunch and then using that same hunch as proof in court. You need a fresh dataset to confirm the discovery.

Summary

This paper is a "User Manual" for a new generation of physics searches. It tells scientists:

How to build AI that looks for the unknown.
How to avoid fooling yourself with fake signals.
How to prove that what you found is real and not just a glitch.

It bridges the gap between the rigid, theory-driven searches of the past and the flexible, data-driven exploration of the future.

Technical Summary: Model-Agnostic Signal Discovery with Machine Learning

Problem Statement
Searches for new phenomena in high-energy physics (HEP) and related fields are traditionally model-dependent, optimizing analyses for specific hypotheses (e.g., specific particle masses or decay modes). While powerful for targeted scenarios, these methods suffer from limited coverage of the broader space of possible signals, particularly when theoretical guidance is scarce or Monte Carlo simulations are unreliable. Conversely, broad, model-independent approaches often lack the sensitivity of dedicated searches. The field lacks established standards for validating and interpreting new machine learning (ML)-driven, model-agnostic strategies that aim to bridge this gap. This document addresses the need for a conceptual framework, validation protocols, and interpretation strategies for these emerging techniques.

Methodology and Framework
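The paper categorizes model-agnostic search strategies into two primary families based on their statistical formalism and assumptions. The second family splits into outlier detection and weak supervision, which gives the three strategies described above: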

Two-Sample Hypothesis Testing:
- Concept: These methods treat the search as a collective anomaly detection problem, testing whether the observed data distribution ($p_{data}$) differs from a reference background distribution ($p_b$). They do not assume a specific signal model ($p_s$).
- Techniques: The review highlights ML-based classifiers trained to distinguish observed data from reference samples (e.g., Monte Carlo simulations). These classifiers learn a monotonic transformation of the likelihood ratio, effectively approximating the optimal Neyman-Pearson test statistic without a predefined signal hypothesis.
- Case Study (NPLM): The New Physics Learning Machine (NPLM) is presented as a representative example. It performs a Goodness-of-Fit test by learning an alternative hypothesis directly from data as a local deformation of the background. Crucially, NPLM incorporates systematic uncertainties by treating nuisance parameters as part of a composite hypothesis, using profile likelihood-ratio constructions to ensure robustness against mismodeled backgrounds.
Model-Agnostic Signal Selection (Anomaly Detection):
- Concept: These methods function as anomaly detectors, assigning scores to events to identify subsets enriched in signal, rather than performing a full statistical test immediately.
- Outlier Detection: Methods like autoencoders (including variational autoencoders, VAEs) or normalizing flows learn the background distribution $p_b(z)$. Events with high reconstruction error or low likelihood under the learned density are flagged as anomalies. The paper notes fundamental limitations here, such as the lack of invariance under coordinate transformations (the anomaly score changes when the same data is described in different variables) and "complexity bias" (where complex data is scored as anomalous regardless of signal presence).
- Weak Supervision: Techniques like Classification Without Labels (CWoLa) train classifiers to distinguish between two mixed samples ($M_1$ and $M_2$) where the signal fraction differs ($f_1 > f_2$) but the background distribution is identical. The optimal classifier is then a monotonic function of the signal-to-background likelihood ratio $p_s/p_b$, so it ranks events the way a fully supervised classifier would. This is often applied to resonance searches where the signal is localized in a specific mass window, allowing the construction of signal-enriched and background-enriched samples via sideband interpolation.
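
The reason weak supervision works can be written out explicitly. The following is a short derivation under the stated assumptions (identical background in both samples, $f_1 > f_2$), using the symbols above and introducing $L(z) = p_s(z)/p_b(z)$ as shorthand:

$$
\frac{p_{M_1}(z)}{p_{M_2}(z)}
= \frac{f_1\, p_s(z) + (1 - f_1)\, p_b(z)}{f_2\, p_s(z) + (1 - f_2)\, p_b(z)}
= \frac{f_1\, L(z) + 1 - f_1}{f_2\, L(z) + 1 - f_2},
\qquad
\frac{\partial}{\partial L}\left[\frac{f_1 L + 1 - f_1}{f_2 L + 1 - f_2}\right]
= \frac{f_1 - f_2}{\left(f_2 L + 1 - f_2\right)^2} > 0 .
$$

Since the derivative is positive whenever $f_1 > f_2$, the mixed-sample likelihood ratio increases monotonically with $L(z)$. A classifier that optimally separates $M_1$ from $M_2$ therefore orders events exactly as the optimal signal-versus-background classifier would, even though no event was ever labeled as signal.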

Key Contributions and Validation Strategies
The paper provides a comprehensive guide to the validation and interpretation of these methods, emphasizing that standard practices are insufficient for model-agnostic searches.

Validation of the Null Hypothesis:
- The authors detail three complementary strategies to ensure false-positive rates are controlled:
  1. Simulation: Using realistic Monte Carlo samples (with unweighted events to match data statistics) to verify no spurious excesses occur.
  2. Data Control Regions: Testing on data regions assumed to be signal-depleted (e.g., specific kinematic regions orthogonal to the search). The paper acknowledges the risk that unknown signals could contaminate these regions.
  3. Artificial Samples: Using generative models trained on a downsampled signal region to create "pseudo-data" for bias testing (e.g., the DOWN-UP-SAMPLE strategy used by ATLAS).
- The paper highlights the challenge of validating weakly supervised methods, where the training depends on the signal region data, making the algorithm behavior data-dependent and harder to "freeze" prior to unblinding.
Performance Assessment:
- Performance is benchmarked against fully supervised classifiers (the theoretical upper bound) and inclusive search methods.
- The paper notes that weakly supervised methods exhibit performance that scales with signal strength; they may fail to detect anomalies if the signal fraction is too low (as the classifier then overfits to statistical fluctuations between the two background-dominated samples) but approach supervised performance at high signal strengths.
Interpretation and Follow-up:
- Excess Interpretation: Upon finding an excess, the paper suggests using feature distribution comparisons, permutation feature importance, active subspace methods (analyzing classifier gradients), and reweighting functions (in NPLM) to characterize the anomaly.
- Follow-up Searches: A critical distinction is made between follow-up searches on the same dataset (which suffer from an unquantifiable "Look-Elsewhere Effect" and cannot yield a well-calibrated global p-value) and those on independent datasets (which can). The authors recommend pre-defining holdout datasets (20–50% of data) for independent verification.
- Exclusion Limits: Deriving exclusion limits is complex. For outlier detection, models can be released for community reinterpretation. For weakly supervised methods and two-sample tests, the classifier's performance depends on the signal presence in the training data. Reinterpretation requires retraining the classifier with injected signals of varying strengths to map efficiency, a computationally expensive process.
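
The null-hypothesis validation and the look-elsewhere concern above can be illustrated with background-only pseudo-experiments. This is a minimal sketch, assuming a toy analysis with 20 bins, 100 expected background events per bin, and a test statistic equal to the largest local excess; none of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
expected = np.full(20, 100.0)     # assumed background expectation per bin

def max_local_excess(counts):
    """Largest per-bin excess in units of sqrt(expected); scans all bins at once."""
    return np.max((counts - expected) / np.sqrt(expected), axis=-1)

# Calibrate a global 5% threshold on one set of background-only toys.
calibration = max_local_excess(rng.poisson(expected, size=(5000, 20)))
threshold = np.quantile(calibration, 0.95)

# Validate on independent background-only toys: the alarm rate should be close to 5%.
validation = max_local_excess(rng.poisson(expected, size=(5000, 20)))
false_alarm_rate = np.mean(validation > threshold)

# Naive alternative: apply a single-bin 5% threshold (about 1.645 sigma) while scanning 20 bins.
naive_false_alarm_rate = np.mean(validation > 1.645)

print(threshold, false_alarm_rate, naive_false_alarm_rate)
```

The calibrated threshold keeps the false-alarm rate near its nominal value on independent toys, while the naive single-bin threshold fires in a large fraction of background-only experiments. That gap is the look-elsewhere effect in miniature, and it is why the global calibration has to be fixed before looking at the data.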

Results and Case Studies
The paper reviews recent applications by the CMS and ATLAS collaborations in dijet resonance searches:

CMS: Deployed a suite of methods including a Variational Autoencoder (outlier detection) and three weakly supervised strategies (CWoLa Hunting, Tag N' Train, CATHODE). The search demonstrated enhanced sensitivity to specific signal topologies (e.g., boosted top quarks) and identified mass sculpting issues, which were mitigated through feature decorrelation and reweighting.
ATLAS: Utilized SALAD and CURTAINS (weakly supervised) and employed the DOWN-UP-SAMPLE validation strategy to identify biases at low resonance masses that other methods missed.
Performance: In these searches, anomaly detection methods achieved significance improvements of up to a factor of 6 over inclusive searches for specific benchmarks but generally remained a factor of two or more less sensitive than fully supervised classifiers trained on the same signals.
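
As a reading aid for "significance improvement", a common convention in the anomaly detection literature is sketched below. It assumes the significance is approximated by $S/\sqrt{B}$, and the efficiencies in the example are hypothetical numbers, not values taken from the CMS or ATLAS results. If a selection keeps a fraction $\epsilon_S$ of signal events and $\epsilon_B$ of background events, then

$$
\frac{S'}{\sqrt{B'}} = \frac{\epsilon_S\, S}{\sqrt{\epsilon_B\, B}} = \frac{\epsilon_S}{\sqrt{\epsilon_B}} \cdot \frac{S}{\sqrt{B}},
\qquad
\text{e.g. } \epsilon_S = 0.6,\ \epsilon_B = 0.01 \;\Rightarrow\; \frac{\epsilon_S}{\sqrt{\epsilon_B}} = \frac{0.6}{0.1} = 6 .
$$

Under this approximation, an improvement factor greater than 1 requires the selection to reject background much more strongly than it rejects signal, since the background efficiency enters only through its square root.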

Significance and Claims
The paper positions itself as a foundational reference for the "VERaiPHY" initiative, aiming to establish verification and validation standards for AI in physics.

Modest Claims: The authors explicitly state that new physics has not yet been discovered using these methods. Their primary contributions are demonstrating that these approaches can be sensitive to phenomena that conventional searches might miss, and providing a framework for their rigorous validation.
Future Outlook: The document argues that as theoretical guidance remains scarce in certain regimes, the adoption of flexible, model-agnostic approaches will likely grow in collider physics, cosmology, and astrophysics. It emphasizes that while these methods offer broader exploration, they require careful statistical validation to control false discovery rates and robust interpretation strategies to translate anomalies into physical insights. The paper concludes that a trade-off exists between sensitivity and model-agnosticity, and that no single test is uniformly most powerful across all possible alternatives.

Model-Agnostic Signal Discovery with Machine Learning: Bridging the Gap Between Theory and Practice