Searching for Anomalies with Foundation Models

This paper investigates unexpected anomalies in CMS experiment data identified by the OmniLearned foundation model, finding that while background estimates perform well in validation regions, they fail to accurately model the signal region, prompting a call for further scrutiny.

Original authors: Vinicius Mikuni, Benjamin Nachman

Published 2026-03-26

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to find a rare, mysterious criminal in a massive city. The city is full of ordinary people doing normal things (like going to work or shopping), but you suspect there's a hidden gang of spies doing something weird.

This paper is about two detectives (scientists) who tried to find these "spies" (new physics) in a giant dataset of particle collisions from the Large Hadron Collider (CMS experiment). They used a new, high-tech tool called a Foundation Model, which is like a super-smart AI that has read every book in the library to understand how the world usually works.

Here is the story of their investigation, broken down simply:

1. The Setup: The "Omni-Learned" Detective

The scientists used an AI called OmniLearned. Think of this AI as a detective who has studied millions of photos of "normal" traffic. It knows exactly what a normal car, a bicycle, or a pedestrian looks like.

  • The Small Detective: They first tried a smaller version of this AI. It worked great! It found the "Top Quark" (a known particle, like finding a specific type of car) exactly where physics said it should be.
  • The Big Detective: Then, they tried the Large version of the AI, which is much smarter and has seen way more data. They expected it to be even better.

2. The Glitch: The "Weird Noise"

When the Large Detective looked at the data, it found the Top Quark, but it also started screaming about something else. It pointed to a specific area in the data (a "mass sideband") where the numbers didn't look smooth.

  • The Analogy: Imagine you are listening to a radio station playing smooth jazz. Suddenly, the Large Detective hears a strange, rhythmic static in the background. The Small Detective didn't hear it; the Large Detective did.
  • The scientists thought, "Is this a new particle? Or is the AI just confused?"

3. The Investigation: Checking the Evidence

To be sure, the scientists had to do a full forensic audit. They couldn't just trust the AI; they had to prove the "static" wasn't just a glitch in the radio.

  • The Background Check: They used a method called ABCD. Imagine a house with four rooms defined by two independent rules (say, which floor you are on and which side of the house). If those two rules don't influence each other, then counting the people in three of the rooms lets you mathematically predict how many should be in the fourth room (the one with the "anomaly").
  • The Result: In most rooms, the math worked perfectly. The "noise" was just normal background static. But in the specific room the Large Detective pointed to, the math failed. The actual data didn't match the prediction. There was an unexpected "bump" in the data around a mass of 150 GeV.
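The four-room arithmetic above can be sketched in a few lines. This is a minimal toy version of an ABCD-style background estimate, with made-up region counts; the variables and numbers are illustrative assumptions, not the paper's actual analysis.

```python
# Toy ABCD background estimate: two roughly independent selection
# variables split events into four regions. Region D is the "signal
# region"; A, B, and C are control regions used to predict it.

def abcd_prediction(n_a, n_b, n_c):
    """Predicted background count in region D, assuming the two
    region-defining variables are uncorrelated for background:
    N_D = N_B * N_C / N_A."""
    return n_b * n_c / n_a

# Hypothetical control-region counts:
n_a, n_b, n_c = 1000, 200, 150
predicted_d = abcd_prediction(n_a, n_b, n_c)  # 200 * 150 / 1000 = 30.0

# If the observed count in D sits far above this prediction (beyond
# statistical uncertainty), either the independence assumption broke
# down or something unexpected is present in the data.
observed_d = 55
excess = observed_d - predicted_d
```

The whole method rests on that independence assumption, which is why the scientists validated it in other "rooms" first.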

4. The Suspect: The "Double Higgs" Theory

Since the background math didn't fit, the scientists asked: "What if there is a new signal here?"

  • They tested a hypothesis: Could this be Double Higgs Bosons (two Higgs particles created at once)?
  • The Fit: When they added this "Double Higgs" suspect to their model, the messy data suddenly looked much neater. It was like putting the missing puzzle piece in place.
  • The Catch: To make the math work, they had to assume there were 4,000 times more Double Higgs events than the Standard Model of physics predicts. That's like finding a needle in a haystack, but the needle is actually a giant golden statue. It's statistically unlikely to be real, but the pattern is suspicious.
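The idea of "adding a suspect to the model" can be sketched as a toy binned fit: scale a tiny Standard-Model-rate signal template by a strength factor mu on top of a fixed background, and solve for the mu that best matches the data. All templates and counts below are invented for illustration; the paper's actual fit is a full statistical analysis, not this least-squares shortcut.

```python
import numpy as np

# Hypothetical binned yields (4 mass bins). The signal template is
# normalized to a tiny Standard-Model-like rate, so matching a real
# excess forces the fitted strength mu to be enormous.
background = np.array([100.0, 80.0, 60.0, 50.0])
signal_sm = np.array([0.01, 0.03, 0.02, 0.01])   # SM-rate signal shape
data = np.array([130.0, 190.0, 140.0, 90.0])     # observed counts

# Least-squares solution for mu in: data ≈ background + mu * signal_sm
residual = data - background
mu_hat = np.dot(signal_sm, residual) / np.dot(signal_sm, signal_sm)
# mu_hat comes out in the thousands -- the fit only works if the
# signal is vastly more common than the Standard Model allows.
```

A fitted strength thousands of times the predicted rate is exactly the kind of result that says "the shape fits, but the size is implausible."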

5. The Twist: The "Substructure" Clue

The scientists dug deeper. They looked at the "substructure" of the particles (how the energy is packed inside the jets).

  • They found that the weird events had a specific signature: One jet was heavy (around 150 GeV), and the other was also heavy (over 100 GeV).
  • They also checked if these jets contained "bottom quarks" (a specific type of particle). When they filtered for those, the "weirdness" got even stronger.
  • The Confusion: They tried using a different, specialized tool designed specifically to find Double Higgs. Surprisingly, that tool didn't see the same weird events! Only the giant "OmniLearned" AI saw them. This suggests the AI is picking up on a very subtle, strange pattern that human-designed tools miss.
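The jet-by-jet signature described above amounts to a simple event selection. Here is a minimal sketch of such a filter; the event fields, mass windows, and b-tag requirement are hypothetical stand-ins, not the paper's exact cuts.

```python
# Toy event selection mirroring the reported signature: one heavy jet
# near 150 GeV, a second jet above 100 GeV, and at least one b-tag.
# Field names and thresholds are illustrative assumptions.

events = [
    {"jet1_mass": 152.0, "jet2_mass": 118.0, "n_btags": 2},
    {"jet1_mass": 95.0,  "jet2_mass": 40.0,  "n_btags": 0},
    {"jet1_mass": 148.0, "jet2_mass": 105.0, "n_btags": 1},
]

def passes_selection(ev, require_btag=True):
    heavy_pair = (140.0 < ev["jet1_mass"] < 160.0
                  and ev["jet2_mass"] > 100.0)
    tagged = ev["n_btags"] >= 1 or not require_btag
    return heavy_pair and tagged

selected = [ev for ev in events if passes_selection(ev)]
```

Tightening a selection like this (e.g., requiring the b-tag) and watching the anomaly strengthen is what made the pattern look physics-like rather than random.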

6. The Conclusion: "We Don't Know Yet"

The paper ends with a very honest conclusion:

  • The Good News: The Large AI found something the Small AI missed. The background models (our understanding of "normal") can't explain this specific bump in the data.
  • The Bad News: It's probably not a new particle (yet). The numbers are too extreme, and other tools don't see it.
  • The Mystery: The Large AI might be "hallucinating" (seeing patterns that aren't there), or it might be detecting a very subtle flaw in our current physics models.

The Takeaway:
The scientists are inviting the rest of the world to look at this "glitch." They are saying, "We have a strange signal that our best models can't explain. It might be a mistake in our math, or it might be the first hint of something new. Please come help us figure it out!"

It's a reminder that in science, sometimes the most interesting discoveries start with a weird noise that no one else can hear.
