Optimal Transport Event Representation for Anomaly Detection

This paper proposes optimal transport as a physics-based intermediate event representation for weakly supervised anomaly detection. On benchmark LHC datasets, it significantly outperforms both standard high-level observables and end-to-end deep learning on low-level data at detecting rare resonant signals.

Original authors: Tianji Cai, Aditya Bhargava, Benjamin Nachman

Published 2026-03-20

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to find a single, tiny, counterfeit coin hidden inside a massive bag of 100,000 genuine coins. The counterfeit coin is slightly different, but it's so small and the bag is so huge that if you just look at the total weight or the average size of the coins, you might miss it completely.

This is exactly the challenge physicists face at the Large Hadron Collider (LHC). They are smashing particles together to find "new physics" (the counterfeit coin), but it's buried under a mountain of ordinary particle collisions (the genuine coins).

Here is a simple breakdown of what this paper does, using everyday analogies.

1. The Problem: Too Much Noise, Too Little Signal

In the past, physicists tried to find these new signals in two main ways:

  • The "Expert" Way: They looked at specific, pre-chosen features (like the total weight of the bag). This is like checking if the bag is heavier than usual. It works well if the fake coin is heavy, but if the fake coin is just slightly different in shape, the weight check fails.
  • The "AI" Way: They fed the computer raw data (the shape of every single particle) and let a massive AI figure it out. This is like giving the AI a microscope and asking it to scan every single coin. The problem? If the fake coin is extremely rare (less than 1% of the bag), the AI gets confused. It needs a huge amount of training data to learn what "weird" looks like, and it often fails when the signal is very weak.

2. The Solution: "Optimal Transport" (The Moving Truck Analogy)

The authors introduce a new tool called Optimal Transport (OT).

Imagine you have two piles of sand.

  • Pile A is a perfect circle (the background noise).
  • Pile B has a weird bump in it (the signal).

How do you measure how different they are?

  • Old way: You might just measure the height of the highest point or the total volume.
  • The OT way: Imagine you have a fleet of moving trucks. Your job is to move the sand from Pile A to Pile B to make them look identical. You want to do this using the least amount of fuel (effort) possible.
    • If the piles are very similar, you only need to move a little sand a short distance.
    • If the piles are very different (like the bump in Pile B), you have to move a lot of sand a long way, or use a lot of trucks.

The "cost" of this moving job tells you exactly how different the two events are. This method is brilliant because it understands the geometry and shape of the data, not just the numbers.

3. The Innovation: "Linearizing" the Moving Trucks

Calculating the exact "moving cost" for every single particle collision is incredibly slow and computationally expensive (like trying to plan a route for millions of trucks simultaneously).

The authors' big breakthrough was linearization.
Instead of solving the full moving problem between every possible pair of collisions, they solve it just once per collision, against a single fixed reference shape, creating a "shortcut map." This flattens each complex particle collision into a simpler, structured list of numbers (a vector) that preserves most of the important shape information, and comparing two vectors directly becomes a cheap stand-in for the expensive pairwise calculation.

Think of it like this: Instead of trying to memorize the entire layout of a city to find a house, you just need a simple set of coordinates (Latitude/Longitude) that gets you there. The "OT representation" is that coordinate system for particle collisions.
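Here is a minimal sketch of the linearization idea, under simplifying assumptions: uniform weights, a random Gaussian reference cloud, and the same small LP solver as above. The names `ot_plan` and `embed_event` are illustrative, not the paper's code. Each variable-size event becomes one fixed-length vector, and plain Euclidean distance between vectors then approximates the OT distance.

```python
# A minimal sketch of linearized OT: embed each event as a fixed-length
# vector by transporting a shared reference cloud onto it and recording
# where each reference point's mass ends up (barycentric projection).
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def ot_plan(x, y):
    """Optimal transport plan between uniformly weighted clouds x and y."""
    n, m = len(x), len(y)
    cost = cdist(x, y) ** 2                       # squared-distance ground cost
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0          # each row ships out 1/n
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                   # each column receives 1/m
    b_eq = np.concatenate([np.full(n, 1 / n), np.full(m, 1 / m)])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(n, m)

def embed_event(event, reference):
    """Flatten an event into the transported positions of the reference."""
    plan = ot_plan(reference, event)
    # Barycentric projection: average destination of each reference point.
    transported = (plan @ event) / plan.sum(axis=1, keepdims=True)
    return transported.ravel()                    # fixed-length "coordinates"

rng = np.random.default_rng(1)
reference = rng.normal(size=(16, 2))              # shared "map origin"
ev1 = rng.normal(size=(25, 2))                    # events of different sizes...
ev2 = rng.normal(size=(30, 2))

v1, v2 = embed_event(ev1, reference), embed_event(ev2, reference)
# ...become same-length vectors; their Euclidean distance now stands in
# for the expensive pairwise OT distance.
print(np.linalg.norm(v1 - v2))
```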

4. The Results: Finding the Needle in the Haystack

The team tested this on the "LHC Olympics" datasets (a standard test for particle physics AI).

  • The Setup: They injected a tiny signal (0.5% of the data) mimicking a hypothetical new particle; a code sketch of this weakly supervised setup follows the list below.
  • The Competition: They compared their new "OT features" against:
    1. Standard physics measurements (the "Expert" way).
    2. Massive, pre-trained AI models that look at raw data (the "AI" way).
  • The Winner: The OT method crushed the competition in the low-signal regime.
    • It found the signal twice as well as the standard expert measurements.
    • It found the signal better than the massive AI models, even though the OT method was much simpler and required less computing power.
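To make the "weakly supervised" part concrete, here is a minimal sketch assuming a CWoLa-style setup: a simple classifier learns to separate a signal-enriched sample from a background-only reference sample using a handful of OT-derived features, without ever seeing truth labels. The synthetic features, the exaggerated signal shift, and the sample sizes are placeholders, not the LHC Olympics data.

```python
# A minimal sketch of weak supervision: train a classifier to tell a
# mixed (mostly-background) sample from a pure-background sample.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(2)
n_bkg, n_sig = 100_000, 500                       # ~0.5% signal injection

# Pretend these are the top few linearized-OT features per event;
# the 1.5 shift is exaggerated so the toy effect is easy to see.
bkg = rng.normal(0.0, 1.0, size=(n_bkg, 5))
sig = rng.normal(1.5, 1.0, size=(n_sig, 5))

mixed = np.vstack([bkg[: n_bkg // 2], sig])       # "signal region": mostly bkg
reference = bkg[n_bkg // 2:]                      # "sideband": bkg only

X = np.vstack([mixed, reference])
y = np.concatenate([np.ones(len(mixed)), np.zeros(len(reference))])
bdt = HistGradientBoostingClassifier(max_iter=200).fit(X, y)

# Truth labels are used below only to check the sketch, never for
# training: true signal events should score higher than background.
fresh_bkg = rng.normal(0.0, 1.0, size=(1_000, 5))
fresh_sig = rng.normal(1.5, 1.0, size=(1_000, 5))
print("avg score, background:", bdt.predict_proba(fresh_bkg)[:, 1].mean())
print("avg score, signal:    ", bdt.predict_proba(fresh_sig)[:, 1].mean())
```

Cutting on the classifier score keeps the events that look most "signal-region-like," enhancing the rare signal even though no event was ever individually labeled.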

5. Why It Matters

The most surprising part is that they didn't need all the complex data. They only needed the top 3 to 5 numbers from their new OT map to get the best results.
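One standard way to boil a high-dimensional embedding down to its leading few numbers is principal component analysis (PCA). Whether the paper uses exactly this reduction is an assumption here, so treat this as an illustrative sketch of the mechanics, with placeholder data.

```python
# A hypothetical sketch of compressing per-event OT vectors down to
# their first few components; random placeholder data, not the paper's.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
embeddings = rng.normal(size=(10_000, 32))        # one OT vector per event

pca = PCA(n_components=5).fit(embeddings)
compact = pca.transform(embeddings)               # top 5 numbers per event
print(compact.shape)                              # (10000, 5)
```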

  • The Lesson: You don't always need a bigger, smarter AI. Sometimes, you just need a better way to describe the data.
  • The Bridge: This method acts as a perfect bridge. It takes the raw, messy data from the collider and turns it into a clean, structured format that even simple, fast machine-learning models (like Boosted Decision Trees) can use effectively.

Summary

This paper says: "Stop trying to brute-force the problem with massive AI models or relying on old-school measurements. Instead, use a physics-based 'moving cost' map to describe the shape of the collision. It's simpler, faster, and much better at finding the tiny, rare signals that we are desperate to discover."

It's like realizing that to find a lost key in a dark room, you don't need a supercomputer scanning every inch; you just need a flashlight that, guided by the shape of the room, points exactly where to look.
