Event Tokenization and Masked-Token Prediction for… — Plain-Language Explanation

Original authors: Ambre Visive, Polina Moskvitina, Clara Nellist, Roberto Ruiz de Austri, Sascha Caron

Published 2026-01-28

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine the Large Hadron Collider (LHC) as a massive, high-speed car crash simulator. Every second, it smashes particles together, creating a chaotic explosion of debris. Physicists are looking for a very specific, rare type of crash—like finding a specific, unusual scratch on a car that only happens if a secret, invisible force is at play. This is the "signal."

The problem is that most crashes look very similar to each other. They are the "background noise." In this paper, the authors are trying to find a needle in a haystack without knowing exactly what the needle looks like beforehand.

Here is how they did it, using a clever trick borrowed from how computers learn to read and write.

1. Turning Physics into a Language

The authors realized that the data from these particle crashes could be treated like a sentence in a language.

The "Words": Instead of letters, the "words" (or tokens) are the particles flying out of the crash. Some are jets of energy, some are electrons, some are muons.
The "Sentence": A single crash event is a sentence made of up to 18 of these "words," plus two extra numbers describing the missing energy and its direction (like a missing piece of the puzzle).

To make this work for a computer, they had to translate these physical particles into a code the machine understands. They created a system where every particle type and its speed/direction gets assigned a specific number, turning a complex physics event into a simple list of numbers, like [3, 1, 5, 2, ...].

2. The "Fill-in-the-Blanks" Game

The team used a small transformer network, the same kind of technology behind the Large Language Models (LLMs) that power chatbots. However, they didn't teach it to write stories. Instead, they taught it to play a game of "Fill-in-the-Blanks" using only the "background" crashes (the common, boring ones).

The Training: They showed the AI thousands of normal crashes but hid one "word" (particle) in each sentence. The AI had to guess what that missing particle was based on the rest of the sentence.
The Goal: The AI learned the "grammar" of normal particle crashes. It learned, for example, "If I see a heavy jet here, I usually expect a specific type of electron there."

3. Spotting the Anomaly

Once the AI became an expert at predicting the "normal" crashes, they tested it on new data, including the rare "signal" crashes they were looking for.

The Test: They hid a particle in a crash event and asked the AI to guess it.
The Result: When the AI looked at a normal crash, its guesses tended to be good. But when it looked at the rare, strange "four-top-quark" crash, it was confused more often. Because this rare event didn't follow the "grammar" of the normal background as closely, more of the AI's guesses were wrong.
The Alarm: The more wrong the AI was, the more likely it was that the event was an anomaly (the signal they wanted).

4. How Well Did It Work?

The authors tested this method on a search for "four-top-quark" production (a very rare event where four heavy particles are created at once).

The Score: They measured how well the AI could separate the "normal" crashes from the "rare" ones. They got a score (called ROC-AUC) of 0.67, on a scale where 0.5 means random guessing and 1.0 means perfect separation.
The Comparison: They compared their method to other established ways of finding anomalies.
- It didn't beat the very best existing method (called DDD).
- However, it did better than two other common methods (DeepSVDD and DROCC).
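
To make the score more concrete: ROC-AUC can be read as the chance that a randomly picked rare crash gets a higher "wrongness" score than a randomly picked normal crash. The short Python sketch below computes it by comparing every pair directly. The function name `roc_auc` and the toy scores are made up for illustration; they are not the paper's code or data.

```python
def roc_auc(background_scores, signal_scores):
    """Fraction of (signal, background) pairs where the signal event
    has the higher anomaly score; ties count as half a win."""
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(signal_scores) * len(background_scores))


# Toy anomaly scores (illustrative values only).
background = [0.2, 0.4, 0.5, 0.7]
signal = [0.3, 0.6, 0.8]
auc = roc_auc(background, signal)
print(f"ROC-AUC = {auc:.2f}")
```

With these toy numbers, 8 of the 12 signal/background pairs are ordered correctly, which happens to land close to the value reported in the paper.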

The Bottom Line

The paper claims that treating particle physics data like a language and using a "fill-in-the-blanks" AI is a promising new way to find rare, unknown physics events. While it isn't the perfect solution yet, it picked up subtle differences between the rare events and the background, suggesting that this "language-based" approach could be a valuable tool for future discoveries at the LHC.

Technical Summary: Event Tokenization and Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider

Problem Statement
The paper addresses the challenge of identifying rare, Beyond the Standard Model (BSM) signatures in high-energy physics data without prior knowledge of the signal characteristics. Specifically, the authors focus on the search for simultaneous four-top-quark ($t\bar{t}t\bar{t}$) production at the Large Hadron Collider (LHC). This process is difficult to isolate because its final state (0–4 leptons, 4–12 jets, including four $b$-jets) closely resembles complex Standard Model (SM) backgrounds such as $t\bar{t}WW$, $t\bar{t}W$, $t\bar{t}Z$, and $t\bar{t}H$. The authors propose using Large Language Models (LLMs) as unsupervised anomaly detectors to learn the distribution of background events and flag deviations that may indicate new physics.

Methodology
The proposed approach utilizes a lightweight, encoder-based transformer network trained via masked-token prediction, a technique adapted from natural language processing (specifically BERT).

Dataset and Preprocessing:
- The study uses simulated $pp$ collision data at $\sqrt{s} = 13$ TeV from the Dark Machines challenge, generated with MG5_aMC@NLO, hadronized with Pythia 8, and processed through Delphes 3.
- Events are represented as sequences of up to 18 particle objects (jets, leptons, photons) plus missing transverse energy ($E_T^{\text{miss}}$) and its azimuthal angle ($\phi_{E_T^{\text{miss}}}$).
- Background processes ($t\bar{t}H$, $t\bar{t}W$, $t\bar{t}WW$, $t\bar{t}Z$) constitute the training set, while $t\bar{t}t\bar{t}$ serves as the signal for evaluation.
Tokenization Strategy:
- A critical component of the method is the conversion of continuous kinematic variables into discrete tokens.
- Particle types are mapped to 7 predefined categories.
- Kinematic variables ($p_T$, $\eta$, $\phi$, $E_T^{\text{miss}}$, $\phi_{E_T^{\text{miss}}}$) are binned. The optimal configuration divides $p_T$, $\eta$, and $E_T^{\text{miss}}$ into 4 bins (each containing 25% of background data) and $\phi$ and $\phi_{E_T^{\text{miss}}}$ into 4 bins of width $\pi/4$.
- These bins are combined into a unique integer token for each particle, so that the $7 \times 4 \times 4 \times 4 = 448$ combinations of type, $p_T$ bin, $\eta$ bin, and $\phi$ bin give $\text{token}_{\text{part}} \in [1, 448]$. The missing energy components get their own ranges: $\text{token}_{E_T^{\text{miss}}} \in [449, 452]$ and $\text{token}_{\phi_{E_T^{\text{miss}}}} \in [453, 456]$.
- Events are padded to a fixed sequence length of 18 particles plus the energy tokens.
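
One way the binning and token ranges above could fit together is sketched below in Python. The summary does not give the exact index formula, the bin edges, the ordering of particle types, or the padding value, so `quantile_edges`, `particle_token`, the padding id `0`, the toy bin edges, and the equal-width $\phi$ bins over $[-\pi, \pi)$ are all assumptions, chosen only so that the resulting ids fall in $[1, 448]$, $[449, 452]$, and $[453, 456]$.

```python
import bisect
import math

N_TYPES, N_BINS, MAX_PARTICLES = 7, 4, 18
PAD = 0  # assumed padding id; not specified in the summary


def quantile_edges(values, n_bins=N_BINS):
    """Interior edges so each bin holds roughly 1/n_bins of `values`."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]


def bin_index(x, edges):
    """0-based bin of x given sorted interior edges."""
    return bisect.bisect_right(edges, x)


def phi_bin(phi, n_bins=N_BINS):
    """Equal-width bin of an angle, assuming phi lies in [-pi, pi)."""
    frac = (phi + math.pi) / (2 * math.pi)
    return min(int(frac * n_bins), n_bins - 1)


def particle_token(type_id, pt, eta, phi, pt_edges, eta_edges):
    """Hypothetical mixed-radix encoding into 1..448 (7 * 4 * 4 * 4)."""
    idx = type_id  # 0..6
    idx = idx * N_BINS + bin_index(pt, pt_edges)
    idx = idx * N_BINS + bin_index(eta, eta_edges)
    idx = idx * N_BINS + phi_bin(phi)
    return idx + 1


def tokenize_event(particles, met, met_phi, pt_edges, eta_edges, met_edges):
    """particles: list of (type_id, pt, eta, phi). Returns 20 token ids."""
    tokens = [particle_token(*p, pt_edges, eta_edges)
              for p in particles[:MAX_PARTICLES]]
    tokens += [PAD] * (MAX_PARTICLES - len(tokens))
    tokens.append(449 + bin_index(met, met_edges))
    tokens.append(453 + phi_bin(met_phi))
    return tokens


# Toy inputs (illustrative values, not taken from the dataset).
pt_edges = quantile_edges([20, 35, 50, 80, 120, 200, 310, 450])
eta_edges = [-1.0, 0.0, 1.0]
met_edges = [50.0, 100.0, 200.0]
event = [(0, 150.0, 0.4, 1.2), (3, 40.0, -1.7, -2.5)]
tokens = tokenize_event(event, 85.0, 0.3, pt_edges, eta_edges, met_edges)
print(tokens)
```

In a real pipeline the $p_T$, $\eta$, and $E_T^{\text{miss}}$ edges would be derived from the background sample so that each bin holds 25% of it; here only the $p_T$ edges are computed that way, from a tiny toy list.
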
Model Architecture and Training:
- The model consists of two transformer layers with four self-attention heads each, followed by a linear projection and a softmax layer.
- Training: The model is trained exclusively on background events using a masked-token prediction objective. One token per event is randomly masked, and the model learns to reconstruct it using Sparse Categorical Cross-Entropy loss.
- Inference: During testing, each token in an event is masked in turn and reconstructed, one at a time. The average reconstruction score (loss) over these predictions is calculated for each event and serves as its anomaly score.
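
The inference loop described above (hide one position at a time, score the hidden token, average over the event) can be sketched in Python with a stand-in for the transformer. `FrequencyModel` below is a hypothetical placeholder that only counts how often each token appears at each position in background events; it is not the paper's model, and the `MASK` id, the vocabulary size of 457, and the toy events are likewise assumptions. The part that mirrors the text is `anomaly_score`.

```python
import math
from collections import Counter

MASK = -1  # assumed mask id, outside the 0..456 token range


class FrequencyModel:
    """Stand-in for the transformer: per-position token frequencies
    learned from background events, with add-one smoothing."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = {}

    def fit(self, events):
        for event in events:
            for pos, tok in enumerate(event):
                self.counts.setdefault(pos, Counter())[tok] += 1
        return self

    def prob(self, masked_event, pos, token):
        """P(token at pos | rest of event). This toy ignores the context;
        a trained transformer would condition on the unmasked tokens."""
        c = self.counts.get(pos, Counter())
        return (c[token] + 1) / (sum(c.values()) + self.vocab_size)


def anomaly_score(model, event):
    """Mean cross-entropy over positions, masking one token at a time."""
    total = 0.0
    for pos, true_tok in enumerate(event):
        masked = list(event)
        masked[pos] = MASK
        total += -math.log(model.prob(masked, pos, true_tok))
    return total / len(event)


# Toy background events, shortened to 4 tokens for readability.
background = [[5, 9, 450, 455], [5, 9, 450, 454], [5, 8, 450, 455]]
model = FrequencyModel(vocab_size=457).fit(background)
typical = anomaly_score(model, [5, 9, 450, 455])
unusual = anomaly_score(model, [300, 77, 452, 453])
print(typical < unusual)
```

An event built from tokens never seen in the background receives a higher average loss than a background-like one, which is the behaviour the method relies on.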

Key Contributions

Novel Application: The paper introduces the use of LLM-like architectures for unsupervised anomaly detection in collider physics, treating particle events as sequences of tokens.
Tokenization Scheme: It proposes a specific binning and encoding strategy to transform continuous particle physics data into a format suitable for transformer-based models.
Model-Independent Search: The method operates without signal knowledge, relying solely on the reconstruction performance of background events to identify anomalies.

Results

Performance on Four-Top Search: When applied to the $t\bar{t}t\bar{t}$ signal, the model achieved a Receiver Operating Characteristic Area Under the Curve (ROC-AUC) of 0.67.
Distribution Overlap: The reconstruction score distributions for background and signal events showed a common area of 70.85%, indicating substantial overlap: the model separates the two classes only partially.
Comparison: The proposed method was compared against established unsupervised methods (DDD, DeepSVDD, and DROCC) from the Dark Machines challenge. The results indicate that while the LLM-based approach did not surpass the DDD-based techniques, it demonstrated improved performance over DeepSVDD and DROCC, positioning it as a competitive unsupervised anomaly detection technique.
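
The summary does not say how the common area of the two score distributions was computed. A frequent choice is the histogram intersection: bin both sets of scores on a shared grid, normalize each histogram to unit area, and sum the bin-wise minimum. The Python sketch below follows that assumption; `common_area`, the number of bins, and the toy scores are illustrative, not taken from the paper.

```python
def common_area(background_scores, signal_scores, n_bins=10):
    """Histogram intersection of two score samples on a shared grid.
    Returns a value in [0, 1]; 1 means identical histograms."""
    lo = min(min(background_scores), min(signal_scores))
    hi = max(max(background_scores), max(signal_scores))
    width = (hi - lo) / n_bins or 1.0  # guard against identical scores

    def normalized_hist(scores):
        counts = [0] * n_bins
        for s in scores:
            counts[min(int((s - lo) / width), n_bins - 1)] += 1
        return [c / len(scores) for c in counts]

    p = normalized_hist(background_scores)
    q = normalized_hist(signal_scores)
    return sum(min(a, b) for a, b in zip(p, q))


# Toy scores: three of four background events score low,
# three of four signal events score high.
overlap = common_area([0.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 1.0], n_bins=2)
print(f"common area = {overlap:.0%}")
```

The result depends on the binning, so a value such as 70.85% is only comparable across methods when the same grid is used.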

Significance and Claims
The authors characterize the results as preliminary but promising. They claim that the approach successfully captures subtle discrepancies in collider data and offers a flexible, token-based representation for model-independent searches. The paper suggests that with further optimization of the tokenization scheme and model architecture, this method could become a viable candidate for improving sensitivity to rare Standard Model processes and uncovering new physics signatures in future high-energy physics analyses. The work does not claim to have outperformed all existing methods but highlights the potential of adapting transformer architectures to the specific structural challenges of particle physics data.

Event Tokenization and Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider