Neural Fake Factor Estimation Using Data-Based Inference

This paper proposes a neural network-based method for estimating fake lepton backgrounds in high-energy physics. It performs density ratio estimation in a high-dimensional feature space, giving a continuous and more flexible alternative to traditional binned histogram techniques that reduces binning artifacts and improves extrapolation.

Original authors: Jan Gavranovič, Lara Čalić, Jernej Debevc, Else Lytken, Borut Paul Kerševan

Published 2026-01-29

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to solve a mystery at a massive, chaotic party (the Large Hadron Collider). Your goal is to find a very specific, rare guest (a "signal" particle) who is hiding in the crowd. However, the party is full of look-alikes and impostors (background noise) who are dressed almost exactly like your target.

In the world of particle physics, these impostors are called "fake leptons." They are particles that look like the real thing to the detectors but actually came from a different, messy source (like a secondary decay or a misidentified jet). If you count these fakes as real, you might think you found your rare guest when you actually didn't.

The Old Way: The "Grid" Method

Traditionally, physicists have estimated how many of these impostors are in the room using a method called the Fake Factor.

Think of this like trying to guess how many people in a crowd are wearing red hats, but you can't see everyone clearly.

The Control Room: You go to a section of the party where you know almost everyone is wearing a red hat (a "loose" selection). You count them.
The Signal Room: You want to know how many red hats are in the VIP area (the "tight" selection), but you can't look directly there yet because you don't want to bias your search.
The Grid: To make the guess, the old method divides the party into a giant grid of boxes (bins). For every box, it divides the number of impostors that pass the strict ("tight") check by the number that pass the relaxed ("loose") check to get a "Fake Factor" (a conversion rate).
The Problem: This grid is rigid.
- If the boxes are too big, you miss the details (like how the hat-wearing changes near the DJ).
- If the boxes are too small, some end up empty, and your math breaks.
- You can only use a few variables (like "where they are standing" and "how tall they are"). If you try to add more details (like "what they are holding" or "how fast they are dancing"), the grid becomes too crowded with empty boxes to be useful.
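The grid idea can be sketched in a few lines of NumPy. This is a toy illustration only: the samples are randomly generated stand-ins for fake-only "tight" and "loose" events, and the names `binned_fake_factor`, `lookup`, `coarse_edges`, and `fine_edges` are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumption): lepton pT in GeV for fake-only loose and tight samples.
pt_loose = rng.exponential(scale=30.0, size=5000) + 20.0
pt_tight = rng.exponential(scale=25.0, size=1000) + 20.0


def binned_fake_factor(tight, loose, edges):
    """Per-bin ratio N_tight / N_loose; NaN where the loose bin is empty."""
    n_tight, _ = np.histogram(tight, bins=edges)
    n_loose, _ = np.histogram(loose, bins=edges)
    ff = np.full(len(edges) - 1, np.nan)
    filled = n_loose > 0
    ff[filled] = n_tight[filled] / n_loose[filled]
    return ff


def lookup(ff, edges, values):
    """Give every event the fake factor of the bin it falls into."""
    idx = np.clip(np.digitize(values, edges) - 1, 0, len(ff) - 1)
    return ff[idx]


coarse_edges = np.array([20.0, 40.0, 80.0, 500.0])  # 3 wide bins
fine_edges = np.linspace(20.0, 500.0, 97)           # 96 bins, 5 GeV each

ff_coarse = binned_fake_factor(pt_tight, pt_loose, coarse_edges)
ff_fine = binned_fake_factor(pt_tight, pt_loose, fine_edges)

print("coarse fake factors:", ff_coarse)
print("undefined fine bins:", int(np.isnan(ff_fine).sum()))
# Events at 41 GeV and 79 GeV share one coarse bin, so they get the same factor.
print(lookup(ff_coarse, coarse_edges, np.array([41.0, 79.0])))
```

With the coarse edges every event in a wide bin receives the same factor, while the fine edges leave the sparsely populated high-pT bins undefined. Adding a second or third variable multiplies the number of bins and makes the empty-bin problem worse.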

The New Way: The "AI Detective"

The authors of this paper propose a new method using Machine Learning (Neural Networks) to replace the rigid grid.

Instead of chopping the party into boxes, they train a smart AI to look at every single guest individually.

Learning the Pattern: The AI is shown thousands of examples of "real" particles and "fake" particles. It learns the complex, subtle differences between them, not just based on two or three traits, but based on a whole bunch of details at once (speed, position, energy, number of nearby jets, etc.).
The "Density Ratio": The AI learns to answer a specific question for every single event: "If I see a fake with these exact features, how likely is it to show up in the 'tight' zone compared to the 'loose' zone?"
The Result: Instead of a single number for a whole box, the AI gives a smooth, continuous score for every single particle. It's like having a personal guide for every guest telling you exactly how suspicious they are, rather than just saying "everyone in this room is suspicious."

How They Tested It

The team tested this new AI detective on a real dataset from the ATLAS experiment (using "Open Data," which is like a public archive of particle collision data).

The Setup: They looked for a specific particle decay ( $W \to e\nu$ ).
The Comparison: They ran the old "Grid" method and the new "AI" method side-by-side.
The Findings:
- In the Control Zone: Both methods worked well, but the AI was smoother. It didn't have the jagged, "stair-step" look of the grid method.
- In the Signal Zone (The VIP Area): This is where the AI shone. When they tried to guess the number of fakes in the VIP area based on the data from the general crowd, the old grid method stumbled. It made big jumps and errors because the grid was too coarse to handle the complex changes in the data. The AI, however, handled the transition smoothly and accurately, capturing subtle patterns the grid missed.

The Bottom Line

This paper claims that by swapping a rigid, box-based counting system for a flexible, AI-driven approach, physicists can:

See more clearly: They can use many more variables at once without running out of data.
Be smoother: They avoid the "jagged" errors caused by empty boxes in a grid.
Be more accurate: They can predict background noise in rare, difficult-to-reach areas of the data much better than before.

Essentially, they replaced a blunt instrument (a ruler with big markings) with a high-precision laser scanner (the AI) to count the impostors, allowing them to find the real rare guests with much greater confidence.

Technical Summary: Neural Fake Factor Estimation Using Data-Based Inference

Problem Statement
In high-energy physics (HEP) analyses, "fake" backgrounds arise from events that should not satisfy the signal selection criteria but are accepted anyway because of mis-reconstructed or mis-identified particles, such as non-prompt leptons or hadronic jets mistaken for leptons. Traditionally, these backgrounds are estimated using data-driven techniques, most notably the Fake Factor method. This method extrapolates the fake lepton contribution from a kinematically adjacent, looser selection region (Control Region, CR) to the Signal Region (SR) using a scale factor (the "fake factor").

The conventional implementation of this method relies on binned estimation, where the fake factor is calculated as the ratio of two histograms (tight vs. loose selections) in a low-dimensional space (typically transverse momentum $p_T$ and pseudorapidity $\eta$ ). This approach faces several limitations:

Binning Artifacts: The choice of binning significantly impacts results; coarse bins lose kinematic features, while fine bins suffer from statistical fluctuations, empty bins, or negative values.
Dimensionality Limits: Due to limited statistics, the method is typically restricted to a few variables, preventing the capture of complex correlations with other event topology variables (e.g., missing transverse energy $E^{miss}_T$ or jet multiplicity).
Extrapolation Uncertainty: Discontinuities caused by binning and the inability to model high-dimensional dependencies degrade the accuracy of extrapolating background estimates to the signal region.
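
In symbols, the binned estimate described above is usually written along the following lines. This is the textbook form of the method rather than an equation quoted from the paper; the bin index $i$ and the region superscripts are notation introduced here for illustration.

$$
F_i = \frac{N^{data}_{T,i} - N^{MC}_{T,i}}{N^{data}_{L,i} - N^{MC}_{L,i}}, \qquad
N^{fake,\,SR}_{T,i} \approx F_i \left( N^{data,\,SR}_{L,i} - N^{MC,\,SR}_{L,i} \right)
$$

Here $T$ and $L$ denote the tight and loose selections, $F_i$ is measured in the control region, and the MC terms remove the real (prompt) lepton contribution. A bin whose denominator is zero or negative leaves $F_i$ undefined or unphysical, which is the source of the empty-bin and negative-value problems listed above.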

Methodology
The authors propose a novel Machine Learning (ML)-based Fake Factor method that replaces histogramming with neural density ratio estimation. This approach, termed Data-Based Inference (DBI), estimates a continuous, unbinned fake factor function on a per-event basis.

The method is structured in two primary steps:

Subtraction Step (Real Lepton Removal):
Since the fake factor must be derived from fake leptons only, the contribution of real (prompt) leptons must be subtracted from both the tight and loose data samples. The authors train two independent binary classifiers to estimate the ratio of data to Monte Carlo (MC) simulation in the tight and loose regions separately ( $r_{T,L} = N^{data}/N^{MC}$ ).
- These classifiers are trained to distinguish data events (label 1) from MC events (label 0).
- The output is used to reweight data events (or MC events) to obtain "real-subtracted" densities.
- To ensure physical validity (positive weights), a soft absolute activation function is applied to the classifier's logit output, ensuring the ratio $r > 1$ and the resulting weights remain positive.
Ratio Step (Fake Factor Estimation):
A third binary classifier is trained to distinguish between the tight (numerator) and loose (denominator) real-subtracted samples.
- The training dataset consists of reweighted events from both regions.
- The classifier learns the likelihood ratio between the two hypotheses.
- The final fake factor $F(x)$ for an event with features $x$ is estimated as the exponential of the classifier's output: $F(x) = \exp(q(x))$ .
- This yields a continuous function dependent on a high-dimensional feature space (e.g., $p_T, \eta, E^{miss}_T, N_{jets}, m_T$ ).
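
The ratio step can be illustrated with a minimal sketch of the classifier-based density ratio trick. It rests on several assumptions: a one-dimensional toy feature, Gaussian samples standing in for the real-subtracted tight and loose events, and a plain logistic regression in place of the neural network. The helper `subtraction_weight` uses the identity $n^{data} - n^{MC} = n^{data}(1 - 1/r)$ as one possible way to turn the data/MC ratio $r$ into a per-event weight; the paper may implement this differently. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins (assumption): tight ~ N(0.5, 1), loose ~ N(0, 1), equal sample sizes.
x_tight = rng.normal(0.5, 1.0, size=20000)
x_loose = rng.normal(0.0, 1.0, size=20000)

x = np.concatenate([x_tight, x_loose])
y = np.concatenate([np.ones_like(x_tight), np.zeros_like(x_loose)])  # 1 = tight, 0 = loose
w = np.ones_like(x)  # placeholder for the real-subtracted per-event weights


def subtraction_weight(r):
    """Weight turning a data event into a 'data minus MC' event, given r = data/MC.

    Positive only when r > 1, which is why the ratio is constrained from below.
    """
    return 1.0 - 1.0 / r


def fit_logistic(x, y, w, steps=2000, lr=0.5):
    """Weighted logistic regression q(x) = a*x + b, fitted by gradient descent on BCE."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * x + b)))
        grad = w * (p - y)  # derivative of the weighted BCE with respect to the logit
        a -= lr * np.sum(grad * x) / np.sum(w)
        b -= lr * np.sum(grad) / np.sum(w)
    return a, b


def fake_factor(x, a, b):
    """F(x) = exp(q(x)): the exponential of the classifier logit."""
    return np.exp(a * x + b)


a, b = fit_logistic(x, y, w)
# For these two Gaussians the exact log-ratio is 0.5*x - 0.125.
print("fitted logit: q(x) = %.3f * x + %.3f" % (a, b))
print("F at x = 0, 1, 2:", fake_factor(np.array([0.0, 1.0, 2.0]), a, b))
```

Because both classes carry the same total weight here, the logit estimates the log of the density ratio directly. With unequal totals the logit picks up an additional constant equal to the log of the ratio of total weights, which matters if the fake factor is meant to carry the normalization as well as the shape.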

Model Architecture and Training

Architecture: The authors utilize a pre-activation ResNet with four residual blocks, each containing two layers of 128 neurons. This architecture mitigates vanishing gradients and allows for stable training of deeper networks compared to standard feed-forward networks.
Input Processing: Numerical features are standardized, and categorical features are label-encoded and embedded. An embedding layer maps features to a higher-dimensional space, followed by mean pooling.
Loss Function: The training uses binary cross-entropy with a squared regularization term to prevent exploding densities. For the subtraction classifiers, a soft absolute activation ensures non-negative outputs; for the ratio classifier, a linear activation is used.
Training: The model is trained using the AdamW optimizer with early stopping based on validation loss.
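
The architecture described above can be sketched as a forward pass in NumPy. Only the overall shape (four residual blocks, two 128-unit layers each, activation applied before each linear layer) comes from the summary. The ReLU activation, the He-style initialization, the omission of normalization and embedding layers, and all function names are assumptions made for illustration, and no training loop is shown.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def init_linear(rng, n_in, n_out):
    # He-style initialization (an assumption, chosen because the sketch uses ReLU).
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)


def init_model(rng, n_features, width=128, n_blocks=4):
    """Input projection, n_blocks residual blocks of two layers each, scalar output."""
    W_in, b_in = init_linear(rng, n_features, width)
    blocks = []
    for _ in range(n_blocks):
        W1, b1 = init_linear(rng, width, width)
        W2, b2 = init_linear(rng, width, width)
        blocks.append({"W1": W1, "b1": b1, "W2": W2, "b2": b2})
    W_out, b_out = init_linear(rng, width, 1)
    return {"W_in": W_in, "b_in": b_in, "blocks": blocks,
            "W_out": W_out, "b_out": b_out}


def preact_block(h, p):
    """Pre-activation residual block: the activation precedes each linear layer."""
    z = relu(h) @ p["W1"] + p["b1"]
    z = relu(z) @ p["W2"] + p["b2"]
    return h + z  # skip connection keeps an identity path for gradients


def forward(x, params):
    """Map standardized features x of shape (batch, n_features) to one logit per event."""
    h = x @ params["W_in"] + params["b_in"]
    for p in params["blocks"]:
        h = preact_block(h, p)
    q = relu(h) @ params["W_out"] + params["b_out"]
    return q[:, 0]


rng = np.random.default_rng(0)
params = init_model(rng, n_features=5)  # e.g. five kinematic inputs
x = rng.normal(size=(8, 5))             # a toy batch of 8 standardized events
q = forward(x, params)
print(q.shape)
```

The skip connection means a block whose weights are zero passes its input through unchanged, which is the property that keeps deeper stacks trainable.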

Key Contributions

Continuous, Unbinned Estimation: The method provides a per-event fake factor, eliminating binning artifacts and discontinuities inherent in histogram-based methods.
High-Dimensional Flexibility: By leveraging neural networks, the method can incorporate multiple correlated kinematic variables simultaneously, capturing complex dependencies that traditional binned methods cannot due to the "curse of dimensionality."
Improved Extrapolation: The continuous nature of the estimator allows for smoother and more stable extrapolation from the control region to the signal region.
Validation Framework: The authors demonstrate a robust two-step validation procedure (subtraction and ratio) using ATLAS Open Data, ensuring the method correctly handles real-lepton contamination.

Results
The method was validated using an analysis of $W \to e\nu$ events from ATLAS Run 2 data.

Control Region (CR): The ML-based method showed good agreement with the traditional binned method in the CR. While the binned method performed slightly better in low- $p_T$ regions with high statistics, the ML method demonstrated superior modeling in variables like $E^{miss}_T$ and $m_T$ , which are difficult to include in binned analyses due to statistical constraints.
Signal Region (SR): When extrapolating to the SR ( $m_T > 60$ GeV), the ML-based method provided significantly better predictions in both shape and normalization compared to the binned method. The binned method exhibited larger discrepancies and systematic mis-modeling, particularly in distributions of $E^{miss}_T$ and $m_T$ , due to its reliance on coarse binning and limited variable inclusion.
Stability: The ML approach produced smoother distributions with reduced statistical fluctuations, particularly in regions with lower event counts or complex correlations.

Significance and Claims
The paper claims that the ML-based Fake Factor method represents a significant advancement in data-driven background estimation for high-energy physics. By moving from discrete, low-dimensional binning to continuous, high-dimensional density ratio estimation, the method:

Mitigates common limitations such as binning selection bias and extrapolation uncertainties.
Enhances the ability to model complex correlations between variables.
Improves the sensitivity of searches for rare signals by providing more accurate background estimates, thereby reducing the risk of spurious signals arising from mis-modeling.

The authors emphasize that while the method was demonstrated on a simple $W$ -boson analysis, its framework is inherently adaptable to multi-lepton final states and other mis-identified objects. They note that future work will focus on integrating systematic uncertainty estimation and applying the method to more complex LHC analyses searching for new physics. The code for the implementation is made publicly available.
