A linear PDF model for Bayesian inference

This paper introduces a computationally efficient Bayesian framework for determining Parton Distribution Functions (PDFs). It uses low-dimensional linear models, derived from neural-network bases, to provide robust uncertainty estimates and transparent control over model selection, and it is validated with synthetic data and closure tests.

Original authors: Mark N. Costantini, Luca Mantani, James M. Moore, Maria Ubiali

Published 2026-04-14

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to reconstruct a shattered vase, but you don't have all the pieces. You only have a few shards (experimental data) and you need to guess what the whole vase looked like. In the world of particle physics, this "vase" is the proton, and the "shards" are data from the Large Hadron Collider (LHC).

The "pieces" of the proton are called partons (quarks and gluons). To understand how the proton behaves, physicists need a map called a Parton Distribution Function (PDF). This map tells us how likely it is to find a specific parton carrying a given fraction of the proton's momentum.

The problem? We can't see the map directly. We have to guess it based on the shards we have. If we guess wrong, our predictions for future experiments will be off.

Here is a simple breakdown of what this paper does to solve that guessing game.

1. The Old Way: Trying to Draw with a Wobbly Hand

Traditionally, physicists tried to draw this map using a complex, flexible shape (like a neural network). Think of this like trying to draw a perfect circle using a very wobbly, high-tech pen.

  • The Good: It's very flexible and can draw almost anything.
  • The Bad: It's incredibly hard to calculate the "uncertainty." If you ask, "How sure are you that this line is right?" the math gets so messy and heavy that computers take forever to answer. It's like trying to count every single grain of sand on a beach to estimate the size of the beach.

2. The New Idea: The "Magic Skeleton"

The authors of this paper say, "Let's stop trying to draw with the wobbly pen. Let's build a skeleton first."

They use a mathematical trick called Proper Orthogonal Decomposition (POD).

  • The Analogy: Imagine you have a thousand different photos of people running. If you stack them all on top of each other, you can see the "average" pose and the most common ways the body moves (the skeleton).
  • The Process: They took a massive library of "possible" proton maps (generated by a neural network) and found the most important "skeleton pieces" (basis functions) that describe them all.
  • The Result: Instead of a wobbly, complex shape, they now have a linear model. This is like building the vase using a set of pre-made, straight Lego bricks. You just decide how many bricks to use and where to put them.
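The "skeleton" extraction above can be sketched in a few lines. This is a toy illustration, not the authors' code: the replica ensemble, the grid, and the 99.9% variance cutoff are all made-up stand-ins, and the POD is computed via a singular value decomposition of the mean-subtracted samples.

```python
import numpy as np

# Hypothetical stand-in for the paper's setup: each row is one sampled
# proton-map "replica" evaluated on a grid of momentum fractions x.
rng = np.random.default_rng(0)
x = np.linspace(0.01, 1.0, 50)
replicas = np.array([
    (1 + 0.1 * rng.standard_normal()) * x**(-0.2) * (1 - x)**3
    + 0.05 * rng.standard_normal() * np.sin(4 * np.pi * x)
    for _ in range(1000)
])

# POD: subtract the mean, take an SVD, and keep the leading modes.
mean = replicas.mean(axis=0)
_, s, vt = np.linalg.svd(replicas - mean, full_matrices=False)

# Keep just enough modes to capture 99.9% of the ensemble's variance.
energy = np.cumsum(s**2) / np.sum(s**2)
n_modes = int(np.searchsorted(energy, 0.999)) + 1
basis = vt[:n_modes]                      # the "Lego bricks"

# Any replica is now approximated by the mean plus a few coefficients.
coeffs = (replicas[0] - mean) @ basis.T
reconstruction = mean + coeffs @ basis
```

The point of the exercise: a thousand wobbly curves collapse onto a handful of basis functions, and every curve is now summarized by a short coefficient vector.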

3. Why "Linear" is a Superpower

Because the new model is built from these "Lego bricks" (linear), the math becomes much simpler.

  • The Analogy: Imagine you are baking a cake. The old way was mixing ingredients in a giant, chaotic blender where you couldn't taste anything until it was done. The new way is like having a recipe where you add ingredients one by one. You can taste the batter at every step.
  • The Benefit: This allows them to use Bayesian Inference. In simple terms, Bayesian inference is a rigorous way of updating your beliefs. "I thought the vase looked like X, but now that I see this new shard, I'm 90% sure it looks like Y." Because the math is now linear (simple), this updating step can be written down exactly instead of simulated, so it runs incredibly fast.
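Here is a minimal sketch of why linearity is the superpower. Assuming a linear model and Gaussian errors (all the numbers and names below are illustrative, not from the paper), the Bayesian update has an exact closed-form answer: two matrix formulas instead of a long simulation.

```python
import numpy as np

# Illustrative setup: data = F @ w + noise, where w are the "brick" coefficients.
rng = np.random.default_rng(1)
n_data, n_modes = 30, 5
F = rng.standard_normal((n_data, n_modes))   # linear map: coefficients -> predictions
w_true = rng.standard_normal(n_modes)        # the "true" coefficients
sigma = 0.1                                  # data error
d = F @ w_true + sigma * rng.standard_normal(n_data)

# Prior belief: w ~ N(0, tau^2 I).  With a linear model, the posterior is
# also Gaussian, N(mu, Sigma), and both pieces are computed directly.
tau = 1.0
Sigma_inv = F.T @ F / sigma**2 + np.eye(n_modes) / tau**2
Sigma = np.linalg.inv(Sigma_inv)             # posterior covariance (the "how sure")
mu = Sigma @ (F.T @ d) / sigma**2            # posterior mean (the "best guess")
```

No sampling, no long chains: the belief update is just linear algebra, which is what makes the method fast enough for large datasets.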

4. The "Goldilocks" Problem (Model Selection)

One of the biggest headaches in science is deciding how complex your model should be.

  • Too Simple (Underfitting): You use too few Lego bricks. The vase looks blocky and doesn't match the shards.
  • Too Complex (Overfitting): You use too many bricks. You force the model to fit the "noise" or errors in the data, making it look perfect for the current shards but wrong for the real vase.
  • The Solution: The authors use a "Goldilocks" strategy. They let the data itself tell them how many bricks are needed. They use a statistical tool (Bayesian Evidence) that automatically penalizes models that are too complicated unless the data really demands that extra complexity. It's like a strict editor who cuts out unnecessary words in a story unless those words add real value.
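The "strict editor" can also be sketched in code. In a linear-Gaussian setting the Bayesian evidence (the probability of the data given a model with k bricks, averaged over all possible brick settings) is itself a closed-form Gaussian integral. The toy below is not the paper's implementation: the basis, noise level, and prior width are all invented for illustration.

```python
import numpy as np

# Toy model-selection demo: the true signal uses 3 "bricks", but we offer
# the fitter up to 8 and let the evidence decide.
rng = np.random.default_rng(2)
n_data, k_true = 40, 3
F_full = rng.standard_normal((n_data, 8))    # 8 candidate basis functions
w = np.zeros(8)
w[:k_true] = np.array([1.0, -0.8, 0.6])      # only the first 3 really matter
sigma, tau = 0.1, 1.0
d = F_full @ w + sigma * rng.standard_normal(n_data)

def log_evidence(k):
    """log p(data | model with k bricks), with the weights integrated out."""
    F = F_full[:, :k]
    # Marginal covariance of the data under this model size.
    C = tau**2 * F @ F.T + sigma**2 * np.eye(n_data)
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (d @ np.linalg.solve(C, d) + logdet + n_data * np.log(2 * np.pi))

scores = [log_evidence(k) for k in range(1, 9)]
best_k = int(np.argmax(scores)) + 1          # evidence peaks near the true complexity
```

Too few bricks can't explain the data (the first term blows up); too many bricks pay an automatic "Occam penalty" through the determinant term. The peak lands at the Goldilocks size without anyone tuning it by hand.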

5. The "Fake Data" Test (Closure Test)

How do we know this new method works? You can't just trust it; you have to test it.

  • The Analogy: Imagine you invent a new metal detector. To test it, you bury a specific coin in the sand, then use your detector to find it. If the detector finds the coin exactly where you buried it, and correctly estimates how deep it is, you know it works.
  • The Paper's Test: They created "fake" data (synthetic data) based on a known "true" proton map. They then tried to recover that map using their new method.
  • The Result: It worked. The method recovered the "true" map and, crucially, gave an accurate estimate of how sure it was about that map. This showed that their "Lego skeleton" approach is robust and doesn't get confused by the noise.
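A closure test in miniature, under the same hedges as before (a hypothetical polynomial "truth", invented noise levels, not the authors' pipeline): bury a known answer, generate fake data from it, fit, and check whether the truth sits inside the fitted uncertainty band.

```python
import numpy as np

# 1) Pick a known "true" law -- the buried coin.
rng = np.random.default_rng(3)
x = np.linspace(0.05, 0.95, 25)
F = np.vander(x, 4)                          # simple polynomial basis as a stand-in
w_true = np.array([0.5, -1.0, 0.2, 1.5])

# 2) Generate noisy synthetic data from it -- the sand.
sigma = 0.05
d = F @ w_true + sigma * rng.standard_normal(x.size)

# 3) Fit with the linear-Gaussian machinery (wide prior, closed-form posterior).
tau = 10.0
Sigma = np.linalg.inv(F.T @ F / sigma**2 + np.eye(4) / tau**2)
mu = Sigma @ F.T @ d / sigma**2

# 4) Closure criterion: each true coefficient should sit within a few
#    posterior standard deviations of the fitted value.
std = np.sqrt(np.diag(Sigma))
pulls = (mu - w_true) / std
```

If the "pulls" are routinely much larger than a few standard deviations, the method is overconfident; if they are always tiny, it is too timid. Landing in between is what "accurate uncertainties" means in practice.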

Summary: Why Should We Care?

The Large Hadron Collider is about to enter a new phase (High-Luminosity) where it will produce massive amounts of data. The old methods are too slow, and their uncertainties too hard to pin down, to handle this flood of information.

This paper introduces a fast, flexible, and mathematically rigorous way to map the proton. By turning a chaotic, complex problem into a simple "Lego" problem, they allow physicists to:

  1. Speed up calculations significantly.
  2. Trust the uncertainty estimates (knowing exactly how much they don't know).
  3. Prepare for the future of particle physics, ensuring that when we discover something new, we know it's real and not just a glitch in the math.

In short, they built a better, faster, and more honest ruler to measure the building blocks of our universe.
