Machine Learning for Predicting the Proton Structure… — Plain-Language Explanation

Original authors: Shahin Atashbar Tehrani, Elham Astaraki, Fatemeh Arbabifar

Published 2026-06-05. Author reviewed.

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine the proton as a tiny, bustling city inside an atom. Inside this city, messengers called "quarks" and "gluons" zoom around. Physicists want to know exactly how these messengers are distributed and how they move. To figure this out, they smash particles together in giant machines and look at the results. One of the most important things they measure is called the Proton Structure Function ($F_2^p$). You can think of this function as a detailed "weather map" of the proton city, showing how busy it is in different areas.

Traditionally, to draw this map, scientists have to solve incredibly difficult math puzzles (called DGLAP equations). It's like trying to predict the weather by solving complex fluid dynamics equations from scratch every time. It takes a lot of time and requires making many assumptions.

The New Approach: Teaching a Computer to "See" the Pattern

This paper asks a different question: What if we just show a computer hundreds of real snapshots of the weather map and let it learn the patterns on its own, without solving the math puzzles?

The authors used Machine Learning (ML), a type of artificial intelligence that learns from data, to predict this proton "weather map." They didn't solve the physics equations; instead, they fed the computer real experimental data from a famous experiment called BCDMS and asked four different types of "student" algorithms to learn the map.

The Four Students

The researchers tested four different AI "students" to see who could learn the map best:

The Multilayer Perceptron (MLP): Think of this as a super-creative artist. It has many layers of neurons (like a deep brain) that allow it to see very complex, squiggly, and non-linear patterns. It's great at capturing the wild, chaotic parts of the proton city.
The Gaussian Process Regression (GPR): This student is like a cautious cartographer. It doesn't just draw a line; it draws a line and a "fog" around it to show how confident it is. If the data is sparse (like a foggy area on the map), GPR admits, "I'm not 100% sure here," rather than guessing wildly.
The Support Vector Regression (SVR): This student is the steady veteran. It focuses on finding the most stable, reliable path. It ignores tiny, noisy details that might be mistakes in the data, focusing only on the big, clear trends.
The Gradient Boosting Regression (GBR): This student is a team of detectives. It starts with a rough guess, then sends out a new "detective" to fix the mistakes of the previous one, over and over again, until the picture is clear.
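
For readers who want to see the "team of detectives" idea in action, here is a minimal sketch of boosting in plain Python. It is an illustration only, not the authors' implementation: the one-split "stump" learner, the function names `fit_stump` and `boost`, the number of rounds, the learning rate, and the toy data are all assumptions chosen to keep the example small.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    # Skip the largest x so the right-hand side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boosting for squared error: each round fits the current mistakes."""
    base = sum(ys) / len(ys)          # the "rough first guess"
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)   # a new "detective" studies the mistakes
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Toy data made up for this example (NOT the BCDMS measurements).
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
model = boost(xs, ys)
```

Each round only nudges the prediction a little (the learning rate), which is why many small corrections are needed before the picture becomes clear.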

The Results: Who Won?

After training these students on the data and testing them on new, unseen data, here is what happened:

The Artist (MLP) and the Cartographer (GPR) were the best at accuracy. The MLP student managed to draw the most detailed and accurate map, capturing the complex, non-linear twists and turns of the proton's structure better than anyone else. The GPR student came in a very close second and was excellent at knowing when to say, "I'm uncertain."
The Veteran (SVR) was the most stable. While it wasn't the absolute most accurate, it was the most consistent. It didn't get confused by different chunks of data. If you gave it a slightly different set of training photos, it would still draw a very similar map. This makes it very reliable when the data is messy or noisy.
The Detectives (GBR) did well but had a slight flaw. They learned the main patterns well but were a little too eager to memorize the tiny, random "noise" in the data, making their predictions on new data slightly less sharp than the top two.

The Big Takeaway

The most important finding is that these AI models learned the actual physics of the proton without being told the rules of the game (the math equations).

They didn't just memorize the data points; they learned the underlying "rules" of how the proton behaves.
The fact that the "training" (learning) and "testing" (exam) scores were so close is strong evidence that they didn't just cheat by memorizing the answers. They picked up the genuine pattern.

Why This Matters

This study shows that Machine Learning is a powerful new tool for physicists. Instead of struggling with heavy math equations to predict how protons behave, they can now use these AI "emulators" to quickly and accurately predict the proton's structure function. It's like having a GPS that learns from real traffic patterns rather than trying to calculate traffic flow from first principles.

The paper concludes that while the traditional math methods are still the foundation, these AI tools are excellent "co-pilots" that can fill in the gaps, especially in areas where we don't have enough experimental data yet.

Technical Summary: Machine Learning for Predicting the Proton Structure Function $F_2^p$ in QCD

Problem Statement
The determination of the proton's partonic structure remains a central objective in Quantum Chromodynamics (QCD). Traditionally, the momentum distribution of quarks and gluons, characterized by the proton structure function $F_2^p(x, Q^2)$, is analyzed by solving the Dokshitzer-Gribov-Lipatov-Altarelli-Parisi (DGLAP) evolution equations. While successful, this conventional approach relies on specific functional form assumptions, sophisticated fitting strategies, and significant computational resources. There is a growing interest in exploring model-independent, purely data-driven techniques that can complement these theoretical frameworks, particularly in regimes where theoretical assumptions might be supplemented by flexible, nonparametric learning.

Methodology
The authors present a comparative study utilizing four supervised machine learning regression algorithms to predict $F_2^p(x, Q^2)$ directly from high-precision experimental data, bypassing the explicit numerical solution of DGLAP equations.

Dataset: The study employs the BCDMS dataset, comprising 703 measurements of the proton structure function across a wide range of the Bjorken scaling variable $x$ and squared four-momentum transfer $Q^2$.
Preprocessing: Numerical features ($x$ and $Q^2$) are standardized to ensure convergence and stability. The study strictly avoids data augmentation or synthetic generation, relying solely on original experimental measurements.
Models Evaluated:
1. Support Vector Regression (SVR): Utilizes an $\epsilon$-insensitive loss function with a Radial Basis Function (RBF) kernel to control model complexity and robustness.
2. Gradient Boosting Regression (GBR): Constructs an additive model of decision trees to minimize a differentiable loss function iteratively.
3. Gaussian Process Regression (GPR): Models the latent function as a Gaussian process with an RBF kernel, providing natural uncertainty estimates.
4. Multilayer Perceptron (MLP): A feedforward neural network optimized via Mean Squared Error (MSE) minimization, leveraging universal approximation capabilities.
Validation Strategy: To ensure statistical robustness, the authors employ $k$-fold cross-validation (specifically 5-fold) rather than a single train-test split. Hyperparameters for all models are optimized via grid search within the cross-validation loop.
Evaluation Metrics: Performance is assessed using the Coefficient of Determination ($R^2$), Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). Detailed residual analysis and learning curves are used to detect overfitting and systematic biases.
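
The preprocessing, validation, and metric steps above can be sketched in plain Python. This is a minimal illustration rather than the authors' code: the function names (`standardize`, `k_fold_indices`, `regression_metrics`), the shuffling seed, and the way points are assigned to folds are assumptions made here, and model fitting and the grid search are left out.

```python
import math
import random


def standardize(column):
    """Rescale one feature column (e.g. x or Q^2) to zero mean and unit variance."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for a constant column
    return [(v - mean) / std for v in column]


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # seed is an arbitrary choice
    folds = [idx[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, MSE and RMSE for paired true and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    mse = ss_res / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"R2": 1.0 - ss_res / ss_tot, "MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}
```

With 703 points and `k=5`, this split gives three folds of 141 points and two of 140, and each point is used for testing exactly once. A common precaution is to compute the standardization statistics on each training fold only, so that no information leaks from the test fold; the summary does not say how the authors handled this.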

Key Results

Predictive Accuracy: The MLP and GPR models demonstrated superior predictive accuracy on the held-out test set. The MLP achieved the highest $R^2$ score of 0.7310, followed closely by GPR at 0.7231. Both outperformed SVR (0.7080) and GBR (0.7062) in raw accuracy.
Stability and Robustness: While MLP showed the highest accuracy, it exhibited significant variance during cross-validation ($\pm 0.2238$), indicating sensitivity to data partitioning. In contrast, SVR demonstrated the highest stability, with the lowest standard deviation ($\pm 0.0412$) around a mean cross-validation $R^2$ of 0.6204, making it particularly robust against experimental uncertainties.
Generalization: All models showed convergence between training and cross-validation metrics, with no significant divergence indicating overfitting. Notably, GPR and MLP exhibited negative "overfitting" values (that is, they performed slightly better on the validation folds than on the training data), suggesting effective regularization and the successful capture of underlying physical trends rather than noise.
Residual Analysis: Residual distributions for MLP and GPR were tightly centered around zero with near-Gaussian symmetry and no systematic bias across the kinematic $(x, Q^2)$ plane. SVR showed slightly higher dispersion but maintained unbiased, kinematically independent residuals.
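
The stability and generalization figures above are simple summaries of per-fold scores. The sketch below shows one way to compute them, assuming that the "overfitting" value is the mean training $R^2$ minus the mean validation $R^2$ and that the quoted $\pm$ spread is a standard deviation across folds. The summary does not spell out either convention, the name `summarize_cv` is hypothetical, and the fold scores in the usage lines are made up for illustration.

```python
import statistics


def summarize_cv(train_scores, val_scores):
    """Summarize per-fold R^2 scores from k-fold cross-validation."""
    mean_train = statistics.mean(train_scores)
    mean_val = statistics.mean(val_scores)
    return {
        "mean_val": mean_val,
        # Population standard deviation across folds (an assumed convention).
        "std_val": statistics.pstdev(val_scores),
        # Assumed definition: positive gap = overfitting, negative = validation better.
        "gap": mean_train - mean_val,
    }


# Made-up fold scores, NOT values from the paper.
summary = summarize_cv(
    train_scores=[0.70, 0.72, 0.71, 0.69, 0.73],
    val_scores=[0.73, 0.74, 0.72, 0.71, 0.75],
)
print(summary["gap"] < 0)  # True: validation slightly better than training
```

A small `std_val` corresponds to the kind of stability reported for SVR, while a large one corresponds to the sensitivity to data partitioning reported for MLP.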

Key Contributions

Data-Driven Framework: The work establishes a data-driven framework that captures the complex nonlinear dynamics of partonic structure without solving DGLAP equations, offering a complementary approach to perturbative QCD analyses.
Comparative Analysis: It provides a rigorous comparison of four distinct regression algorithms (SVR, GBR, GPR, MLP) specifically applied to proton structure function prediction, highlighting the trade-offs between peak accuracy (MLP), probabilistic uncertainty estimation (GPR), and statistical stability (SVR).
Validation of ML in QCD: The study demonstrates that ML models can learn the underlying QCD physics from experimental data alone, as evidenced by the absence of systematic biases and the ability to generalize to unseen kinematic regions.

Significance and Claims
The authors claim that machine learning regression serves as a powerful, complementary tool for structure function analysis in high-energy physics. The significance of this work lies in its demonstration that:

ML models can effectively approximate the dependence of $F_2^p$ on $x$ and $Q^2$ in a model-independent manner.
These models are capable of reliable interpolation between sparse measurements and extrapolation into unmeasured kinematic regimes (e.g., extremely low $x$ or high $Q^2$).
The approach offers a flexible alternative for scenarios where perturbative calculations are computationally expensive or where experimental data is sparse, potentially serving as fast surrogates for theoretical calculations.

The paper concludes modestly, noting that while ML offers a flexible approach, future work should focus on integrating these models with fundamental theoretical constraints (such as sum rules and positivity requirements) and extending the framework to other structure functions (e.g., $F_L$ and $g_1$) to further enhance physical consistency.

Machine Learning for Predicting the Proton Structure Function $F_2^P$ in QCD
