Learning the S-matrix from data: Rediscovering gravity from gauge theory via symbolic regression

This paper demonstrates that symbolic regression applied to numerical on-shell data can autonomously rediscover fundamental analytic structures in scattering amplitudes, including KLT, Kleiss-Kuijf, and BCJ relations, thereby establishing a data-driven strategy for uncovering hidden theoretical connections like the gravity-gauge duality without relying on prior group-theoretic knowledge.

Original authors: Nathan Moynihan

Published 2026-02-18

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to solve a massive, cosmic mystery. The "crime scene" is the subatomic world, where particles crash into each other and bounce off like billiard balls. Physicists call these collisions scattering amplitudes. They are complex mathematical recipes that tell us the probability of a particle doing one thing versus another.

For decades, physicists have been writing these recipes by hand, using incredibly difficult math. But recently, they started asking: What if we just let a computer look at the data and figure out the recipe for us?

This paper tries exactly that. It uses a special type of Artificial Intelligence called Symbolic Regression to rediscover some of the most famous "laws of the universe" regarding how particles interact, specifically how Gravity (the force that holds planets in orbit) is secretly related to Gauge Theory (the framework describing forces such as the strong force that binds atomic nuclei).

Here is the story of how they did it, explained with simple analogies.

1. The Problem: The "Black Box" vs. The "Recipe Book"

Most modern AI (like the chatbots you use) is a Black Box. You give it a question, and it gives you an answer. It's great at guessing, but it doesn't tell you how it got there. It's like a chef who makes a delicious cake but refuses to tell you the ingredients or the steps. In physics, we don't just want the answer; we want the recipe (the mathematical formula) so we can understand the underlying laws of nature.

Symbolic Regression is different. Instead of just guessing numbers, it tries to find the actual equation. It's like a detective who doesn't just say "the butler did it," but actually writes out the step-by-step logic of how the butler did it, using only the clues found at the scene.

2. The Setup: The "Lego" of the Universe

The researchers focused on two types of particles:

Gluons: The "glue" particles that bind quarks together inside protons and neutrons (Gauge Theory).
Gravitons: The hypothetical particles that carry gravity.

There is a famous, beautiful relationship between them called the KLT Relation. It's like a secret code that says: "If you take two sets of Gluon recipes, mix them together with a specific spice (called a Mandelstam invariant), you get the Gravity recipe."

The goal of the paper was to see if a computer could look at thousands of numbers representing Gluon collisions and figure out this secret code on its own, without the physicists telling it the code exists.

3. The Method: The "Data Shrinker"

The computer was fed a massive amount of data:

The Ingredients: Lists of numbers built from the energies and momenta of the colliding particles (Mandelstam invariants).
The Output: The results of Gluon collisions.
The Target: The results of Gravity collisions.

The Challenge: The data was messy. It was like trying to find a specific sentence in a library where every book has been shredded and mixed together. There were too many redundant numbers (features).

The Solution (CPQR): The researchers used a mathematical tool called Column-Pivoted QR factorization.

Analogy: Imagine you have a giant pile of ingredients for a soup. Some are just watered-down versions of others (e.g., "salt" and "salty water"). The computer acts like a smart sous-chef who looks at the pile and says, "We don't need all these; we just need these 5 specific spices to make the soup."
By removing the redundant data, the computer automatically rediscovered two famous mathematical rules (the KK and BCJ relations) that physicists already knew. It found them just by looking for patterns in the numbers, without being told they exist.

4. The Discovery: Re-inventing Gravity

Once the data was cleaned up, the Symbolic Regression engine went to work. It started mixing and matching the remaining ingredients (Gluon results and energy numbers) to see what combination produced the Gravity result.

At 4 particles: It was easy. The computer found the formula almost instantly. It rediscovered the KLT relation, effectively saying, "Aha! Gravity is just Gluons multiplied by each other and some energy numbers!"
At 5 particles: It got harder, but the computer still found the answer. It took a bit longer, but it successfully wrote down the complex formula that connects the two forces.
At 6 particles: The computer hit a wall. The number of possible combinations exploded. It's like trying to solve a Rubik's cube that keeps getting bigger every time you turn a side. The computer got overwhelmed by the sheer number of possibilities (a "combinatorial explosion") and couldn't find the simple answer in the time allowed.

5. The Comparison: The "Translator" vs. The "Detective"

The paper also compared their method (Symbolic Regression) to a newer method using Neural Networks (Deep Learning).

The Neural Network (The Translator): Imagine you have a long, complicated sentence in a foreign language. A Neural Network is like a translator that has read millions of books. It can look at the long sentence and instantly spit out a short, simple version. It's great at simplifying things it has seen before, but it might "hallucinate" (make up a sentence that looks right but is wrong).
Symbolic Regression (The Detective): This method doesn't know the answer beforehand. It looks at the raw data points (the "clues") and builds the formula from scratch. It's slower and needs help to know which clues are important, but the result is an explicit equation that anyone can check. If the formula matches fresh data it has never seen, to the limit of numerical precision, that is strong evidence it is correct.

The Big Takeaway

This paper is a proof of concept. It shows that AI can be a partner in discovery, not just a calculator.

What worked: The AI successfully "re-discovered" the deep connection between Gravity and Gauge Theory from numerical data, with only minimal physical hints and without being told the rules.
What's next: The AI is currently stuck on the "6-particle" problem because the math gets too messy. The authors suggest a hybrid future: use the Neural Network to clean up the messy data (like a translator simplifying a text), and then use Symbolic Regression to find the final, perfect formula (like a detective solving the case).

In short, the researchers taught a computer to look at the chaos of particle collisions and whisper back the elegant, hidden laws of the universe. It's a step toward a future where we don't just calculate the universe, but let the universe teach us its own secrets.

1. Problem Statement

The paper addresses the challenge of autonomously reconstructing fundamental analytic structures in quantum field theory (QFT) scattering amplitudes directly from numerical data, without relying on prior knowledge of the underlying algebraic formulas.

The Context: Scattering amplitudes (specifically in Yang-Mills theory and General Relativity) possess rich, hidden analytic structures, such as the Kleiss-Kuijf (KK) and Bern-Carrasco-Johansson (BCJ) relations, and the Kawai-Lewellen-Tye (KLT) double-copy relations which express gravity amplitudes as products of gauge theory amplitudes.
The Gap: While Deep Neural Networks (DNNs) can predict numerical outcomes, they lack interpretability and cannot "discover" the underlying physical laws or algebraic forms. Standard symbolic regression (SR) struggles with the combinatorial explosion of features in high-multiplicity scattering problems.
The Goal: To demonstrate that modern machine learning, specifically Symbolic Regression (SR) combined with linear algebraic feature selection, can rediscover these flagship relations (KK, BCJ, KLT) and the Parke-Taylor formula using only numerical on-shell data and minimal theoretical priors.

2. Methodology

The authors propose a data-driven pipeline that integrates linear algebraic dimensionality reduction with symbolic regression. The workflow consists of four main stages:

A. Data Generation

Kinematics: Random on-shell kinematics are generated in the center-of-mass frame for massless bosons in 4D.
Signature: To keep the data real-valued, the authors analytically continue to (2, 2) signature, where the spinors are real and independent, so all spinor products are real.
Inputs and targets:
- Inputs: Color-ordered Yang-Mills partial amplitudes ( $A_n$ ) and Mandelstam invariants ( $s_{ij}$ ).
- Outputs: Graviton amplitudes ( $M_n$ ) computed independently via Hodges' formula (to avoid biasing the search toward KLT forms).
Preprocessing: Data is rescaled to be dimensionless and $O(1)$ to improve numerical stability.
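
The kinematic setup above can be sketched as follows. This is a minimal illustration, not the paper's generator: it assumes real two-component spinors (as in (2, 2) signature) and enforces momentum conservation by projecting random `lam_tilde` columns onto the orthogonal complement of the `lam` columns. It does not use a center-of-mass parametrization or the paper's rescaling. The function names and the sign convention $s_{ij} = \langle ij \rangle [ji]$ are assumptions made for this example.

```python
import numpy as np

def random_real_spinors(n, rng):
    """Real spinors (lam, lam_tilde), each of shape (n, 2), obeying
    momentum conservation: sum_i lam[i, a] * lam_tilde[i, b] = 0."""
    lam = rng.normal(size=(n, 2))
    raw = rng.normal(size=(n, 2))
    # Orthonormal basis of the column space of lam (shape (n, 2)).
    q, _ = np.linalg.qr(lam)
    # Remove the part of `raw` lying in that column space, so that
    # lam.T @ lam_tilde vanishes up to round-off.
    lam_tilde = raw - q @ (q.T @ raw)
    return lam, lam_tilde

def brackets(sp):
    """Antisymmetric matrix of products sp[i,0]*sp[j,1] - sp[i,1]*sp[j,0]."""
    return np.outer(sp[:, 0], sp[:, 1]) - np.outer(sp[:, 1], sp[:, 0])

def mandelstams(lam, lam_tilde):
    """s_ij = <ij>[ji] (an assumed sign convention)."""
    return brackets(lam) * brackets(lam_tilde).T

rng = np.random.default_rng(0)
lam, lam_tilde = random_real_spinors(5, rng)
s = mandelstams(lam, lam_tilde)
# Both quantities should sit at round-off level for valid kinematics.
print("total momentum:", np.abs(lam.T @ lam_tilde).max())
print("row sums of s_ij:", np.abs(s.sum(axis=1)).max())
```

Each row of `s` summing to zero is the massless momentum-conservation identity $\sum_j s_{ij} = 0$, which gives a quick sanity check on any generated phase-space point.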

B. Feature Selection via CPQR (Column-Pivoted QR)

Instead of using Principal Component Analysis (PCA), which produces non-interpretable linear combinations, the authors use Column-Pivoted QR (CPQR) factorization.

Mechanism: CPQR selects a minimal subset of the original columns (features) that spans the column space of the data matrix, up to a numerical tolerance.
Discovery of Redundancies:
- When applied to the matrix of color-ordered amplitudes, CPQR identifies the rank as $(n-2)!$ , automatically revealing the Kleiss-Kuijf (KK) relations as the linear dependencies removed.
- When applied to composite features (products of Mandelstam invariants and amplitudes), CPQR identifies Bern-Carrasco-Johansson (BCJ) relations (degree-1 syzygies) as linear dependencies where coefficients are linear polynomials in $s_{ij}$ .
Benefit: This reduces a combinatorially large feature set to a minimal, independent basis without imposing group-theoretic constraints a priori.
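
A small sketch of the CPQR step, under stated assumptions: the columns are the Parke-Taylor factors $1/(\langle \sigma_1 \sigma_2 \rangle \cdots \langle \sigma_n \sigma_1 \rangle)$ for all $(n-1)!$ orderings with leg 1 fixed, evaluated on random real spinors, and the rank is read off from the diagonal of $R$ with a relative tolerance. The helper names, sample count, tolerance, and row normalization are choices made for this illustration, not details taken from the paper.

```python
import itertools
import math
import numpy as np
from scipy.linalg import qr

def parke_taylor_matrix(n, n_samples, rng):
    """Rows: random kinematic points. Columns: orderings (1, sigma) of the
    Parke-Taylor factor 1 / (<s1 s2><s2 s3>...<sn s1>)."""
    orderings = [(0,) + p for p in itertools.permutations(range(1, n))]
    rows = []
    for _ in range(n_samples):
        lam = rng.normal(size=(n, 2))  # real spinors
        ang = np.outer(lam[:, 0], lam[:, 1]) - np.outer(lam[:, 1], lam[:, 0])
        rows.append([1.0 / np.prod([ang[o[k], o[(k + 1) % n]] for k in range(n)])
                     for o in orderings])
    mat = np.array(rows)
    # Row scaling leaves linear relations among columns intact and
    # improves conditioning.
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)
    return mat, orderings

def independent_columns(mat, rel_tol=1e-9):
    """Column-pivoted QR: numerical rank and the pivot columns kept."""
    _, r, piv = qr(mat, mode="economic", pivoting=True)
    diag = np.abs(np.diag(r))
    rank = int(np.sum(diag > rel_tol * diag[0]))
    return rank, piv[:rank]

rng = np.random.default_rng(1)
for n in (4, 5):
    mat, orderings = parke_taylor_matrix(n, 80, rng)
    rank, kept = independent_columns(mat)
    labels = [tuple(i + 1 for i in orderings[j]) for j in kept]
    print(f"n={n}: {len(orderings)} orderings, rank {rank}, "
          f"(n-2)! = {math.factorial(n - 2)}")
    print("  kept orderings:", labels)
```

Because the pivots index original columns, the surviving features remain individual color orderings rather than mixtures, which is the interpretability advantage over PCA noted above.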

C. Symbolic Regression (SR)

Algorithm: The authors use PySR, searching for expressions in a space generated by operators $\{+, -, \times, /\}$ .
Feature Engineering: To constrain the search space, they incorporate minimal physical priors:
- Mass Dimension: Ensuring the output matches the dimension of the target amplitude.
- Little-Group Scaling: Ensuring the correct helicity weights.
- Composite Features: Instead of raw inputs, they feed bilinears of amplitudes ( $A \tilde{A}$ ) multiplied by Mandelstam polynomials, as dictated by the double-copy structure.
Basis Selection: A decision-tree model is used to scan different orderings of gluon amplitudes to find the "basis" that minimizes the complexity of the resulting kernel function.
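
To make the search concrete, here is a toy stand-in for symbolic regression. It is not PySR and not the paper's setup: it simply enumerates every expression of depth at most two over the operators $\{+, -, \times, /\}$, fits one overall coefficient by least squares on a training split, and ranks candidates by held-out error. The data are synthetic random numbers, with the target built from the hidden formula $-s_{12} A \tilde{A}$; all names here are hypothetical.

```python
import itertools
import numpy as np

OPS = {"+": np.add, "-": np.subtract, "*": np.multiply, "/": np.divide}

def combine(left, right):
    """All binary combinations (name, values) of two expression pools."""
    out = []
    for (ln, lv), (rn, rv) in itertools.product(left, right):
        for sym, fn in OPS.items():
            with np.errstate(all="ignore"):
                out.append((f"({ln} {sym} {rn})", fn(lv, rv)))
    return out

def toy_symbolic_search(features, y, n_train):
    """Exhaustive search over depth <= 2 expressions with one fitted scale."""
    leaves = list(features.items())
    depth1 = combine(leaves, leaves)
    depth2 = combine(depth1, leaves)
    best = None
    for name, vals in leaves + depth1 + depth2:
        train, test = vals[:n_train], vals[n_train:]
        if not np.all(np.isfinite(vals)) or not np.any(train):
            continue  # skip undefined or identically zero candidates
        coeff = (train @ y[:n_train]) / (train @ train)
        err = (np.linalg.norm(coeff * test - y[n_train:])
               / np.linalg.norm(y[n_train:]))
        if best is None or err < best[0]:
            best = (err, coeff, name)
    return best

# Synthetic stand-in data (NOT real amplitudes).
rng = np.random.default_rng(2)
n_pts, n_train = 200, 150
features = {k: rng.uniform(0.5, 2.0, n_pts) for k in ("s12", "s13", "A", "At")}
y = -features["s12"] * features["A"] * features["At"]

err, coeff, expr = toy_symbolic_search(features, y, n_train)
print(f"best: {coeff:+.3f} * {expr}   held-out relative error: {err:.1e}")
```

Even with four inputs and depth two, the enumeration already covers over a thousand candidates, which hints at why priors such as mass dimension and little-group scaling are needed to prune the space before a real search.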

D. Benchmarking

The method is compared against a recent Transformer-based Neural Network approach (Cheung et al.) designed for symbolic simplification.

3. Key Contributions & Results

A. Rediscovery of Linear Relations (KK & BCJ)

The CPQR step successfully identified the KK relations (reducing $(n-1)!$ to $(n-2)!$ amplitudes) and BCJ relations (reducing to $(n-3)!$ ) purely from numerical data.
Significance: This shows that linear algebraic dependencies in scattering data correspond directly to known physical symmetries and can be recovered without any group-theoretic input.
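
As an illustration of how a BCJ relation shows up as a linear dependency among composite features, the sketch below builds four features at 4 points, $\{s_{13}, s_{14}\} \times \{A(1,2,3,4), A(1,2,4,3)\}$, and looks for a null vector. Several things are assumptions of this example rather than the paper's procedure: an SVD null-space search is used in place of CPQR, the amplitudes are Parke-Taylor factors with the common numerator dropped, and the convention is $s_{ij} = \langle ij \rangle [ji]$.

```python
import numpy as np

def brackets(sp):
    """Antisymmetric matrix of products sp[i,0]*sp[j,1] - sp[i,1]*sp[j,0]."""
    return np.outer(sp[:, 0], sp[:, 1]) - np.outer(sp[:, 1], sp[:, 0])

def momentum_conserving_spinors(n, rng):
    """Real spinors with sum_i lam[i, a] * lam_tilde[i, b] = 0."""
    lam = rng.normal(size=(n, 2))
    raw = rng.normal(size=(n, 2))
    q, _ = np.linalg.qr(lam)
    return lam, raw - q @ (q.T @ raw)

def composite_feature_row(rng):
    """[s13*A1234, s14*A1234, s13*A1243, s14*A1243], normalized per sample."""
    lam, lam_tilde = momentum_conserving_spinors(4, rng)
    ang, sq = brackets(lam), brackets(lam_tilde)
    s = ang * sq.T  # s_ij = <ij>[ji]

    def pt(order):
        return 1.0 / np.prod([ang[order[k], order[(k + 1) % 4]]
                              for k in range(4)])

    a1234, a1243 = pt((0, 1, 2, 3)), pt((0, 1, 3, 2))
    row = np.array([s[0, 2] * a1234, s[0, 3] * a1234,
                    s[0, 2] * a1243, s[0, 3] * a1243])
    return row / np.linalg.norm(row)

rng = np.random.default_rng(3)
features = np.array([composite_feature_row(rng) for _ in range(50)])
_, sing, vt = np.linalg.svd(features)
null_vec = vt[-1] / vt[-1][1]  # normalize the s14*A1234 coefficient to 1
print("singular values:", sing)
print("null vector:", np.round(null_vec, 8))
```

The single vanishing singular value signals one relation, and the null vector encodes $s_{14} A(1,2,3,4) - s_{13} A(1,2,4,3) = 0$, leaving $(n-3)! = 1$ independent amplitude at 4 points.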

B. Rediscovery of the Parke-Taylor Formula

The pipeline successfully rediscovered the closed-form Parke-Taylor formula for MHV gluon amplitudes.
Performance: With physics-motivated priors (restricting to angle brackets $\langle ij \rangle$ and enforcing chirality), the SR found the correct expression in $O(10^3)$ seconds for $n=4,5,6$ , whereas a raw search failed.
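
The role of the little-group prior can be checked directly on the Parke-Taylor formula. The sketch below, with hypothetical function names and with couplings and overall phases dropped, evaluates $\langle 12 \rangle^4 / (\langle 12 \rangle \langle 23 \rangle \cdots \langle n1 \rangle)$ on random real spinors and measures how it scales when one spinor is rescaled, $\lambda_i \to t \lambda_i$. It assumes the convention that an amplitude scales as $t^{-2h_i}$.

```python
import numpy as np

def angle(lam):
    """Matrix of angle brackets <ij> for real spinors lam of shape (n, 2)."""
    return np.outer(lam[:, 0], lam[:, 1]) - np.outer(lam[:, 1], lam[:, 0])

def parke_taylor(lam, neg=(0, 1)):
    """MHV amplitude <ij>^4 / (<12><23>...<n1>), with i, j = neg the
    negative-helicity legs; couplings and phases are dropped."""
    ang = angle(lam)
    n = len(lam)
    cyclic = np.prod([ang[k, (k + 1) % n] for k in range(n)])
    return ang[neg[0], neg[1]] ** 4 / cyclic

def little_group_weight(lam, leg, t=1.7):
    """Exponent w with A(lam_leg -> t * lam_leg) = t**w * A."""
    scaled = lam.copy()
    scaled[leg] *= t
    ratio = parke_taylor(scaled) / parke_taylor(lam)
    return np.log(abs(ratio)) / np.log(t)

rng = np.random.default_rng(4)
for n in (4, 5, 6):
    lam = rng.normal(size=(n, 2))
    weights = [round(little_group_weight(lam, leg)) for leg in range(n)]
    print(f"n={n}: scaling weights {weights}")
```

Legs 1 and 2 (helicity $-1$) carry weight $+2$ and all others (helicity $+1$) carry weight $-2$. Any candidate expression with different weights can be discarded before fitting, which is how such a prior shrinks the search space.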

C. Rediscovery of KLT Relations

4-Point & 5-Point: The method successfully rediscovered the KLT relations (e.g., $M_4 = -s_{12} A_4 \tilde{A}_4$ ) with a numerical error of $O(10^{-16})$ , i.e. at the level of double-precision round-off.
6-Point Challenge: The method encountered a combinatorial explosion at 6 points. The search space for the KLT kernel (a degree-3 polynomial in Mandelstams) became too large for current SR algorithms to converge within reasonable compute time (8 hours), even with aggressive feature engineering.
Insight: The difficulty arises because different bases of amplitudes can make the same physical object look like a simple polynomial or a complex rational function with spurious poles. The SR struggles to guess the "simple" basis without further guidance.
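
A 4-point example makes the basis dependence explicit. It assumes the relation quoted above refers to the orderings $A_4(1,2,3,4)$ and $\tilde{A}_4(1,2,4,3)$ (the usual field-theory KLT form; the orderings are not written out in this summary) and uses the 4-point BCJ relation $s_{14} A_4(1,2,3,4) = s_{13} A_4(1,2,4,3)$:

```latex
\begin{aligned}
M_4 &= -s_{12}\, A_4(1,2,3,4)\, \tilde{A}_4(1,2,4,3) \\
    &= -\frac{s_{12}\, s_{14}}{s_{13}}\, A_4(1,2,3,4)\, \tilde{A}_4(1,2,3,4)
\end{aligned}
```

In the first basis the kernel is the polynomial $-s_{12}$ ; in the second, the same amplitude requires a rational kernel. A search restricted to polynomial kernels would succeed in one basis and fail in the other, which is the kind of ambiguity that grows rapidly with multiplicity.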

D. Comparison with Neural Networks

Neural Networks (Transformers): Excel at symbolic-to-symbolic rewriting. They can take a complex 298-term expression and simplify it to a 2-term expression by learning algebraic identities (Schouten, etc.). However, they risk "hallucinating" incorrect expressions that must be validated.
Symbolic Regression: Excels at numeric-to-symbolic discovery. It infers the functional form directly from numerical evaluations. It is more robust (candidates can be verified immediately on held-out data) but heavily dependent on the quality of the feature set.
Conclusion: The approaches are complementary. Neural networks can act as a pre-processor to simplify inputs or suggest promising feature combinations, while SR performs the final compression and discovery of analytic forms.

4. Significance and Future Outlook

Data-Driven Physics: The paper establishes a framework for "uncovering hidden relations in general theories" using only numerical data, suggesting a path to discovering new physics in regimes where analytic solutions are unknown.
Interpretability: Unlike black-box DNNs, this method produces human-readable algebraic expressions, bridging the gap between machine learning and theoretical discovery.
Limitations: The primary bottleneck is the combinatorial growth of the search space at higher multiplicities ( $n \geq 6$ ).
Future Directions:
- Factorization Bootstrap: Learning amplitudes by probing near physical poles ( $s_I \to 0$ ) to isolate residues, simplifying the learning task.
- Hybrid Pipelines: Combining neural simplification (to reduce input complexity) with symbolic regression (to find the final analytic form).
- Extensions: Applying the method to loop-level amplitudes (introducing transcendental functions) and string theory KLT relations.
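
The combinatorial growth named under Limitations can be made tangible with a rough count. The estimate below is a back-of-the-envelope sketch, not the paper's feature count. It assumes a KLT kernel of degree $n-3$ in the Mandelstams (consistent with the degree-3 kernel at 6 points mentioned above), $n(n-3)/2$ independent Mandelstam invariants (ignoring Gram-determinant constraints in 4D), and $((n-3)!)^2$ amplitude bilinears built from BCJ bases.

```python
from math import comb, factorial

def rough_klt_feature_count(n):
    """Rough number of (Mandelstam monomial) x (amplitude bilinear) features."""
    n_mandelstams = n * (n - 3) // 2   # independent s_ij, Gram constraints ignored
    degree = n - 3                     # assumed degree of the KLT kernel
    # Monomials of exactly this degree in n_mandelstams variables.
    monomials = comb(n_mandelstams + degree - 1, degree)
    bilinears = factorial(n - 3) ** 2  # pairs drawn from two BCJ bases
    return monomials * bilinears

for n in (4, 5, 6, 7):
    print(f"n={n}: ~{rough_klt_feature_count(n)} candidate features")
```

Under these assumptions the count grows from 2 at 4 points to 60 at 5 points and 5940 at 6 points, before any search over how the features are combined.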

In summary, this work demonstrates that Symbolic Regression, when guided by linear algebraic feature selection (CPQR) and minimal physical priors, is a powerful tool for reconstructing the analytic structure of the S-matrix, successfully rediscovering foundational relations in gauge and gravity theories from numerical data alone.