Symmetry-Constrained Language-Guided Program Synthesis for Discovering Governing Equations from Noisy and Partial Observations

Imagine you are a detective trying to solve a mystery: What are the hidden rules of nature?

Scientists have always wanted to find simple, elegant formulas (like $F=ma$) that explain how the world works, from the swing of a pendulum to the spread of a virus. But in the real world, data is messy. It's full of static (noise), missing pieces (unobserved variables), and confusing patterns.

The paper introduces a new detective tool called SymLang. Think of it as a super-smart, physics-aware AI that doesn't just guess; it reasons its way to the truth.

Here is how SymLang works, broken down into simple analogies:

1. The Problem: The "Infinite Library" Trap

Imagine you are trying to find a specific book in a library that contains every possible sentence ever written, including gibberish like "Purple clouds eat Tuesday."

Old methods (like genetic programming) would randomly pick books, read them, and see if they make sense. This takes forever because 99.9% of the books are nonsense.
Other methods (like SINDy) only look in a small, pre-selected section of the library. If the answer is in the "Science Fiction" section but they only look in "History," they will never find it.

2. The Solution: SymLang's Three Superpowers

SymLang solves this by combining three distinct ideas into one powerful workflow.

A. The "Grammar of Physics" (The Filter)

Before the AI even starts guessing, it builds a filter based on the laws of physics.

The Analogy: Imagine a strict librarian who says, "You cannot write a sentence where a 'Time' word is added to a 'Distance' word. That makes no sense!"
How it works: SymLang uses Dimensional Analysis (checking units like meters vs. seconds) and Symmetry (checking if the rules change if you flip the world upside down).
The Result: It throws away about 71% of the candidate equations before it even tries to fit them to data. It only looks at sentences that could physically exist.

B. The "Intuitive Detective" (The Guide)

Once the library is filtered down to only "sensible" books, SymLang uses a Large Language Model (LLM), the same kind of AI that powers chatbots, to guess the answer.

The Analogy: Instead of picking a book at random, the detective looks at the clues (the data) and says, "Hmm, the data looks like a pendulum. I bet the answer involves sine waves and gravity."
How it works: The AI is trained on hundreds of thousands of examples drawn from known physical systems. It looks at a summary of the messy data and proposes the most likely formulas, skipping the ones that are mathematically possible but physically unlikely.

C. The "Truth Detector" (The Judge)

Finally, SymLang doesn't just pick the "best" answer and stop. It asks, "Are we sure?"

The Analogy: Imagine a jury. Instead of just one foreman saying "Guilty," the jury simulates 200 different trials with slightly different evidence.
How it works:
- If the answer is the same in all 200 trials, SymLang says, "We are 100% sure."
- If the jury is split 50/50 between two different formulas, SymLang says, "We are confused. The data isn't clear enough to pick one."
- Crucially: Many equation-discovery tools return a single confident answer even when they are wrong. SymLang is honest. It admits when the data is insufficient.

3. Why This Matters (The Results)

The authors tested SymLang on 133 different scientific problems, from electricity to population growth, with very noisy data.

It's More Efficient: It needed roughly 4 times less data than the next best method to reach the same accuracy, because it didn't waste effort on nonsense.
It's Stronger: Even when 50% of the system's variables were hidden (like trying to solve a puzzle with half the pieces missing), SymLang still found the right answer 61% of the time, compared with about 38% for a competing method.
It's Honest: When the data was too messy to solve, SymLang raised a red flag saying, "I can't tell." Other methods just gave a confident, wrong answer.

The Big Picture

Think of SymLang as a physics-aware GPS.

Old GPS: "I think the destination is here." (Even if it's in the middle of a lake).
SymLang: "Based on the laws of physics, you can't drive through water. Also, the map is blurry, so I can't be 100% sure of the route, but here are the top 3 possibilities, and here is where I need more data to be certain."

This framework bridges the gap between raw, messy data and the clean, beautiful laws of physics that scientists have been chasing for centuries. It turns "data mining" into "scientific discovery."

Here is a detailed technical summary of the paper "Symmetry-Constrained Language-Guided Program Synthesis for Discovering Governing Equations from Noisy and Partial Observations."

1. Problem Statement

The discovery of compact, symbolic governing equations (e.g., Newton's laws, Maxwell's equations) from experimental data is a fundamental goal of quantitative science. However, existing automated methods face three critical limitations when dealing with real-world data:

Noise and Derivative Estimation: Experimental measurements are noisy, making derivative estimation (required for differential equations) unstable.
Partial Observability: Relevant state variables are often unobserved, meaning only projected or effective dynamics are accessible.
Structural Uncertainty: When multiple symbolic structures explain the data equally well, existing methods typically return a single "best" equation, providing no measure of uncertainty or structural degeneracy. This can lead to overinterpretation of finite-sample coincidences as fundamental laws.

Current approaches like Sparse Identification of Nonlinear Dynamics (SINDy) are limited by fixed operator libraries (missing equations outside the library), while neural-symbolic approaches (e.g., AI Feynman, DSR) often lack rigorous physical constraints or principled uncertainty quantification.

2. Methodology: The SymLang Framework

The authors introduce SymLang, a unified framework that integrates three distinct concepts: typed symmetry-constrained grammars, language-model-guided program synthesis, and MDL-regularized Bayesian model selection. The pipeline operates in five modular stages:

Stage 1: Preprocessing and Derivative Estimation

Raw observations are smoothed and differentiated to estimate $\dot{y}$. To handle noise, SymLang solves a variational problem using either:

Smoothing Splines: For smooth signals, minimizing a trade-off between data fidelity and second-derivative smoothness.
Total Variation (TV) Regularization: For discontinuous signals.
The method selection is automated via cross-validation on held-out data segments.
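
The summary does not spell out the two functionals, so the following is only a sketch using the textbook forms, with $\lambda$, $\alpha$, $f$, and $u$ as assumed symbols. For a smoothing spline, the fitted curve $\hat{f}$ balances fidelity to the samples $(t_i, y_i)$ against curvature, and the derivative estimate is $\dot{y}(t) \approx \hat{f}'(t)$:

$$\hat{f} = \arg\min_{f} \sum_{i=1}^{n} \bigl(y_i - f(t_i)\bigr)^2 + \lambda \int \bigl(f''(t)\bigr)^2 \, dt$$

For TV-regularized differentiation, one common formulation solves directly for the derivative $u \approx \dot{y}$ and penalizes its total variation, so that sharp changes are preserved rather than smeared out:

$$\hat{u} = \arg\min_{u} \frac{1}{2} \sum_{i=1}^{n} \Bigl(y_i - y_1 - \int_{t_1}^{t_i} u(s)\, ds\Bigr)^2 + \alpha \int \lvert u'(s) \rvert \, ds$$

In both cases a larger regularization weight ($\lambda$ or $\alpha$) gives a smoother estimate at the cost of fidelity to the raw measurements.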

Stage 2: Nondimensionalization and Unit Constraints

Variables are rescaled to dimensionless forms using characteristic scales derived from data statistics. This enables Buckingham $\Pi$ theorem application. A type-consistent grammar is constructed where every production rule enforces dimensional consistency (e.g., ensuring arguments to sin are dimensionless). This acts as a hard filter, eliminating physically impossible expressions before any fitting occurs.
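
To make the hard filter concrete, here is a minimal sketch of a dimensional-consistency check. It is not the paper's grammar: SymLang is described as enforcing these rules at generation time through typed production rules, whereas this sketch checks a finished expression tree after the fact, which is simpler to show. The tuple encoding of dimensions, the nested-tuple expression format, and the names `infer_dim` and `is_admissible` are all assumptions made for illustration.

```python
from typing import Dict, Tuple, Union

Dim = Tuple[int, ...]                # exponents of (mass, length, time)
Expr = Union[str, float, tuple]      # variable name, constant, or (op, *args)

DIMENSIONLESS: Dim = (0, 0, 0)


class DimensionError(ValueError):
    """Raised when an expression mixes incompatible units."""


def infer_dim(expr: Expr, env: Dict[str, Dim]) -> Dim:
    """Return the dimension of `expr`, or raise DimensionError."""
    if isinstance(expr, (int, float)):
        return DIMENSIONLESS         # assumption: bare numbers carry no units
    if isinstance(expr, str):
        return env[expr]
    op, *args = expr
    dims = [infer_dim(a, env) for a in args]
    if op in ("+", "-"):
        # Sums and differences require identical units on both sides.
        if dims[0] != dims[1]:
            raise DimensionError(f"cannot apply {op} to {dims[0]} and {dims[1]}")
        return dims[0]
    if op == "*":
        return tuple(a + b for a, b in zip(dims[0], dims[1]))
    if op == "/":
        return tuple(a - b for a, b in zip(dims[0], dims[1]))
    if op in ("sin", "cos", "exp"):
        # Transcendental functions only accept dimensionless arguments.
        if dims[0] != DIMENSIONLESS:
            raise DimensionError(f"{op} needs a dimensionless argument")
        return DIMENSIONLESS
    raise ValueError(f"unknown operator {op!r}")


def is_admissible(expr: Expr, env: Dict[str, Dim]) -> bool:
    """True if the expression is dimensionally consistent."""
    try:
        infer_dim(expr, env)
        return True
    except DimensionError:
        return False


# Hypothetical variables: a length x, a time t, an angular frequency omega.
env = {"x": (0, 1, 0), "t": (0, 0, 1), "omega": (0, 0, -1)}
print(is_admissible(("sin", ("*", "omega", "t")), env))  # True
print(is_admissible(("+", "x", "t"), env))               # False
print(is_admissible(("sin", "t"), env))                  # False
```

Rejecting `x + t` and `sin(t)` without ever fitting a constant is the point of the filter: the check depends only on units, not on the data.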

Stage 3: Symmetry-Constrained Grammar Construction

Beyond units, the framework encodes physical symmetries as hard production rules in a Context-Free Grammar (CFG):

Parity: Enforces odd/even dynamics (e.g., restoring forces).
Rotational Invariance: Restricts dependencies to group invariants (e.g., $\|x\|^2$) rather than individual Cartesian components.
Time-Translation: Removes explicit time dependence for autonomous systems.
Galilean/Lorentz Invariance: Enforces relative velocity/distance dependencies.
Impact: These constraints prune the search space by an average of 71.3% before any candidate evaluation.
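
As an illustration of how a symmetry can live inside the production rules themselves, the sketch below builds a toy grammar with two nonterminals, Odd and Even, for parity under $x \to -x$. Only the parity case is shown, the operator set is deliberately tiny, and the function names and expression encoding are assumptions rather than anything taken from the paper. Because no rule adds an odd term to an even one, expressions of mixed parity are never generated in the first place.

```python
import math


def generate(depth):
    """Return (odd, even): expressions of each parity up to `depth`."""
    if depth == 0:
        return ["x"], ["c"]  # x flips sign under x -> -x, a constant does not
    odd, even = generate(depth - 1)

    # Odd -> Odd | sin(Odd) | Odd + Odd | Odd * Even
    new_odd = list(odd)
    new_odd += [("sin", a) for a in odd]
    new_odd += [("+", a, b) for a in odd for b in odd]
    new_odd += [("*", a, b) for a in odd for b in even]

    # Even -> Even | cos(Odd) | cos(Even) | sin(Even)
    #       | Even + Even | Odd * Odd | Even * Even
    new_even = list(even)
    new_even += [("cos", a) for a in odd + even]
    new_even += [("sin", a) for a in even]
    new_even += [("+", a, b) for a in even for b in even]
    new_even += [("*", a, b) for a in odd for b in odd]
    new_even += [("*", a, b) for a in even for b in even]

    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(new_odd)), list(dict.fromkeys(new_even))


def evaluate(expr, x, c=0.7):
    """Numerically evaluate an expression tree at x (c is a stand-in constant)."""
    if expr == "x":
        return x
    if expr == "c":
        return c
    op, *args = expr
    vals = [evaluate(a, x, c) for a in args]
    if op == "+":
        return vals[0] + vals[1]
    if op == "*":
        return vals[0] * vals[1]
    if op == "sin":
        return math.sin(vals[0])
    if op == "cos":
        return math.cos(vals[0])
    raise ValueError(f"unknown operator {op!r}")


odd, even = generate(2)
# Spot-check the guarantee: every Odd expression satisfies f(-x) = -f(x).
assert all(abs(evaluate(e, 0.3) + evaluate(e, -0.3)) < 1e-9 for e in odd)
print(len(odd), "odd candidates,", len(even), "even candidates")
```

If the target is known to be an odd function (a restoring force, for example), the search only ever draws from `odd`, which is one way a constraint of this kind can shrink the candidate set before any fitting.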

Stage 4: Language-Guided Program Synthesis

A fine-tuned 7B-parameter decoder-only transformer acts as a "proposer."

Input: It receives a concise, interpretable descriptor vector of the data (spectral features, symmetry scores, correlation structures).
Output: It autoregressively generates S-expression strings representing expression trees consistent with the grammar's type system.
Training: The model is trained on 820,000 pairs of data summaries and expressions from known physical systems.
Efficiency: This guides the search toward plausible forms, reducing the need for random sampling.
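
The proposer's output is a string, so something downstream has to turn it back into an expression tree before constants can be fitted. The sketch below shows one way that step could look. The operator names (`add`, `mul`, `neg`, and so on) and the example string are hypothetical, since the summary does not specify the S-expression vocabulary.

```python
import math

# Hypothetical operator vocabulary; the paper's actual token set is not given.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
    "neg": lambda a: -a,
    "sin": math.sin,
    "cos": math.cos,
}


def tokenize(text):
    """Split an S-expression string into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Consume tokens from the front and return one expression tree."""
    tok = tokens.pop(0)
    if tok == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return tuple(node)
    if tok == ")":
        raise SyntaxError("unexpected ')'")
    try:
        return float(tok)   # numeric constant
    except ValueError:
        return tok          # variable or operator name


def evaluate(tree, env):
    """Evaluate a parsed tree given variable values in `env`."""
    if isinstance(tree, float):
        return tree
    if isinstance(tree, str):
        return env[tree]
    op, *args = tree
    return OPS[op](*(evaluate(a, env) for a in args))


# A pendulum-like right-hand side, written as a made-up proposal string.
tree = parse(tokenize("(neg (mul k (sin theta)))"))
print(tree)  # ('neg', ('mul', 'k', ('sin', 'theta')))
print(evaluate(tree, {"k": 2.0, "theta": math.pi / 2}))
```

Keeping the tree as nested tuples means the same structure can be passed to a type checker, a complexity counter, and a constant fitter without reparsing.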

Stage 5: Constant Fitting and Model Selection

Fitting: For each candidate structure, scalar constants are optimized using L-BFGS-B. A soft physical penalty is added if conserved quantities are detected in the data.
Selection (MDL): Candidates are ranked using Minimum Description Length (MDL) scoring, which balances likelihood against the complexity of encoding the tree structure and constants. This prevents overfitting.
Uncertainty Quantification: A block-bootstrap procedure assesses structural stability. If a structure's rank fluctuates significantly across resamples, it is flagged as unstable. The system also computes Fisher Information to detect non-identifiable parameters.
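
The selection and stability steps can be sketched together. The exact MDL formula is not given in this summary, so the score below uses a generic two-part form (a Gaussian data term, a cost per fitted constant, and a cost per tree node), which should be read as an assumption. For brevity the constants are held fixed rather than refitted on each resample, and the default of 200 resamples simply echoes the jury analogy earlier in this document. All function names are hypothetical.

```python
import math
import random


def mdl_score(residuals, n_constants, n_nodes, alphabet_size=8):
    """Two-part description length in nats (assumed form, lower is better)."""
    n = len(residuals)
    rss = sum(r * r for r in residuals) or 1e-300  # guard against log(0)
    data_cost = 0.5 * n * math.log(rss / n)        # Gaussian misfit term
    const_cost = 0.5 * n_constants * math.log(n)   # cost of fitted constants
    tree_cost = n_nodes * math.log(alphabet_size)  # cost of the structure
    return data_cost + const_cost + tree_cost


def block_indices(n, block_len, rng):
    """Indices for one moving-block bootstrap resample of length n."""
    idx = []
    while len(idx) < n:
        start = rng.randrange(0, n - block_len + 1)
        idx.extend(range(start, start + block_len))  # keep blocks contiguous
    return idx[:n]


def selection_frequencies(x, y, candidates, n_boot=200, block_len=10, seed=0):
    """Fraction of resamples in which each candidate has the lowest score.

    `candidates` maps a name to (function, n_constants, n_nodes).
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in candidates}
    for _ in range(n_boot):
        idx = block_indices(len(x), block_len, rng)
        scores = {}
        for name, (f, k, nodes) in candidates.items():
            residuals = [y[i] - f(x[i]) for i in idx]
            scores[name] = mdl_score(residuals, k, nodes)
        wins[min(scores, key=scores.get)] += 1
    return {name: w / n_boot for name, w in wins.items()}


# Synthetic example: noisy samples of -2*sin(x), with two rival structures.
rng = random.Random(1)
xs = [-3 + 6 * i / 199 for i in range(200)]
ys = [-2 * math.sin(v) + rng.gauss(0, 0.1) for v in xs]
candidates = {
    "sine": (lambda v: -2 * math.sin(v), 1, 4),
    "linear": (lambda v: -2 * v, 1, 3),
}
print(selection_frequencies(xs, ys, candidates))
```

A structure that wins in nearly every resample is stable. Frequencies split between two structures would be the kind of signal that gets reported as degeneracy instead of being collapsed into a single answer. Resampling contiguous blocks rather than individual points is what preserves the short-range time correlation in the data.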

3. Key Contributions

Unified Framework: The first system to combine typed grammars (for hard physical constraints), LLMs (for efficient search guidance), and Bayesian model selection (for uncertainty) in a single pipeline.
Hard Physical Constraints: By encoding dimensional analysis and group-theoretic invariance directly into the grammar, the method eliminates the vast majority of physically nonsensical candidates a priori.
Epistemic Honesty: Unlike baselines that return a single point estimate, SymLang explicitly reports structural degeneracy (e.g., "two equations are equally valid") and flags non-identifiable systems, preventing false scientific conclusions.
Partial Observability Handling: It offers strategies for effective dynamics learning and latent variable augmentation, outperforming baselines when state variables are hidden.

4. Experimental Results

The framework was evaluated on a benchmark of 133 dynamical systems across five domains (Classical Mechanics, Electrodynamics, Thermodynamics, Population Dynamics, Nonlinear Oscillators) under varying noise levels and partial observability.

Structural Recovery: Under 10% observational noise, SymLang achieved an 83.7% exact structural recovery rate. This is a 22.4 percentage-point improvement over the next-best baseline (PySR) and significantly outperforms SINDy and DSR.
Robustness to Noise: At 30% noise, recovery remained at 64.8%, whereas baselines dropped below 45%.
Extrapolation (OOD): SymLang reduced Out-of-Distribution (OOD) extrapolation error by 61% compared to PySR.
Physical Consistency: It nearly eliminated conservation-law violations (physical drift: $3.1 \times 10^{-3}$ vs. $187.3 \times 10^{-3}$ for the closest competitor).
Partial Observability: Under 50% state occlusion, SymLang recovered the correct structure in 61.2% of cases, compared to 38.4% for DSR. Crucially, it correctly identified 91.3% of non-identifiable systems as ambiguous, whereas all baselines returned a confident but incorrect equation.
Sample Efficiency: SymLang reached 80% recovery with ~4,800 time steps, requiring 4x fewer samples than PySR (~19,000 steps).
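
As a quick arithmetic check, derived only from the figures reported above and not from any additional results: the 22.4-point margin implies a PySR recovery rate of about 61.3% at 10% noise, the two sample counts give a ratio close to the stated 4x, and the drift values differ by a factor of roughly 60.

$$83.7\% - 22.4\% = 61.3\%, \qquad \frac{19{,}000}{4{,}800} \approx 3.96, \qquad \frac{187.3 \times 10^{-3}}{3.1 \times 10^{-3}} \approx 60.4$$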

5. Significance and Impact

Scientific Rigor: SymLang shifts the paradigm from "finding the best equation" to "characterizing the space of valid equations." By explicitly reporting uncertainty and degeneracy, it aids scientists in designing targeted experiments to resolve ambiguities.
Physical Auditability: Every discovered law is guaranteed to satisfy unit consistency and symmetry constraints by construction, ensuring scientific credibility that black-box neural models cannot provide.
Reproducibility: The framework is fully open-source, with a rigorous benchmark and reproducible experimental setup, setting a new standard for symbolic regression research.

In summary, SymLang represents a significant leap forward in automated scientific discovery, successfully bridging the gap between data-driven machine learning and first-principles physical reasoning to handle the noise, partial observability, and uncertainty inherent in real-world experiments.