Symmetry-Constrained Language-Guided Program Synthesis for Discovering Governing Equations from Noisy and Partial Observations

SymLang is an open-source framework that integrates symmetry-constrained grammars, language-model-guided program synthesis, and Bayesian model selection to robustly discover accurate, interpretable governing equations from noisy and partial observations, significantly outperforming existing baselines in structural recovery and physical consistency.

Mirza Samad Ahmed Baig, Syeda Anshrah Gillani

Published Tue, 10 Ma

Imagine you are a detective trying to solve a mystery: What are the hidden rules of nature?

Scientists have always wanted to find simple, elegant formulas (like F = ma) that explain how the world works, from the swing of a pendulum to the spread of a virus. But in the real world, data is messy. It's full of static (noise), missing pieces (unobserved variables), and confusing patterns.

The paper introduces a new detective tool called SymLang. Think of it as a super-smart, physics-aware AI that doesn't just guess; it reasons its way to the truth.

Here is how SymLang works, broken down into simple analogies:

1. The Problem: The "Infinite Library" Trap

Imagine you are trying to find a specific book in a library that contains every possible sentence ever written, including gibberish like "Purple clouds eat Tuesday."

  • Old methods (like genetic programming) would randomly pick books, read them, and see if they make sense. This takes forever because 99.9% of the books are nonsense.
  • Other methods (like SINDy) only look in a small, pre-selected section of the library. If the answer is in the "Science Fiction" section but they only look in "History," they will never find it.

2. The Solution: SymLang's Three Superpowers

SymLang solves this by combining three distinct ideas into one powerful workflow.

A. The "Grammar of Physics" (The Filter)

Before the AI even starts guessing, it builds a filter based on the laws of physics.

  • The Analogy: Imagine a strict librarian who says, "You cannot write a sentence where a 'Time' word is added to a 'Distance' word. That makes no sense!"
  • How it works: SymLang uses Dimensional Analysis (checking that units such as meters and seconds combine consistently) and Symmetry (checking that the rules stay the same under transformations such as flipping the world upside down).
  • The Result: It throws away 71% of all possible "nonsense" equations before it even tries to solve them. It only looks at sentences that could physically exist.
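To make the "strict librarian" concrete, here is a minimal sketch of a dimensional filter. It is an illustration under stated assumptions, not SymLang's actual grammar: the variable set, the `DIMS` table, and the `dim_of` helper are all hypothetical, and expressions are limited to one operator applied to two variables.

```python
from itertools import product

# Dimensions as (length, time) exponents. The variables are illustrative only.
DIMS = {
    "x": (1, 0),   # position, meters
    "t": (0, 1),   # time, seconds
    "v": (1, -1),  # velocity, meters per second
}

def dim_of(expr):
    """Return the dimension of an expression, or None if its units clash."""
    if isinstance(expr, str):
        return DIMS[expr]
    op, left, right = expr
    dl, dr = dim_of(left), dim_of(right)
    if dl is None or dr is None:
        return None
    if op == "+":
        # Addition is only meaningful between like quantities.
        return dl if dl == dr else None
    if op == "*":
        return tuple(a + b for a, b in zip(dl, dr))
    if op == "/":
        return tuple(a - b for a, b in zip(dl, dr))
    raise ValueError(f"unknown operator {op!r}")

# Every expression of the form (operator, variable, variable).
all_exprs = list(product("+*/", DIMS, DIMS))

# Keep only expressions whose units are consistent.
valid = [e for e in all_exprs if dim_of(e) is not None]

# Narrow further to expressions that could describe a velocity.
target = [e for e in valid if dim_of(e) == DIMS["v"]]
```

In this toy setting, 27 candidate expressions shrink to 21 with consistent units, and only two of those have the units of a velocity: `v + v` and `x / t`. The real search space is far larger, but the pruning idea is the same.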

B. The "Intuitive Detective" (The Guide)

Once the library is filtered down to only "sensible" books, SymLang uses a Large Language Model (LLM), the same kind of AI that powers modern chatbots, to guess the answer.

  • The Analogy: Instead of picking a book at random, the detective looks at the clues (the data) and says, "Hmm, the data looks like a pendulum. I bet the answer involves sine waves and gravity."
  • How it works: The AI is trained on millions of physics problems. It looks at the messy data and proposes the most likely formulas, skipping the ones that are mathematically possible but physically unlikely.
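The benefit of a guide can be sketched with a toy search. The example below does not call a language model: the `PRIOR` scores are hypothetical numbers standing in for a model's proposal probabilities, and the candidate list is invented for illustration. It only shows why trying the most plausible formulas first reduces the number of candidates that must be tested.

```python
import math

# Synthetic observations of a pendulum-like restoring term: y = -sin(theta).
thetas = [i * 0.1 for i in range(-20, 21)]
ys = [-math.sin(th) for th in thetas]

# Candidate formulas (illustrative, not SymLang's grammar).
CANDIDATES = {
    "theta": lambda th: th,
    "theta**2": lambda th: th ** 2,
    "exp(theta)": lambda th: math.exp(th),
    "cos(theta)": lambda th: math.cos(th),
    "-theta": lambda th: -th,
    "-sin(theta)": lambda th: -math.sin(th),
}

# Hypothetical plausibility scores standing in for a language model's output.
PRIOR = {
    "-sin(theta)": 0.45,
    "-theta": 0.25,
    "cos(theta)": 0.15,
    "theta": 0.08,
    "theta**2": 0.05,
    "exp(theta)": 0.02,
}

def mse(f):
    """Mean squared error of a candidate against the observations."""
    return sum((f(th) - y) ** 2 for th, y in zip(thetas, ys)) / len(ys)

def search(order, tol=1e-12):
    """Test candidates in the given order; return (name, number tested)."""
    for n, name in enumerate(order, start=1):
        if mse(CANDIDATES[name]) < tol:
            return name, n
    return None, len(order)

# Unguided: test candidates in arbitrary (insertion) order.
unguided = search(list(CANDIDATES))

# Guided: test the most plausible candidates first.
guided = search(sorted(CANDIDATES, key=PRIOR.get, reverse=True))
```

Both searches end at `-sin(theta)`, but the unguided one tests all six candidates while the guided one stops after the first. How well a real model ranks candidates on noisy data is an empirical question that this sketch does not address.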

C. The "Truth Detector" (The Judge)

Finally, SymLang doesn't just pick the "best" answer and stop. It asks, "Are we sure?"

  • The Analogy: Imagine a jury. Instead of just one foreman saying "Guilty," the jury simulates 200 different trials with slightly different evidence.
  • How it works:
    • If the answer is the same in all 200 trials, SymLang says, "We are 100% sure."
    • If the jury is split 50/50 between two different formulas, SymLang says, "We are confused. The data isn't clear enough to pick one."
    • Crucially: Many equation-discovery tools return a confident answer even when they are wrong. SymLang is honest. It admits when the data is insufficient.
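The "jury" idea can be approximated with a simple bootstrap: resample the data 200 times, let two competing formulas compete on each resample, and report how often each one wins. This is a simplified stand-in, not the paper's Bayesian model selection. The two `MODELS`, the 0.9 confidence threshold, and the synthetic data are all assumptions made for illustration.

```python
import random
from collections import Counter

# Two competing one-parameter formulas: y = a*x and y = a*x^2.
MODELS = {"linear": lambda x: x, "quadratic": lambda x: x * x}

def fit_rss(xs, ys, basis):
    """Least-squares fit of y = a * basis(x); return the residual sum of squares."""
    num = sum(basis(x) * y for x, y in zip(xs, ys))
    den = sum(basis(x) ** 2 for x in xs)
    a = num / den
    return sum((y - a * basis(x)) ** 2 for x, y in zip(xs, ys))

def bootstrap_votes(xs, ys, n_trials=200, seed=0):
    """Resample the data with replacement and count which model fits best."""
    rng = random.Random(seed)
    n = len(xs)
    votes = Counter()
    for _ in range(n_trials):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        winner = min(MODELS, key=lambda name: fit_rss(bx, by, MODELS[name]))
        votes[winner] += 1
    return {name: votes[name] / n_trials for name in MODELS}

def verdict(shares, threshold=0.9):
    """Commit to a model only if its vote share clears the threshold."""
    best = max(shares, key=shares.get)
    return f"confident: {best}" if shares[best] >= threshold else "ambiguous"

# Synthetic data from y = 2x with mild Gaussian noise.
data_rng = random.Random(1)
xs = [0.1 * i for i in range(1, 31)]
ys = [2 * x + data_rng.gauss(0, 0.1) for x in xs]

shares = bootstrap_votes(xs, ys)
result = verdict(shares)
```

On this clean, wide-ranging data the linear model wins nearly every resample, so `verdict` commits to it. If the vote were split, for example because the data covered too narrow a range to tell the two formulas apart, the same function would return "ambiguous" instead of forcing a choice.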

3. Why This Matters (The Results)

The authors tested SymLang on 133 different scientific problems, from electricity to population growth, with very noisy data.

  • It's Faster: It found the correct formula 4 times faster than the next best method because it didn't waste time on nonsense.
  • It's Stronger: Even when 50% of the data was hidden (like trying to solve a puzzle with half the pieces missing), SymLang still found the right answer 61% of the time, while others failed.
  • It's Honest: When the data was too messy to solve, SymLang raised a red flag saying, "I can't tell." Other methods just gave a confident, wrong answer.

The Big Picture

Think of SymLang as a physics-aware GPS.

  • Old GPS: "I think the destination is here." (Even if it's in the middle of a lake).
  • SymLang: "Based on the laws of physics, you can't drive through water. Also, the map is blurry, so I can't be 100% sure of the route, but here are the top 3 possibilities, and here is where I need more data to be certain."

This framework bridges the gap between raw, messy data and the clean, beautiful laws of physics that scientists have been chasing for centuries. It turns "data mining" into "scientific discovery."