Discovery of Interpretable Physical Laws in Materials via Language-Model-Guided Symbolic Regression
This paper introduces a framework that leverages large language models to guide symbolic regression, successfully discovering accurate, interpretable, and simplified physical laws for perovskite materials while drastically reducing the search space compared to traditional methods.
Original authors: Yifeng Guan, Chuyi Liu, Dongzhan Zhou, Lei Bai, Wan-jian Yin, Jingyuan Li, Mao Su
This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.
Imagine you are trying to figure out the secret recipe for the perfect chocolate cake. You have a huge list of ingredients: flour, sugar, eggs, salt, vanilla, cocoa, baking powder, and maybe even some weird stuff like glitter or motor oil.
The Problem: The "Blind Walk." Traditional scientists (using old-school math methods called Symbolic Regression) try to find the recipe by mixing and matching every single ingredient in every possible combination. They might try "Motor Oil + Sugar" or "Glitter + Eggs."
The Issue: This is like walking through a massive library blindfolded, looking for one specific book. It takes forever, and you might accidentally write down a recipe that tastes okay but makes no sense physically (like "add 5 gallons of motor oil"). The math works, but the physics is nonsense.
The New Solution: The "Smart Librarian." This paper introduces a new tool called LangLaw. Think of LangLaw as a Super-Intelligent Librarian (a Large Language Model, or AI) who has read every science textbook ever written.
Instead of letting the blindfolded walker search the whole library, the Librarian steps in first.
The Librarian's Job: Before the search begins, the Librarian looks at the ingredients and says, "Hey, we don't need motor oil or glitter for a cake. Let's ignore those. Also, we know that flour and sugar are the main players, so let's focus on those."
Guiding the Search: The Librarian gives the blindfolded walker a tiny, focused map of just the "Flour and Sugar" section of the library.
The Result: The walker finds the perfect recipe much faster. The recipe isn't just accurate; it makes sense. It tells you why the cake rises (because of the baking powder), not just that it does.
How It Works in the Real World (Materials Science)
The researchers tested this "Smart Librarian" on three tricky problems in materials science (making new types of rocks and metals):
How hard is the rock? (Bulk Modulus)
Old Way: Tried thousands of random math formulas. Some were accurate but looked like gibberish.
LangLaw Way: The AI knew that "how much an atom wants to steal an electron" matters. It guided the math to find a simple, clean formula that explains why some rocks are soft and others are hard.
How much light can the material absorb? (Band Gap)
Old Way: Created a super-complex equation with 10 different parts that was hard to understand.
LangLaw Way: Found a much shorter, simpler equation that did the exact same job. It's like finding a shortcut that saves you 90% of the walking time.
How good is the material at making fuel? (OER Activity)
Old Way: Needed a massive amount of data to learn, and often failed when given new, rare materials.
LangLaw Way: Even with very little data (like having only 18 cake recipes to learn from), the AI used its "common sense" (scientific knowledge) to predict how new materials would behave. It was twice as good at guessing new materials as the best deep-learning computers.
Why This Matters
Speed: It reduced the search space by a factor of 100,000. Imagine searching for a needle in a haystack, but the AI tells you, "The needle is actually in this tiny box right here."
Understanding: It doesn't just give you a number; it gives you a story. It explains the physical rules behind the material, not just the result.
Small Data: It works even when we don't have millions of data points (which is common in expensive science experiments).
In a Nutshell: LangLaw is like giving a brilliant, knowledgeable professor (the AI) a team of hardworking students (the math algorithms). The professor tells the students what to look for and what to ignore, so they don't waste time on nonsense. The result is a discovery that is not only correct but also easy for humans to understand and use to build better materials.
1. Problem Statement
The accurate prediction of physical properties in materials science is a critical objective, yet current approaches face a dichotomy:
Deep Learning (e.g., GNNs): While highly accurate, these methods operate as "black boxes," failing to provide explicit formulas or insights into underlying physical mechanisms.
Traditional Symbolic Regression (SR): Methods like genetic programming, SINDy, and HI-SISSO aim to discover explicit mathematical formulas. However, without prior physical knowledge, they often perform a "blind search" through a vast combinatorial space of possible expressions. This leads to:
Combinatorial Explosion: The search space becomes unmanageable.
Unphysical Results: Algorithms often merge statistically correlated but physically irrelevant variables, producing complex formulas that fit data well but lack physical interpretability.
LLM Limitations: While Large Language Models (LLMs) possess scientific knowledge, they struggle to directly process complex numerical patterns and high-dimensional data to extract valid mathematical structures on their own.
The core challenge is to discover interpretable, accurate, and simple physical laws from limited, high-dimensional materials data without falling into the trap of overfitting or unphysical complexity.
2. Methodology: The LangLaw Framework
The authors propose LangLaw, a hybrid framework that integrates the robust search capabilities of Symbolic Regression with the scientific reasoning and knowledge of Large Language Models. The system operates as an iterative loop:
LLM-Guided Feature Selection & Pruning:
The LLM (specifically Intern-S1, a multimodal foundation model enhanced for scientific reasoning) analyzes textual descriptions of input features (e.g., electronegativity, atomic radii).
Based on its embedded scientific knowledge, the LLM filters out physically irrelevant variables, even if they show statistical correlation.
It generates specific search parameters and selects a subset of relevant features, reducing the effective search space by a factor of approximately 10^5.
Symbolic Regression (SR) Engine:
A PySR-based engine (using multi-island genetic programming) performs the mathematical search within the constrained feature space defined by the LLM.
It evolves mathematical expressions (represented as tree structures) via selection, crossover, and mutation to find candidate formulas.
Continuous constants are optimized using gradient-based methods.
Experience Pool & Feedback Loop:
Results from each SR iteration (formulas, parameters, fitting errors) are stored in an "Experience Pool."
The LLM reviews this historical data to identify effective variable combinations and refine its instructions for the next round.
This feedback mechanism progressively narrows the search space toward physically meaningful equations.
Pareto Optimization:
The system outputs a set of formulas lying on the Pareto front, balancing high accuracy (low error) with low complexity (simplicity).
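The four-stage loop above can be sketched compactly. The sketch below is illustrative only, not the authors' implementation: the LLM is mocked by a hard-coded relevance filter, PySR's multi-island genetic programming is replaced by a toy search over single features and pairwise products, and all names (`llm_prune`, `toy_search`, `pareto_front`) are invented for this example.

```python
# Illustrative LangLaw-style loop (NOT the authors' code): mocked LLM
# feature pruning, a toy "SR engine", an experience pool, and Pareto
# front extraction balancing error against complexity.
import itertools
import math
import random

def llm_prune(features, relevant):
    """Stand-in for the LLM step: drop variables judged unphysical."""
    return [f for f in features if f in relevant]

def ols_1d(xs, ys):
    """Closed-form least squares for y = a*x + b; returns (a, b, rmse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    b = my - a * mx
    rmse = math.sqrt(sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / n)
    return a, b, rmse

def toy_search(rows, y, features):
    """Toy 'SR engine': try each feature and each pairwise product.
    In the real system, this pool would be fed back to the LLM so it can
    refine its feature selection on the next iteration."""
    pool = []  # the "experience pool": every candidate formula tried
    candidates = [(f,) for f in features] + list(itertools.combinations(features, 2))
    for cand in candidates:
        xs = [math.prod(row[f] for f in cand) for row in rows]
        a, b, rmse = ols_1d(xs, y)
        pool.append({"expr": "*".join(cand), "a": a, "b": b,
                     "rmse": rmse, "complexity": len(cand)})
    return pool

def pareto_front(pool):
    """Keep candidates not strictly dominated in (error, complexity)."""
    return [c for c in pool
            if not any(o["rmse"] <= c["rmse"]
                       and o["complexity"] <= c["complexity"]
                       and (o["rmse"], o["complexity"]) != (c["rmse"], c["complexity"])
                       for o in pool)]

# Demo on synthetic data: y = 3*x1*x2 + 1, with x3 irrelevant.
random.seed(0)
rows = [{"x1": random.random(), "x2": random.random(), "x3": random.random()}
        for _ in range(30)]
y = [3 * r["x1"] * r["x2"] + 1 for r in rows]
features = llm_prune(["x1", "x2", "x3"], relevant={"x1", "x2"})  # 3 candidates instead of 6
pool = toy_search(rows, y, features)
front = pareto_front(pool)
best = min(pool, key=lambda c: c["rmse"])  # recovers expr "x1*x2" with a near 3, b near 1
```

Even in this toy setting, the pruning step halves the candidate count before any search runs, which is the same mechanism that yields the ~10^5 reduction at real scale.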
3. Key Contributions
Novel Framework: Introduced LangLaw, the first framework to use LLMs not as end-to-end predictors, but as knowledge-guided search engines to direct Symbolic Regression.
Efficiency: Successfully mitigated the combinatorial explosion in SR, reducing the search space by a factor of ~10^5.
Interpretability: Discovered formulas that are not only accurate but also offer clear physical insights, avoiding the "black box" nature of deep learning and the "unphysical" nature of traditional SR.
Small-Data Robustness: Demonstrated that leveraging LLM priors allows for effective law discovery even when experimental data is scarce (a common issue in materials science).
4. Results
The framework was validated on three distinct materials property datasets:
A. Perovskite Bulk Modulus (B0)
Task: Predict mechanical stability.
Comparison: Outperformed the empirical formula by Verma & Kumar and the HI-SISSO method.
Discovery: Identified a linear formula: B0 = −(EA_B / IP_B) + 0.51·(a0·n_A + 25.7 − EN_B) − 1.75.
Insight: The formula revealed that bulk modulus is governed by the "softness" of the electron cloud (ratio of electron affinity to ionization potential) and ionic corrections.
Generalization: On Out-of-Distribution (OOD) data (double perovskites), LangLaw achieved significantly lower prediction errors than HI-SISSO, proving superior transferability.
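Taken at face value, the discovered formula is trivially cheap to evaluate. The helper below is a direct transcription under one possible reading of the printed grouping; the symbol names (EA_B, IP_B, a0, n_A, EN_B) come from the formula as printed, and the exact definitions and grouping should be checked against the original paper.

```python
# One possible reading of the printed bulk-modulus formula; the grouping
# and symbol meanings should be verified against the original paper.
def bulk_modulus(ea_b, ip_b, a0, n_a, en_b):
    """B0 = -(EA_B/IP_B) + 0.51*(a0*n_A + 25.7 - EN_B) - 1.75"""
    return -(ea_b / ip_b) + 0.51 * (a0 * n_a + 25.7 - en_b) - 1.75
```

The leading ratio term is the "electron-cloud softness" the authors highlight: a B-site with high electron affinity relative to its ionization potential lowers the predicted modulus.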
B. Band Gap of Lead-Free Double Perovskites
Task: Predict optoelectronic properties (Eg).
Comparison: Compared against SISSO.
Discovery: Found a concise formula: E_g = 0.056·(V_B^4 · X_X^3) + 2.66·R_X·V_A·X_B′^2.
Insight: Confirmed the dominance of valence electrons and anion radii. The LangLaw formula was more concise than the SISSO equivalent while maintaining similar accuracy, effectively treating minor variations (like square roots of valence) as constants.
C. OER Catalytic Activity of Perovskites
Task: Predict oxygen evolution reaction (OER) catalytic activity.
Comparison: Compared against GPSR (Genetic Programming SR).
Discovery: Identified a formula relating activity to geometric factors (octahedral factor μ and tolerance factor t).
Insight: The model revealed that the tolerance factor t has a negligible influence (coefficient ≈0.0016), suggesting activity is primarily driven by the local geometry (μ).
Performance: Achieved higher accuracy with fewer data points (18 samples) compared to GPSR.
Comparative Performance (Table 1)
vs. Deep Learning (CGCNN, ALIGNN): LangLaw significantly outperformed deep learning models on small datasets, particularly on OOD data where DL models overfit. For Bulk Modulus OOD data, LangLaw's RMSE (0.0851) was half that of ALIGNN and one-fifth that of CGCNN.
vs. LLM-SR: LangLaw produced simpler formulas with lower predictive errors compared to direct LLM-based SR approaches.
5. Significance
Paradigm Shift in Materials Discovery: Moves LLMs beyond text generation or simple prediction to acting as active scientific reasoning agents that shape the discovery of fundamental laws.
Bridging Data and Theory: Provides a practical tool to extract governing scientific laws from complex, real-world data, offering a middle ground between purely data-driven black boxes and purely theoretical derivations.
Scalability: The method is particularly valuable for domains where data is scarce, as the LLM's prior knowledge compensates for the lack of large training sets.
Open Science: The authors have released the code and datasets, facilitating reproducibility and further development in interpretable AI for science.
In conclusion, LangLaw demonstrates that integrating the reasoning capabilities of LLMs with the mathematical rigor of Symbolic Regression is a powerful strategy for discovering accurate, interpretable, and transferable physical laws in materials science.