VaSST: Variational Inference for Symbolic Regression using Soft Symbolic Trees

The Big Picture: Finding the "Recipe" in the Chaos

Imagine you are a detective trying to figure out the secret recipe for a delicious cake. You have a list of ingredients (flour, sugar, eggs) and the final taste of the cake, but you don't know the instructions.

Standard Machine Learning is like a chef who memorizes the taste of thousands of cakes. They can predict how a new cake will taste, but if you ask, "How did you make it?" they just say, "I used a black box." They can't write down the recipe.
Symbolic Regression is the detective's goal: to find the actual written recipe (the mathematical equation) that explains why the cake tastes the way it does.

The problem is that the universe of possible recipes is astronomically huge. It's like trying to find the one correct sentence in a library containing every possible combination of words in the English language. Most current methods are like a person randomly typing words on a keyboard, hoping to stumble upon a sentence. It takes forever, and they often get stuck typing gibberish.

Enter VaSST: The "Soft" Detective

The authors introduce VaSST (Variational Inference for Symbolic Regression using Soft Symbolic Trees). Here is how it works, broken down into three simple concepts:

1. The "Soft" Tree (The Clay Metaphor)

Imagine you are building a tree structure out of hard, rigid Lego bricks.

Old Methods: You have to snap the bricks together one by one. If you put a "plus" sign in the wrong spot, you have to take the whole thing apart and start over. This is slow and frustrating.
VaSST's Approach: Instead of hard bricks, VaSST uses soft, moldable clay.
- At the top of the tree, the clay isn't just "plus" or "minus." It's a mixture of both. It's 60% "plus" and 40% "minus."
- This "softness" allows the computer to use gradient descent (a smooth sliding motion) to find the best shape, rather than jumping around randomly. It's like molding a statue with your hands instead of chiseling it with a hammer.

2. The "Annealing" Process (The Cooling Metal)

You can't bake a cake with raw, liquid batter. Eventually, you need it to be solid.

VaSST starts with the "clay" very soft (high temperature), allowing it to explore many different shapes easily.
As the computer learns, it slowly cools down (a process called annealing).
The soft clay gradually hardens into specific, solid Lego bricks. By the end, the "mixture" of 60% plus and 40% minus has solidified into a definitive "plus" sign because that's what fit the data best.

3. The "Uncertainty" Superpower

Most detectives are confident they found the only answer. But what if there are two recipes that taste the same?

Because VaSST is built on probability, it doesn't just give you one answer. It gives you a confidence score.
It can say: "I am 90% sure the recipe is $A + B$, but there's a 10% chance it might be $A \times B$."
This is crucial for science. If a scientist is designing a bridge based on a formula, they need to know if that formula is a rock-solid fact or a risky guess. VaSST gives them an estimate of how risky the guess is.

Why Is This a Big Deal?

The paper compares VaSST to other top detectives (like Genetic Programming and Bayesian Machine Scientists) using famous physics equations (like gravity and electricity).

Speed: VaSST is much faster than the sampling-based Bayesian detectives (BMS and BSR). As the pile of data grew, VaSST stayed the quickest at finding the recipe.
Accuracy: It found the correct "recipes" (equations) even when the data was noisy (like a cake recipe tested with a broken scale).
Simplicity: It follows Occam's Razor. If a simple recipe explains the data, VaSST won't invent a complicated one with unnecessary ingredients. It naturally avoids "overfitting" (memorizing the noise instead of the law).

The Takeaway

VaSST is a new, super-smart way to discover the laws of nature from data.

It turns a messy, impossible puzzle (finding a needle in a haystack) into a smooth, sliding puzzle (molding clay).
It finds the simplest, most accurate mathematical "recipes" for how the world works.
And unlike other methods, it tells you how sure it is about its answer.

It's like giving scientists a GPS that doesn't just tell them where they are, but also draws the map of the terrain they are driving through, complete with a warning label saying, "This part of the map is a bit foggy, proceed with caution."

1. Problem Statement

Symbolic Regression (SR) aims to discover explicit, closed-form mathematical expressions from data that reveal underlying physical laws. While crucial for scientific discovery (SciML), existing methods face significant limitations:

Heuristic Search: Genetic Programming (GP) and evolutionary algorithms often suffer from high computational complexity, sensitivity to initialization, and the generation of overly complex formulas.
Data-Intensive ML: Neural network-based approaches (e.g., Transformers) often require large datasets, assume low-noise regimes, and lack interpretability.
Probabilistic Limitations: Fully probabilistic Bayesian methods (e.g., MCMC-based) struggle with the highly multimodal and combinatorial nature of the symbolic expression space. They often exhibit poor mixing, slow convergence, and inefficient exploration, leading to suboptimal structural recovery.
Lack of Uncertainty Quantification: Most existing methods provide a single "best" expression without principled uncertainty quantification regarding the structural form of the equation.

2. Methodology: The VaSST Framework

The authors propose VaSST (Variational Inference for Symbolic Regression using Soft Symbolic Trees), a scalable probabilistic framework that transforms the discrete combinatorial search into a continuous optimization problem.

A. Soft Symbolic Trees (Continuous Relaxation)

The core innovation is the representation of symbolic trees as Soft Symbolic Trees. Instead of discrete choices for operators and features, VaSST uses continuous relaxations:

Skeleton Structure: Each symbolic tree is embedded in a fixed-depth full binary tree skeleton.
Soft Variables:
- Expansion Indicator ($e_{j\zeta}$): Modeled via Binary Concrete relaxation (a continuous relaxation of Bernoulli) to determine if a node is a leaf or an internal node.
- Operator Assignment ($o_{j\zeta}$): Modeled via Gumbel-Softmax relaxation (a continuous relaxation of Categorical) to represent a soft mixture of all allowable unary/binary operators.
- Feature Assignment ($h_{j\zeta}$): Modeled via Gumbel-Softmax to represent a soft mixture of input features.
Evaluation: The evaluation of a soft tree is a recursive, differentiable process where node outputs are convex combinations of possible operations and features, weighted by the soft probabilities. This allows for gradient-based optimization via automatic differentiation.
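
The soft evaluation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the operator set `BINARY_OPS`, the helper names (`gumbel_softmax`, `soft_leaf`, `soft_binary`, `soft_node`), and the way the expansion weight blends the leaf and internal-node values are all assumptions made for the example. A real implementation would use an automatic-differentiation library so that gradients flow through these mixtures.

```python
import math
import random

# Hypothetical operator set for illustration; the paper's set may differ.
BINARY_OPS = [
    ("add", lambda a, b: a + b),
    ("sub", lambda a, b: a - b),
    ("mul", lambda a, b: a * b),
]


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # clamp away from 0 to avoid log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_leaf(feature_weights, x):
    """Leaf value: convex combination of the input features."""
    return sum(w * xi for w, xi in zip(feature_weights, x))


def soft_binary(op_weights, left, right):
    """Internal value: convex combination of every operator's output."""
    return sum(w * op(left, right) for w, (_, op) in zip(op_weights, BINARY_OPS))


def soft_node(expand, op_weights, feature_weights, left, right, x):
    """Blend 'internal node' and 'leaf' by the soft expansion weight in [0, 1]."""
    internal = soft_binary(op_weights, left, right)
    leaf = soft_leaf(feature_weights, x)
    return expand * internal + (1.0 - expand) * leaf


rng = random.Random(0)
x = [2.0, 3.0]
op_w = gumbel_softmax([1.0, 0.5, -1.0], temperature=1.0, rng=rng)
left = soft_leaf([1.0, 0.0], x)   # picks x[0]
right = soft_leaf([0.0, 1.0], x)  # picks x[1]
value = soft_node(0.9, op_w, [0.5, 0.5], left, right, x)
print("operator weights:", [round(w, 3) for w in op_w], "node value:", round(value, 3))
```

With weights of 0.6 on "add" and 0.4 on "sub", `soft_binary` returns 0.6 times the sum plus 0.4 times the difference, which is the "60% plus, 40% minus" mixture from the clay metaphor.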

B. Probabilistic Modeling

VaSST formulates SR as a Bayesian inference problem:

Likelihood: A linear ensemble model where the response $y$ is a linear combination of $K$ symbolic trees plus Gaussian noise.
Priors:
- Regression Coefficients & Noise: Conjugate Normal-Inverse-Gamma (NIG) priors.
- Tree Structure: A hierarchical prior over the skeleton variables. Crucially, it employs a depth-dependent split probability ($p_\zeta = \alpha(1+d_\zeta)^{-\delta}$, where $d_\zeta$ is the depth of node $\zeta$) to penalize deep, complex trees, enforcing Occam's Razor (structural parsimony).
Inference: The authors use Black-Box Variational Inference (BBVI). They maximize the Evidence Lower Bound (ELBO) with a stochastic gradient-based optimizer (AdamW).
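
To make the effect of the depth-dependent split probability concrete, here is a small sketch that evaluates $p_\zeta = \alpha(1+d_\zeta)^{-\delta}$ at a few depths. The values `alpha=0.95` and `delta=1.0` are illustrative assumptions, not settings reported in the source, and `reach_probability` is a hypothetical helper that multiplies the split probabilities along one root-to-node path.

```python
def split_probability(depth, alpha=0.95, delta=1.0):
    """Prior probability that a node at the given depth splits (is internal)."""
    return alpha * (1.0 + depth) ** (-delta)


def reach_probability(depth, alpha=0.95, delta=1.0):
    """Prior probability that one path keeps splitting down to `depth`."""
    prob = 1.0
    for d in range(depth):
        prob *= split_probability(d, alpha, delta)
    return prob


for d in range(4):
    print(d, round(split_probability(d), 4), round(reach_probability(d), 4))
```

Because the split probability shrinks with depth, the prior mass on any single long path decays quickly, which is how the prior favors shallow, simple expressions.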
- The ELBO includes a Monte Carlo approximation of the marginal likelihood (integrating out regression parameters) and the Kullback-Leibler (KL) divergence between the variational posterior and the prior.
- Annealing: Temperature parameters in the relaxations are annealed from high (smooth mixtures) to low (near-discrete) during training to balance exploration and exploitation.
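
The annealing step can be illustrated with a Binary Concrete sample, the relaxation used for the expansion indicator. The sketch below is written under stated assumptions: the exponential schedule in `temperature`, its endpoints (2.0 and 0.1), and the helper `mean_distance_to_binary` are choices made for this example, since the source does not specify the schedule. It shows that samples drawn at a low temperature sit much closer to 0 or 1 than samples drawn at a high temperature.

```python
import math
import random


def temperature(step, total_steps, tau_start=2.0, tau_end=0.1):
    """Assumed exponential decay from tau_start to tau_end over training."""
    frac = step / max(total_steps - 1, 1)
    return tau_start * (tau_end / tau_start) ** frac


def binary_concrete(logit, tau, rng):
    """Relaxed Bernoulli sample in [0, 1]: sigmoid((logit + logistic noise) / tau)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    noise = math.log(u) - math.log(1.0 - u)  # Logistic(0, 1) noise
    z = (logit + noise) / tau
    if z >= 0:  # numerically stable sigmoid
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def mean_distance_to_binary(logit, tau, rng, n=2000):
    """Average distance of samples from the nearest of {0, 1}."""
    total = 0.0
    for _ in range(n):
        s = binary_concrete(logit, tau, rng)
        total += min(s, 1.0 - s)
    return total / n


rng = random.Random(0)
soft_gap = mean_distance_to_binary(0.4, temperature(0, 100), rng)
hard_gap = mean_distance_to_binary(0.4, temperature(99, 100), rng)
print("start of training:", round(soft_gap, 3), "end of training:", round(hard_gap, 3))
```

Early in training the samples are genuine mixtures, which keeps gradients informative; late in training they are nearly discrete, so the soft tree behaves almost like a hard one.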

C. Post-Optimization and Uncertainty Quantification

After optimizing the variational parameters ($\phi^*$):

Hard Tree Sampling: The continuous soft variables are sampled to generate $H$ discrete "hard" symbolic trees.
Ensemble Ranking: These $H$ candidates are evaluated, and the top structures are ranked by in-sample Root Mean Squared Error (RMSE).
Uncertainty: The distribution of these top candidates provides a principled measure of structural uncertainty (i.e., how confident the model is in specific parts of the equation).
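
The three steps above can be sketched on a toy problem with a single operator node. Everything specific here is an assumption for illustration: the operator table `OPS`, the learned probabilities in `op_probs`, the tiny dataset, and the use of sampling frequency as a stand-in for structural confidence. The source does not spell out how the candidate distribution is summarized, and the refitting of regression coefficients is omitted.

```python
import math
import random
from collections import Counter

# Hypothetical operator table; the real method samples whole trees.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def sample_hard_ops(op_probs, num_samples, rng):
    """Draw discrete operator choices from the learned categorical probabilities."""
    names = list(op_probs)
    weights = [op_probs[name] for name in names]
    return rng.choices(names, weights=weights, k=num_samples)


def rmse(op_name, data):
    """In-sample root mean squared error of the candidate y = x0 <op> x1."""
    op = OPS[op_name]
    squared = sum((y - op(a, b)) ** 2 for a, b, y in data)
    return math.sqrt(squared / len(data))


def rank_candidates(samples, data):
    """Return (structure, RMSE, sampling frequency), best RMSE first."""
    counts = Counter(samples)
    total = len(samples)
    ranked = sorted(counts, key=lambda name: rmse(name, data))
    return [(name, rmse(name, data), counts[name] / total) for name in ranked]


rng = random.Random(0)
data = [(1.0, 2.0, 2.0), (2.0, 3.0, 6.0), (3.0, 0.5, 1.5)]  # generated by y = x0 * x1
op_probs = {"+": 0.10, "-": 0.05, "*": 0.85}  # assumed optimized probabilities
samples = sample_hard_ops(op_probs, num_samples=200, rng=rng)
ranked = rank_candidates(samples, data)
for name, err, freq in ranked:
    print(f"x0 {name} x1  rmse={err:.3f}  frequency={freq:.2f}")
```

A structure that is both sampled often and ranked first by RMSE is one the model is confident in; a low-frequency runner-up with a similar RMSE signals structural ambiguity.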

3. Key Contributions

Scalable Probabilistic SR: VaSST is the first framework to apply variational inference with continuous relaxations to symbolic regression, enabling efficient gradient-based optimization over the symbolic space.
Soft Symbolic Trees: Introduces a novel representation that bridges the gap between discrete structural search and continuous optimization, preserving interpretability while allowing for differentiable learning.
Principled Uncertainty Quantification: Unlike heuristic methods that output a single equation, VaSST provides a posterior distribution over symbolic structures, allowing for the assessment of structural confidence.
Structural Parsimony: The depth-dependent prior effectively penalizes overly complex expressions, preventing overfitting and adhering to scientific principles of simplicity.

4. Experimental Results

The authors evaluated VaSST on synthetic data and the Feynman Symbolic Regression Database (SRBench), comparing it against state-of-the-art methods: QLattice, gplearn (GP), DEAP (GP), Bayesian Machine Scientist (BMS), and Bayesian Symbolic Regression (BSR).

Structural Recovery: VaSST consistently recovered the exact ground-truth symbolic expressions across various noise levels (including $\sigma^2 = 0.22$) for both synthetic functions and complex physics laws (e.g., Coulomb's Law, Fourier's Law of Heat Conduction).
- Comparison: Competing Bayesian methods (BMS, BSR) often failed to recover the correct structure or produced overly complex formulas. Neural/evolutionary methods (QLattice, gplearn) often succeeded in prediction but failed in structural simplicity.
Predictive Accuracy: VaSST achieved competitive or superior out-of-sample RMSE compared to all baselines, often matching BMS and QLattice but with significantly simpler expressions.
Computational Scalability: VaSST demonstrated superior runtime performance compared to MCMC-based Bayesian methods (BMS, BSR). As sample sizes increased, VaSST remained the fastest, highlighting the efficiency of the variational approach over sampling-based methods.
Robustness: The method remained stable under high noise levels, whereas other methods tended to degrade in structural recovery or produce erratic, complex formulas.

5. Significance

VaSST represents a paradigm shift in symbolic regression by moving away from stochastic search and MCMC sampling toward differentiable probabilistic modeling.

Scientific Discovery: It provides a robust tool for scientists to not only find equations but also quantify the uncertainty in the discovered laws, which is critical for high-stakes scientific domains.
Efficiency: By leveraging automatic differentiation and variational inference, it makes Bayesian symbolic regression scalable to modern data sizes, overcoming the computational bottlenecks of previous probabilistic approaches.
Interpretability: It maintains the "white-box" nature of symbolic regression, ensuring that the resulting models are human-interpretable mathematical expressions rather than black-box neural networks.

In summary, VaSST successfully combines the interpretability of symbolic regression with the scalability of modern variational inference, offering a state-of-the-art solution for data-driven scientific discovery.