The Big Picture: Teaching a Robot to Be a Better Coach
Imagine you are trying to teach a robot how to solve math puzzles (specifically, finding the hidden formula behind a set of data points). This is called Symbolic Regression.
To solve these puzzles, the robot uses a method called Genetic Programming. Think of this like a digital evolution:
- The robot creates thousands of random math formulas.
- It tests them to see which ones work best.
- It keeps the winners, mixes their "genes" (parts of the formulas), and creates a new generation.
- It repeats this process until it finds a formula that fits the data well.
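The loop above can be sketched in a few dozen lines of Python. This is a toy illustration of genetic programming for symbolic regression — not the paper's implementation — with formulas represented as nested tuples and all function names invented for this example:

```python
import random

# Formulas are nested tuples: ('+', a, b), ('*', a, b), the variable 'x',
# or an integer constant.
OPS = {'+': lambda a, b: a + b, '-': lambda a, b: a - b, '*': lambda a, b: a * b}

def random_tree(depth=2):
    """Create a random formula (step 1: thousands of random math formulas)."""
    if depth == 0 or random.random() < 0.3:
        return 'x' if random.random() < 0.5 else random.randint(1, 3)
    op = random.choice(list(OPS))
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if tree == 'x':
        return x
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def fitness(tree, data):
    """Step 2: test a formula — mean squared error, lower is better."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)

def crossover(a, b):
    """Step 3: mix 'genes' — replace a random subtree of a with part of b."""
    if not isinstance(a, tuple) or random.random() < 0.3:
        return b if not isinstance(b, tuple) else random.choice(b[1:])
    op, left, right = a
    if random.random() < 0.5:
        return (op, crossover(left, b), right)
    return (op, left, crossover(right, b))

def evolve(data, pop_size=50, generations=20):
    """Step 4: repeat — keep winners, breed a new generation."""
    pop = [random_tree() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: fitness(t, data))
        winners = pop[:pop_size // 2]          # the Selection step
        children = [crossover(random.choice(winners), random.choice(winners))
                    for _ in range(pop_size - len(winners))]
        pop = winners + children
    return min(pop, key=lambda t: fitness(t, data))
```

The `pop.sort(...)` line is exactly the Selection step the paper is about — the part the "Coach" controls.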
The Problem: In this process, there is a crucial step called Selection. The robot needs a "Coach" to decide which formulas get to reproduce and which get deleted.
- Old Way: Human experts had to manually design this Coach. They would say, "Pick the ones with the lowest error," or "Pick a mix of different types." This is hard, slow, and often misses the best strategies.
- New Way (This Paper): The authors used a Large Language Model (LLM)—like a super-smart AI that reads code—to automatically design the Coach itself.
They call this LLM-Meta-SR. Instead of just solving the math problem, the AI is solving the problem of how to solve math problems.
The Three Big Hurdles (and how they fixed them)
When they first tried to let the AI design the Coach, they ran into three major problems. Here is how they solved them using clever tricks:
1. The "Average Joke" Problem (Semantic Awareness)
The Issue: Imagine you have two athletes.
- Athlete A is amazing at running but terrible at swimming.
- Athlete B is terrible at running but amazing at swimming.
If you only look at their average score, they might look exactly the same. If the AI picks them as "parents" just because they have the same average score, the baby athlete might be mediocre at both.
The Fix: The authors taught the AI to look at the details (the "semantics"). Instead of just saying "Good job," the AI looks at where the athlete succeeded. It pairs the Running Specialist with the Swimming Specialist to create a "Super Athlete" who is good at everything. This is called Complementary Selection.
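A minimal sketch of that idea in Python — pick a first parent, find its weakest test case, then pick a second parent that is strong exactly there. The data layout and function name are our own, not the paper's code:

```python
def complementary_pair(population):
    """population: list of (name, per_case_errors). Returns two parents.

    Illustrative semantics-aware selection: instead of comparing only
    average scores, pair a parent with whoever covers its weak spot.
    """
    # Parent 1: lowest average error across all test cases.
    parent1 = min(population, key=lambda ind: sum(ind[1]) / len(ind[1]))
    # Find the test case where parent 1 performs worst.
    worst_case = max(range(len(parent1[1])), key=lambda i: parent1[1][i])
    # Parent 2: best on parent 1's weakest case (the complementary specialist).
    parent2 = min((ind for ind in population if ind is not parent1),
                  key=lambda ind: ind[1][worst_case])
    return parent1, parent2

# Two specialists and an all-rounder, all with the SAME average error (0.5):
runner     = ('runner',  [0.1, 0.9])   # great at case 0, bad at case 1
swimmer    = ('swimmer', [0.9, 0.1])   # the opposite
allrounder = ('average', [0.5, 0.5])
p1, p2 = complementary_pair([runner, swimmer, allrounder])
```

Average-based selection cannot tell these three apart; looking at the per-case errors (the "semantics") pairs the runner with the swimmer.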
2. The "Bloat" Problem (Code Bloat)
The Issue: AI models love to talk. When asked to write a simple instruction, they sometimes write a 10-page essay when a 1-page memo would do. In code, this is called Bloat. The AI would write selection rules that were thousands of lines long, filled with unnecessary steps. This made the system slow and hard to understand.
The Fix:
- The "Word Count" Rule: They told the AI in the prompt: "You must write this code in under 50 lines."
- The "Survival of the Fittest" Rule: When choosing which AI-generated coaches survive to the next round, they didn't just pick the smartest ones. They picked the ones that were smart AND short. If two coaches were equally smart, the shorter one won. This forced the AI to be concise and efficient.
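The "smart AND short" tie-break boils down to a two-key sort: quality first, code length second. A sketch with invented field names (not the paper's data structures):

```python
def select_survivors(candidates, k):
    """candidates: dicts with 'score' (higher is better) and 'lines' of code.

    Sort by score descending, then by length ascending, so equally smart
    coaches are ranked by conciseness. Python's sort compares the tuple keys
    element by element, which gives exactly this tie-break.
    """
    ranked = sorted(candidates, key=lambda c: (-c['score'], c['lines']))
    return ranked[:k]

coaches = [
    {'name': 'verbose', 'score': 0.90, 'lines': 400},
    {'name': 'concise', 'score': 0.90, 'lines': 45},  # same score, shorter
    {'name': 'weak',    'score': 0.70, 'lines': 30},
]
survivors = select_survivors(coaches, k=2)
# 'concise' ranks ahead of 'verbose'; 'weak' is dropped despite being shortest.
```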
3. The "Blank Page" Problem (Domain Knowledge)
The Issue: If you ask a general AI (like a smart chatbot) to design a sports coach, it might give you generic advice like "Run fast." It doesn't know the specific rules of Genetic Programming.
The Fix: They gave the AI a Cheat Sheet (a prompt with "Domain Knowledge"). They told the AI:
- "Remember, we need diversity (don't pick the same type of formula twice)."
- "Remember, we need simple formulas (easier to read)."
- "Remember, early in the game, be adventurous; later, be precise."
By feeding these expert rules into the AI's instructions, the AI could generate a much smarter Coach.
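A hypothetical sketch of how such a "cheat sheet" prompt might be assembled — the wording and structure here are our own illustration, not the paper's actual prompt:

```python
# Expert rules baked into the prompt as domain knowledge (paraphrased).
DOMAIN_HINTS = [
    "Maintain diversity: avoid selecting near-identical formulas as parents.",
    "Prefer simpler formulas when accuracy is comparable.",
    "Explore broadly in early generations; exploit precisely in later ones.",
]

def build_prompt(max_lines=50):
    """Combine the task, the anti-bloat length limit, and the cheat sheet."""
    hints = "\n".join(f"- {h}" for h in DOMAIN_HINTS)
    return (
        "Write a Python selection operator for genetic programming.\n"
        f"Constraints: the function must be at most {max_lines} lines.\n"
        f"Domain knowledge:\n{hints}\n"
    )

prompt = build_prompt()
```

Note how the line limit from the "Word Count" rule and the expert hints live in the same prompt — the LLM sees both every time it drafts a new Coach.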
The Results: The AI Beats the Humans
After training this system, the results were impressive:
- The "Omni" Coach: The AI designed a new selection strategy called Omni. When they tested it against 9 different coaches designed by human experts, Omni won almost every time.
- Better than the Best: They took the best existing math-solving algorithm (RAG-SR) and swapped its human-designed Coach with the AI-designed Omni Coach. The result? It became the best-performing algorithm out of 28 different methods tested on 116 different datasets.
- Interpretability: Not only was it more accurate, but the formulas it found were also smaller and simpler (less "bloat"). This means humans can actually read and understand the math formulas it discovered, which is a huge deal in science.
The Takeaway
This paper shows that, at least for this task, an AI can design algorithm components that outperform the ones human experts spent years crafting.
Think of it like this:
- Before: Humans spent years trying to build the perfect rulebook for a game.
- Now: We built an AI that reads the rulebook, realizes the rules are clunky, and rewrites the rulebook to be faster, fairer, and more effective—all by itself.
The authors show that by combining the creativity of AI with the specific rules of the field (Symbolic Regression), we can automate the hardest part of scientific discovery: figuring out how to discover.