Code World Models for Parameter Control in Evolutionary Algorithms

This paper demonstrates that Code World Models, in which Large Language Models synthesize Python simulators from suboptimal optimizer trajectories, can learn to control evolutionary algorithm parameters, matching near-optimal policies on challenging combinatorial optimization problems and outperforming existing baselines.

Camilo Chacón Sartori, Guillem Rodríguez Corominas

Published 2026-02-27

Imagine you are trying to teach a robot how to solve a maze. Usually, you have two options:

  1. The Rulebook: You give the robot a strict manual on how to turn left or right based on the wall colors.
  2. The Trial-and-Error: You let the robot run into walls thousands of times, hoping it eventually learns the pattern by accident.

This paper introduces a third, smarter option: The "Dream Simulator."

The researchers asked a question: Can we teach an AI (specifically a Large Language Model, or LLM) to watch a robot fail at solving a problem, understand why it failed, and then write its own "dream simulator" to figure out the perfect strategy?

Here is the breakdown of their method, "Code World Models" (CWM), using simple analogies.

1. The Setup: The Robot and the Maze

The "robot" is a standard evolutionary algorithm (a type of AI that evolves solutions). Its job is to find the best answer to a math problem.

  • The Problem: The robot has a "knob" it can turn (the parameter k). This knob controls how wildly it changes its solution.
    • Turn it low: It makes tiny, safe adjustments. Good for fine-tuning.
    • Turn it high: It makes huge, chaotic jumps. Good for escaping dead ends, but risky.
  • The Challenge: The robot doesn't know when to turn the knob up or down. If it turns it the wrong way, it gets stuck in a "deceptive valley"—a trap that looks like the top of a hill but is actually a pit.
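The robot-and-knob setup above can be sketched as a minimal (1+1) evolutionary algorithm on a bitstring problem, where k sets the per-bit mutation strength. This is an illustrative stand-in, not the paper's actual code; the function names and defaults are assumptions.

```python
import random

def onemax(x):
    """OneMax fitness: the number of 1-bits (to be maximised)."""
    return sum(x)

def mutate(x, k, rng):
    """Standard bit mutation: flip each bit independently with probability k/n.
    k is the 'knob' -- low k means tiny safe steps, high k means wild jumps."""
    n = len(x)
    return [b ^ (rng.random() < k / n) for b in x]

def run_ea(fitness, n=50, k=1, budget=5000, seed=0):
    """Minimal (1+1) EA: propose one mutant per step, keep it if it is
    at least as fit as the current solution."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    for _ in range(budget):
        y = mutate(x, k, rng)
        if fitness(y) >= fitness(x):
            x = y
    return x

best = run_ea(onemax)
print(onemax(best))  # a fixed low k climbs smooth hills well
```

A fixed knob setting like k=1 is fine on smooth problems; the whole point of the paper is deciding how to move this knob during the run.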

2. The Old Way vs. The New Way

  • The Old Way (Adaptive Rules): Traditional methods use simple rules like, "If you don't improve, turn the knob down." This works great on smooth hills but fails miserably in deceptive valleys because the robot keeps turning the knob down until it's stuck forever.
  • The New Way (Code World Models):
    1. Watch: The researchers let the robot run with random settings, collecting a bunch of "failed" or "sub-optimal" attempts.
    2. Ask the Oracle: They feed these failed attempts to a super-smart AI (the LLM) and say, "Look at these mistakes. Can you write a Python program that predicts what will happen if we change the knob?"
    3. The Magic: The LLM doesn't just guess; it writes a simulator. It creates a piece of code that acts like a crystal ball.
    4. The Plan: Before the robot makes a move in the real world, it asks the simulator: "If I turn the knob to 5, what happens? If I turn it to 10, what happens?" The simulator answers instantly. The robot then picks the best move.
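The watch/ask/plan loop can be sketched as model-based control with a learned lookup table, roughly the flavor of simulator the paper says the LLM writes for rugged landscapes. Everything here is hypothetical scaffolding: in the paper, the simulator itself is Python code synthesized by the LLM from logged trajectories, not a hand-built table.

```python
from collections import defaultdict

def fit_table_model(trajectories):
    """Watch: learn a lookup-table 'simulator' from logged runs by averaging
    the observed fitness gain for each (state, knob) pair."""
    sums, counts = defaultdict(float), defaultdict(int)
    for state, k, gain in trajectories:
        sums[(state, k)] += gain
        counts[(state, k)] += 1
    return {key: sums[key] / counts[key] for key in sums}

def plan_k(model, state, candidates):
    """Plan: before acting in the real world, query the simulator for each
    candidate knob value and act with the one predicting the largest gain."""
    return max(candidates, key=lambda k: model.get((state, k), 0.0))

# Toy log of sub-optimal attempts: while climbing, a small knob helped;
# when stuck, only a big knob ever produced progress.
log = [("climbing", 1, 0.8), ("climbing", 8, -0.5),
       ("stuck", 1, 0.0), ("stuck", 8, 0.6)]
model = fit_table_model(log)
print(plan_k(model, "climbing", (1, 8)))  # -> 1
print(plan_k(model, "stuck", (1, 8)))     # -> 8
```

The key property is that the simulator is queried instantly and repeatedly before each real (expensive) step, which is what makes the planning cheap.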

3. The Results: Beating the Traps

The researchers tested this on four different types of "mazes":

  • The Smooth Hills (LeadingOnes & OneMax): These are easy problems. The new method learned the perfect strategy just by watching the failures. It performed almost as well as the theoretical "perfect" strategy, even though it never saw the perfect strategy during training.
  • The Deceptive Valley (Jump_k): This is the big win. In this maze, the robot gets stuck in a pit. Traditional rules say, "You're stuck, so be more careful (turn knob down)." This leads to a 0% success rate.
    • The CWM Solution: The simulator realized, "Hey, being careful won't work here. We need to jump hard to get out!" It figured out the exact jump size needed to escape the trap.
    • Result: While other methods failed 100% of the time, this method succeeded 100% of the time.
  • The Rugged Mountain (NK-Landscape): This is a messy, chaotic maze with no clear rules. The LLM couldn't use math formulas here. Instead, it looked at a table of "what happened when we tried X." It learned the pattern purely from data and still beat all other methods.
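The deceptive valley has a classic textbook form. In Jump_k, fitness rewards more 1-bits until you are k bits from the optimum, then punishes every further cautious step, so only a jump of exactly k bits escapes. A sketch using the standard definition (not code from the paper):

```python
def jump(x, k):
    """Jump_k fitness: like OneMax with a valley just before the optimum.
    Strings with between n-k+1 and n-1 ones score *worse* the closer
    they get -- the trap that defeats 'be more careful' rules."""
    n, ones = len(x), sum(x)
    if ones == n or ones <= n - k:
        return k + ones
    return n - ones  # inside the deceptive valley

n, k = 10, 3
local_opt  = [1] * (n - k) + [0] * k            # the rim of the pit
one_step   = [1] * (n - k + 1) + [0] * (k - 1)  # one careful bit closer
global_opt = [1] * n
print(jump(local_opt, k), jump(one_step, k), jump(global_opt, k))  # 10 2 13
```

From the rim, every single-bit improvement looks catastrophic (fitness drops from 10 to 2), so a greedy rule turns the knob down and freezes; flipping all k remaining zero-bits at once jumps straight to 13.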

4. Why This is a Big Deal

  • It's Cheaper: The new method learned from 200 "offline" attempts. A competing AI (a Deep Q-Network, or DQN) needed 500 "online" attempts (where the AI actually has to run the simulation, which is slow and expensive) and still failed more often.
  • It's Transparent: Instead of a "black box" neural network where you don't know why it made a decision, the LLM wrote actual Python code. You can read the code and see exactly how it decided to turn the knob. It's like the AI wrote its own instruction manual.
  • It Generalizes: The robot learned the strategy for a specific maze size, and when they made the maze bigger or changed the rules slightly, the robot still knew what to do. It didn't just memorize; it understood the logic.

The Bottom Line

This paper shows that we don't need to hand-code every rule for AI. Instead, we can show an AI some examples of things going wrong, ask it to write a simulator to understand the physics of the problem, and then let that simulator guide the AI to success.

It's like giving a chess player a book of their own past losses, asking them to write a new rulebook based on those losses, and then having them play a perfect game using that new rulebook. The result? They beat the experts.
