Imagine you are teaching a brilliant but inexperienced apprentice chef how to cook a complex meal.
In the past, we taught these AI chefs (Large Language Models) by showing them millions of recipes and asking them to write down the ingredients and steps. They got very good at writing the recipe, but they often failed when they actually tried to cook the dish. They couldn't "taste" the food as they cooked it to see if it was too salty or if the sauce burned. They were blind to their own mistakes until the customer (the test) sent the dish back.
This paper introduces a new training method called Self-Execution Simulation. It's like teaching the apprentice to mentally simulate the cooking process before they even touch the stove.
Here is how the paper breaks down, using simple analogies:
1. The Problem: The "Blind Chef"
Current AI coding models are great at generating code (writing the recipe), but they are terrible at predicting what that code will actually do when it runs.
- The Analogy: If you ask the AI to write a program that adds two numbers, it might write code that accidentally subtracts them. The AI doesn't "know" it made a mistake because it hasn't actually "run" the code in its head. It just guesses the output based on patterns it saw in other recipes.
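The gap described above can be made concrete with a tiny, invented example: a function whose name and shape suggest addition, so a pattern-matcher would guess one output, while actually running it reveals another.

```python
# Hypothetical illustration: code that "looks like" addition but isn't.
# A model guessing from surface patterns (the function name, the two
# arguments) would predict add(2, 3) == 5; execution says otherwise.

def add(a, b):
    # Bug: subtracts instead of adding, despite the name.
    return a - b

predicted_by_pattern = 5     # what a name-based guess would say
actual = add(2, 3)           # what actually happens when the code runs

print(predicted_by_pattern, actual)  # 5 -1
```

Without some notion of "running the code in its head," the model has no way to notice that these two numbers disagree.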
2. The Solution: The "Mental Rehearsal"
The researchers taught the AI to act like a mental simulator. Instead of just writing code, the AI now learns to pause and say, "Wait, if I run this line of code with this input, the variable x will become 5, then the loop will run 3 times, and the final result will be 'Hello'."
They did this in two steps:
- Step A: The Storyteller (Supervised Fine-Tuning): They took real code that had already been run by computers, recorded every single step of what happened (the "execution trace"), and asked the AI to translate that technical log into a simple story.
- Analogy: Instead of just looking at a spreadsheet of numbers, the AI reads a storybook that says, "First, the chef chopped the onions. Then, the pan got hot. Then, the onions turned golden." This teaches the AI the logic of cause and effect.
- Step B: The Game Master (Reinforcement Learning): They turned prediction into a game. They showed the AI a piece of code and an input and asked, "What is the output?" If the AI's prediction matched the real output, it earned a reward; if it didn't, it was penalized.
- Analogy: This is like a cooking competition where the AI has to predict the taste of the dish before tasting it. Over time, it gets really good at predicting the outcome without needing a real taste test.
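The two steps above can be sketched in miniature. This is not the paper's pipeline, just an illustrative Python toy: Step A records a real execution trace of a small function (using `sys.settrace`) and renders it as plain-language lines an LLM could be fine-tuned on; Step B scores an output prediction with a binary reward. The function `triple_greet`, the story format, and the reward values are all invented for illustration.

```python
import sys

# Step A (sketch): record an execution trace, then turn it into a "story".

def triple_greet(name):
    out = ""
    for _ in range(3):
        out += name
    return out

trace_log = []

def tracer(frame, event, arg):
    # On each executed line inside triple_greet, snapshot the local variables.
    if event == "line" and frame.f_code.co_name == "triple_greet":
        trace_log.append((frame.f_lineno, dict(frame.f_locals)))
    return tracer

sys.settrace(tracer)
result = triple_greet("Hi")
sys.settrace(None)

# The "storybook" version of the technical log: one sentence per step.
story = [f"At line {ln}, the variables were {lv}" for ln, lv in trace_log]

# Step B (sketch): the prediction game's reward signal.
def reward(predicted_output, code_fn, inp):
    # +1 if the model's predicted output matches reality, -1 otherwise.
    return 1 if predicted_output == code_fn(inp) else -1

print(result)                                # HiHiHi
print(reward("HiHiHi", triple_greet, "Hi"))  # 1
print(reward("Hi", triple_greet, "Hi"))      # -1
```

In training, the model never gets to call `triple_greet` for real; it only sees the code and input, and the reward tells it whether its imagined run matched the true one.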
3. The Superpowers: How the AI Uses This Skill
Once the AI learned to "run code in its head," the researchers gave it two new superpowers to solve problems better:
Superpower A: The "Quality Control Inspector" (Self-Verification)
Imagine the AI is asked to write 10 different solutions to a math problem.
- Old Way: The AI just picks the first one it wrote, hoping it's right.
- New Way: The AI writes 10 solutions. Then, it acts as its own inspector. It mentally "runs" all 10 solutions against the test cases. It sees that Solution #3 crashes and Solution #7 gives the wrong answer. It picks Solution #10 because its mental simulation says, "This one will pass."
- Result: The AI filters out its own bad ideas before submitting them, significantly increasing its success rate.
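The inspection step can be sketched as a filter over candidate solutions. In this toy, real Python execution stands in for the model's *mental* simulation, and the candidates and test cases are made up; the point is the selection logic: reject anything that crashes or gives a wrong answer, keep the first candidate whose simulated run passes everything.

```python
# Sketch of self-verification (best-of-n filtering). The task here is
# invented: "square the input". One candidate is wrong, one crashes,
# one is correct.

candidates = [
    lambda x: x * x + 1,    # wrong answer
    lambda x: 1 / (x - x),  # crashes (ZeroDivisionError)
    lambda x: x * x,        # correct
]

test_cases = [(2, 4), (3, 9), (1, 1)]

def simulate_passes(fn, tests):
    # Stand-in for the model mentally "running" a candidate on each test.
    for inp, expected in tests:
        try:
            if fn(inp) != expected:
                return False    # wrong output: reject this candidate
        except Exception:
            return False        # crash during the simulated run: reject
    return True

chosen = next(fn for fn in candidates if simulate_passes(fn, test_cases))
print(chosen(5))  # 25
```

The filtering itself is ordinary code; what the paper's training adds is the model's ability to play the role of `simulate_passes` without any interpreter at all.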
Superpower B: The "Iterative Fixer" (Self-RLEF)
Imagine the AI writes a piece of code, and it fails a test.
- Old Way: The AI might just try to rewrite the whole thing from scratch, often making the same mistake.
- New Way: The AI simulates the failure. It sees, "Oh, I see! When the input is '5', my code tries to divide by zero." It then says, "Aha! I need to add a check to prevent division by zero." It fixes just that part and re-simulates to make sure the fix works.
- Result: It acts like a human debugger, fixing errors step-by-step based on the "ghost" of the error it simulated, rather than needing a real computer to crash and tell it what went wrong.
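The simulate-then-fix loop looks roughly like the following toy, where the division-by-zero example and helper names are invented, and a `try`/`except` plays the part of the model's imagined crash: run the failing input "in your head," read off the error, apply a targeted patch, and re-simulate before submitting.

```python
# Sketch of the iterative fixer. The buggy/fixed functions are made up.

def buggy_divide(total, count):
    return total / count        # crashes when count == 0

def simulate(fn, *args):
    # Stand-in for mental simulation: report the outcome of a run
    # without a real test harness telling us what went wrong.
    try:
        return ("ok", fn(*args))
    except ZeroDivisionError as err:
        return ("error", str(err))

status, detail = simulate(buggy_divide, 10, 0)
print(status)  # "error" -- the simulated ghost of the crash

# Targeted fix: add the missing guard, leave the rest of the logic alone.
def fixed_divide(total, count):
    if count == 0:
        return 0.0              # guard against division by zero
    return total / count

# Re-simulate to confirm the fix before "submitting".
print(simulate(fixed_divide, 10, 0))  # ('ok', 0.0)
print(simulate(fixed_divide, 10, 2))  # ('ok', 5.0)
```

The key difference from the "old way" is that only the guard is added; the working parts of the code are never rewritten.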
4. Why This Matters
Usually, to check if code works, you have to actually run it on a computer. This takes time, requires setting up complex environments, and can be expensive (like renting a kitchen for hours to test a recipe).
By teaching the AI to simulate the execution in its head:
- It's Faster: No need to wait for a computer to run the code.
- It's Cheaper: No need for expensive server setups.
- It's Smarter: The AI learns to reason about why code works, not just what code looks like.
The Bottom Line
This paper shows that if you teach an AI to "imagine" the consequences of its code (like a chess player imagining future moves), it becomes much better at writing code that actually works. It moves the AI from being a parrot that repeats patterns to a reasoner that understands the dynamics of the programs it creates.
In short: They taught the AI to "think before it speaks," and the result is code that is far less likely to crash and far more likely to solve the problem correctly.