FactorSmith: Agentic Simulation Generation via Markov Decision Process Decomposition with Planner-Designer-Critic Refinement

Imagine you want to build a complex video game just by describing it in plain English. You say, "Make a game where a bird flies through pipes, avoiding obstacles."

In the past, asking a super-smart AI (a Large Language Model or LLM) to do this was like asking a single, brilliant architect to design, build, and inspect an entire skyscraper in one breath. The AI would get overwhelmed, forget details, invent fake blueprints that don't exist, or accidentally knock down a wall it wasn't supposed to touch.

FactorSmith is a new framework that solves this by changing how the AI works. It combines two powerful ideas: breaking the job into tiny pieces and using a team of specialized workers to check the work.

Here is how it works, using a simple analogy:

1. The Problem: The Overwhelmed Architect

If you ask a single AI to write the code for a whole game at once, it gets confused. It's like trying to read a 1,000-page instruction manual while trying to write a new chapter. It starts hallucinating (making things up) or missing crucial steps.

2. The Solution: The "Factory Line" Approach

FactorSmith treats game creation like a high-tech assembly line with two main strategies:

Strategy A: The "Focus Lens" (Factored Decomposition)

Instead of showing the AI the entire game code at once, FactorSmith breaks the game down into tiny, manageable modules.

The Analogy: Imagine you are building a house. Instead of handing the carpenter the blueprints for the roof, the plumbing, the electrical wiring, and the landscaping all at once, you give them only the blueprint for the kitchen cabinets.
How it works: The system looks at the game and says, "Okay, for this specific step, we only need to worry about the 'gravity' variable and the 'ball' object." It hides everything else. This keeps the AI's "attention span" focused on exactly what it needs to do right now, preventing confusion.

Strategy B: The "Three-Person Review Team" (Planner-Designer-Critic)

Once the AI is focused on a tiny piece (like the kitchen cabinets), it doesn't just write the code and move on. It uses a team of three virtual agents to refine the work:

The Designer (The Builder): This agent writes the code for the specific piece.
The Critic (The Inspector): This agent looks at the code and gives it a score. It doesn't just say "good" or "bad"; it gives a structured report: "The logic is 8/10, but you forgot to handle the case where the ball hits the wall at a weird angle. That's a 4/10."
The Planner (The Manager): This agent listens to the Critic.
- If the score is high enough, the Manager says, "Great, let's move to the next room."
- If the score is low, the Manager says, "Go back, Designer. Fix the wall angle issue."
- The Safety Net: If the Designer tries to fix it but makes it worse, the Manager has a "Time Machine" (Checkpoint Rollback). It instantly reverts the code to the last good version, so the piece being built never ends up worse than its best version so far.

3. Why This is a Game Changer

The paper tested FactorSmith on making 8 different 2D games (like Flappy Bird and Snake).

Old Way (Single Shot): The AI tries to do it all at once. It often crashes or creates broken games.
Old Way (Just Breaking It Down): FactorSim (the previous method) broke the game into pieces but didn't have the "Inspector" team. If the AI made a mistake in a small piece, the mistake stayed there.
FactorSmith (The New Way): It breaks the game down and uses the Inspector team to fix mistakes before moving on.

The Result:

Fewer Crashes: The games actually run without breaking.
Better Alignment: The game matches what you asked for more closely.
Smarter Code: It handles tricky situations (like a ball hitting a wall at a weird angle) much better.

The Bottom Line

FactorSmith is like upgrading from a solo artist trying to paint a masterpiece in one sitting, to a specialized construction crew.

They break the house into rooms (Context Reduction).
They build one room at a time.
They have a strict inspector and a manager who check every room and send it back for fixes before they even think about the next one.

This allows us to generate complex, playable video games from simple text descriptions with a level of reliability we haven't seen before.

1. Problem Statement

Generating executable simulations (e.g., game code) from natural language specifications using Large Language Models (LLMs) faces two primary challenges:

Context Overload: LLMs struggle with large, interconnected codebases. When presented with the entire codebase, they often hallucinate non-existent functions, ignore specification details, or modify unrelated code.
Lack of Self-Correction: Existing approaches like FactorSim decompose tasks to reduce context but rely on "single-shot" generation. If the initial output is flawed, there is no robust mechanism for iterative self-correction beyond simple retry loops. Conversely, agentic frameworks like SceneSmith improve quality through iteration but do not leverage structural decomposition to manage context windows effectively.

Goal: Create a framework that simultaneously reduces the reasoning burden per generation step (via decomposition) and ensures high-quality output through iterative, multi-agent refinement.

2. Methodology: FactorSmith

FactorSmith unifies Factored POMDP Decomposition with a Planner–Designer–Critic Agentic Workflow.

A. Factored POMDP Decomposition (Context Reduction)

Drawing from FactorSim, the simulation is modeled as a Factored Partially Observable Markov Decision Process (POMDP).

Decomposition: The simulation specification is broken into modular steps ( $q_1, \dots, q_K$ ).
Scope Selection: For each step, the system identifies a minimal subset of relevant state variables ( $S[Z_k]$ ) and functions.
Benefit: Instead of feeding the entire codebase to the LLM, only the "scoped context" relevant to the current step is provided. This drastically reduces the token count and reasoning load.
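The scoping idea above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Codebase` class, the `scoped_context` helper, and the Flappy Bird style variable names are all hypothetical, and the scope is picked by hand here, whereas the framework selects it automatically for each step.

```python
from dataclasses import dataclass, field


@dataclass
class Codebase:
    """Toy stand-in for the simulation so far (hypothetical structure)."""
    state: dict = field(default_factory=dict)      # state variable name -> value
    functions: dict = field(default_factory=dict)  # function name -> source text


def scoped_context(codebase, state_names, function_names):
    """Keep only the state variables and functions named in the scope."""
    return Codebase(
        state={k: v for k, v in codebase.state.items() if k in state_names},
        functions={k: v for k, v in codebase.functions.items() if k in function_names},
    )


# Hypothetical full simulation state and function table.
game = Codebase(
    state={"bird_y": 120, "bird_vy": 0.0, "gravity": 0.5, "pipes": [], "score": 0},
    functions={
        "apply_gravity": "def apply_gravity(state): ...",
        "draw_score": "def draw_score(state): ...",
    },
)

# Scope for a step such as "add gravity": only what that step needs (assumed by hand).
ctx = scoped_context(game, {"bird_y", "bird_vy", "gravity"}, {"apply_gravity"})
print(sorted(ctx.state), sorted(ctx.functions))
```

Only `ctx`, not `game`, would be placed in the prompt for that step, which is where the reduction in tokens and reasoning load comes from.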

B. Planner–Designer–Critic Agentic Refinement

Within each modular step, FactorSmith replaces the single LLM call with a three-agent interaction loop:

Planner (Orchestrator): Manages the workflow, tracks scores, and decides whether to accept the artifact, request a revision, or roll back to a previous checkpoint.
Designer: Proposes code artifacts (functions, state variables) based on the scoped context and step instructions.
Critic: Evaluates the Designer's output against domain-specific rubrics (correctness, completeness, state usage, code quality). It produces structured scalar scores (0–10) and natural language feedback.
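To make the Critic's structured output concrete, here is a minimal sketch of what such a report could look like. The four rubric names come from the description above; the `CriticReport` class, the `delta` helper, and the choice to sum the four scores into a total are assumptions made for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass

# Rubric dimensions named in the text; the field spellings are assumed.
RUBRIC = ("correctness", "completeness", "state_usage", "code_quality")


@dataclass(frozen=True)
class CriticReport:
    correctness: int
    completeness: int
    state_usage: int
    code_quality: int
    feedback: str = ""

    def __post_init__(self):
        # Each rubric score is a scalar in the range 0-10.
        for name in RUBRIC:
            value = getattr(self, name)
            if not 0 <= value <= 10:
                raise ValueError(f"{name} must be in 0..10, got {value}")

    @property
    def total(self):
        # Assumption: the total is the plain sum of the rubric scores.
        return sum(getattr(self, name) for name in RUBRIC)


def delta(previous, current):
    """Per-rubric score change between two refinement rounds."""
    return {name: getattr(current, name) - getattr(previous, name) for name in RUBRIC}


round1 = CriticReport(8, 4, 7, 6, "Missing wall-collision edge case.")
round2 = CriticReport(8, 7, 7, 6)
print(round1.total, round2.total, delta(round1, round2))
```

Because the scores are plain numbers rather than free-form text, the Planner can compare rounds programmatically instead of parsing prose.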

The Iterative Loop:

The Designer generates an initial artifact.
The Critic scores it.
If the score is below a threshold ( $\tau$ ), the Planner instructs the Designer to revise based on the Critic's feedback.
Checkpoint Rollback: If a revision results in a lower total score than the previous best, the Planner triggers a rollback to the previous checkpoint, preventing "score regression."
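The loop above can be sketched as follows. This is a simplified illustration under stated assumptions: `refine`, `toy_designer`, and `toy_critic` are hypothetical names, the Designer and Critic are deterministic stubs rather than LLM calls, a round limit (`max_rounds`) is added so the loop always terminates, and after a rollback the Designer is handed the feedback attached to the restored checkpoint.

```python
def refine(designer, critic, tau=8.0, max_rounds=4):
    """Planner loop: accept, request revision, or roll back to the best checkpoint."""
    best_artifact, best_score, best_feedback = None, float("-inf"), None
    history = []
    for _ in range(max_rounds):
        # The Designer always revises the current checkpoint, never a rejected draft.
        candidate = designer(best_artifact, best_feedback)
        score, feedback = critic(candidate)
        history.append(score)
        if score < best_score:
            continue  # rollback: discard the candidate, keep the checkpoint
        best_artifact, best_score, best_feedback = candidate, score, feedback
        if best_score >= tau:
            break  # accept: score reached the threshold
    return best_artifact, best_score, history


# Deterministic stubs standing in for the LLM-backed agents (hypothetical data).
drafts = iter(["v1", "v2", "v3"])
scores = {"v1": 6.0, "v2": 4.0, "v3": 9.0}


def toy_designer(previous, feedback):
    return next(drafts)


def toy_critic(artifact):
    return scores[artifact], f"scored {scores[artifact]}"


artifact, score, history = refine(toy_designer, toy_critic, tau=8.0)
print(artifact, score, history)
```

In this run the second draft scores lower than the first, so it is discarded and the first draft stays as the checkpoint; the third draft clears the threshold and is accepted.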

C. Pipeline Architecture

The generation process follows three phases:

High-Level Decomposition: The prompt is split into modular steps (e.g., "add gravity," "add collision").
Factored Step Execution: For each step, the system:
- Selects the relevant state scope.
- Decomposes the step into Model-View-Controller (MVC) components: Input Logic (controller), State Transition (model), and UI Rendering (view).
- Executes the Planner–Designer–Critic loop for each component.
Assembly & Validation: All generated functions are assembled into a complete executable simulation and validated.
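The three phases can be laid out as a small driver function. This is only a structural sketch: `run_pipeline` and the three callables passed into it are hypothetical stand-ins for the LLM-backed stages, and "validation" is reduced here to a syntax check with Python's built-in `compile`, which is an assumption rather than the paper's validation procedure.

```python
COMPONENTS = ("input_logic", "state_transition", "ui_rendering")


def run_pipeline(spec, decompose, select_scope, refine_component):
    functions = {}  # function name -> source text
    for step in decompose(spec):                    # Phase 1: modular steps
        scope = select_scope(step, functions)       # Phase 2a: relevant scope
        for component in COMPONENTS:                # Phase 2b: MVC components
            name, source = refine_component(step, component, scope)  # Phase 2c
            functions[name] = source
    program = "\n\n".join(functions.values())       # Phase 3: assembly
    compile(program, "<simulation>", "exec")        # Syntax-only validation (assumed)
    return program


# Toy stage implementations so the sketch runs end to end.
def toy_decompose(spec):
    return [part.strip() for part in spec.split(";")]


def toy_select_scope(step, functions):
    return sorted(functions)  # pretend every existing function is in scope


def toy_refine_component(step, component, scope):
    name = f"{step.replace(' ', '_')}_{component}"
    return name, f"def {name}(state):\n    return state"


program = run_pipeline(
    "add gravity; add collision", toy_decompose, toy_select_scope, toy_refine_component
)
print(program.count("def "))
```

In a real run, `refine_component` is where the Planner–Designer–Critic loop would execute for each component.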

3. Key Contributions

Unified Framework: Formalizes the combination of factored POMDP decomposition (for context reduction) and agentic refinement (for quality control) within a single pipeline.
Mathematical Formalization: Provides a theoretical framework showing how agentic refinement composes with factored context selection, including proofs for Quality Monotonicity (ensuring that accepted checkpoints do not degrade in quality).
Structured Scoring Mechanism: Introduces a domain-specific, structured scoring system (replacing free-form text or generated test cases) that allows for precise, programmatic evaluation and delta tracking between refinement rounds.
Open-Source Implementation: Releases a Python implementation built on the OpenAI Agents SDK, featuring SQLite-backed session management for state persistence and rollback.
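One way to state the Quality Monotonicity property, using notation assumed here rather than taken from the paper: let $s_t$ be the Critic's total score for the candidate produced in round $t$, and let $b_t$ be the score of the checkpoint the Planner holds after round $t$. With rollback, the checkpoint score evolves as:

```latex
b_1 = s_1, \qquad
b_{t+1} =
\begin{cases}
  s_{t+1} & \text{if } s_{t+1} \ge b_t \quad \text{(new checkpoint)} \\
  b_t     & \text{otherwise} \quad \text{(rollback)}
\end{cases}
\qquad \Longrightarrow \qquad
b_{t+1} = \max(b_t,\, s_{t+1}) \ge b_t .
```

Under this reading, the held checkpoint's score is non-decreasing across refinement rounds, even though an individual candidate score $s_t$ may drop.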

4. Experimental Results

The framework was evaluated on the PyGame Learning Environment (PLE) benchmark, consisting of 8 diverse 2D RL games (e.g., Flappy Bird, Snake, Pong).

Performance: FactorSmith achieved the highest System Test Pass Rate across all games, outperforming all baselines.
- Compared to the strongest baseline (FactorSim), FactorSmith showed consistent gains (e.g., +8% on Catcher, +8% on Waterworld).
- It significantly outperformed non-factored agentic approaches (like AgentCoder) and single-shot methods.
Ablation Study:
- Removing the Critic: Performance dropped by ~7%, confirming the value of iterative evaluation.
- Removing Rollback: Caused a consistent degradation, indicating that the safety mechanism helps prevent quality loss during refinement.
- Removing Factorization: Using full context with agentic refinement caused the largest drop (~12%), suggesting that context reduction is the most important component.
Token Efficiency: While FactorSmith uses more tokens than single-pass FactorSim due to multi-round refinement, it is more efficient than CoT + Self-Debug because the scoped context is smaller, and structured scoring allows for earlier termination than blind retry loops.

5. Significance and Conclusion

FactorSmith addresses the dual limitations of current LLM-based code generation: context limits and lack of iterative quality control.

Complementarity: The paper demonstrates that decomposition and refinement are complementary. Decomposition ensures the LLM isn't overwhelmed by irrelevant data, while the agentic trio catches subtle errors (off-by-one, missing edge cases) that a single-shot call would miss.
Robust Evaluation: By using structured scoring rubrics instead of generated test cases (which are error-prone in complex simulations), FactorSmith provides a more stable and reliable feedback loop.
Future Impact: The framework offers a modular architecture for generating complex interactive systems. Future work aims to incorporate execution-based feedback, extend to 3D robotics simulations, and utilize specialized smaller models to reduce costs.

In summary, FactorSmith represents a significant step forward in automated simulation generation. Its results indicate that combining structural problem decomposition with multi-agent deliberation yields better code quality and robustness.
