Here is an explanation of the paper "The World Won't Stay Still" using simple language and creative analogies.
The Big Idea: The World Changes, But Our Tests Don't
Imagine you are teaching a robot chef how to cook.
- The Old Way: You give the robot a recipe and a kitchen with a fixed set of tools (a knife, a pan, an oven). You test the robot once. If it cooks the meal, it gets an "A."
- The Problem: In the real world, kitchens change! Sometimes the oven breaks, sometimes a new smart-fridge is installed, and sometimes the recipe changes because you, the chef, decide you want less salt. If you only test the robot in the original kitchen, you don't know if it can handle a broken oven or a new tool.
This paper argues that current tests for AI agents are too static. They treat the world like a frozen photograph, but the real world is a moving movie. The authors want to build a way to test AI agents in a kitchen that changes while the robot is cooking.
The Solution: "ProEvolve" (The Shape-Shifting Kitchen)
The authors built a framework called ProEvolve. Think of this as a magical blueprint for a kitchen that can rewrite its own rules while the robot is working.
1. The Blueprint: The "Graph"
Instead of just listing tools and data, the authors map the entire environment as a connected web (a graph).
- Analogy: Imagine a subway map. The stations are the data (like "User Name" or "Order ID"), and the train lines are the tools (like "Check Order" or "Cancel Ticket").
- Why it matters: In a normal test, if you remove a station, the map breaks. In ProEvolve, because everything is connected on a map, the system knows exactly which lines are affected if a station closes. It keeps the whole system coherent.
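To make the subway-map idea concrete, here is a minimal Python sketch of an environment graph. The class and method names (`EnvironmentGraph`, `affected_tools`) and the example tools are hypothetical illustrations, not the paper's actual implementation: data nodes are the "stations" and tools are the "train lines" that connect them, so removing a station immediately reveals which lines break.

```python
# Hypothetical sketch of an environment-as-graph, in the spirit of the
# subway-map analogy. Names are illustrative, not from the paper.
class EnvironmentGraph:
    def __init__(self):
        self.data_nodes = set()  # "stations", e.g. "Customer", "OrderID"
        self.tools = {}          # "train lines": tool name -> (inputs, outputs)

    def add_data(self, name):
        self.data_nodes.add(name)

    def add_tool(self, name, inputs, outputs):
        # A tool is an edge bundle: it reads some data nodes, writes others.
        self.tools[name] = (set(inputs), set(outputs))

    def affected_tools(self, data_node):
        # Because everything lives on one graph, closing a "station"
        # tells us exactly which "lines" (tools) are affected.
        return [t for t, (ins, outs) in self.tools.items()
                if data_node in ins or data_node in outs]

env = EnvironmentGraph()
env.add_data("Customer")
env.add_data("OrderID")
env.add_tool("CheckOrder", inputs=["Customer"], outputs=["OrderID"])
print(env.affected_tools("OrderID"))  # -> ['CheckOrder']
```

This is what "keeps the whole system coherent": any change is a graph edit, and its consequences can be read straight off the edges.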
2. The Three Magic Moves (Evolution Strategies)
The system can change the environment in three specific ways, just like a real business evolves:
- Completion (Adding New Rooms): The system adds new features.
- Analogy: The restaurant decides to start serving breakfast. The system automatically adds a "Breakfast Menu" station and a "Coffee Maker" tool to the subway map, connecting them to the existing "Customer" station.
- Saturation (Building Express Lanes): The system finds shortcuts.
- Analogy: Currently, to get a customer's order history, the robot has to take three different train lines. The system notices this is slow and builds a "Direct Express Line" (a new tool) that goes straight from "Customer" to "Order History."
- Deprecation (Closing Stations): The system removes old features.
- Analogy: The restaurant decides to stop serving soup. The system removes the "Soup Station" and the "Soup Pot" tool from the map. The robot must now figure out how to handle a customer asking for soup without breaking the system.
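The three moves above can be sketched as three kinds of graph edits. Again, this is a hypothetical illustration, assuming the same simple tool-dictionary shape as before; the function and tool names are invented for the analogy, not taken from ProEvolve itself.

```python
# Hypothetical sketch: the three evolution strategies as graph edits.
# A tool maps its name to (input data nodes, output data nodes).

def completion(tools, name, inputs, outputs):
    """Add a new feature: a new tool wired into the existing graph."""
    tools[name] = (set(inputs), set(outputs))

def saturation(tools, name, start, end):
    """Add a shortcut: a direct 'express line' between two data nodes."""
    tools[name] = ({start}, {end})

def deprecation(tools, name):
    """Remove an old feature; agents must now cope without it."""
    tools.pop(name, None)

tools = {"SoupStation": ({"Customer"}, {"SoupOrder"})}
completion(tools, "BreakfastMenu", ["Customer"], ["CoffeeOrder"])
saturation(tools, "OrderHistoryExpress", "Customer", "OrderHistory")
deprecation(tools, "SoupStation")
print(sorted(tools))  # -> ['BreakfastMenu', 'OrderHistoryExpress']
```

The key point is that all three moves are edits to the same graph, so the environment stays internally consistent no matter which way it evolves.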
3. The "Task Sandbox" (The Simulation)
Once the environment changes, the system instantly creates a new "test scenario" (a sandbox).
- Analogy: If the "Soup Station" closes, the system immediately generates a new customer who says, "I want soup!" The test isn't just "Can the robot cook?"; it's "Can the robot realize the soup station is closed and politely tell the customer, or find a workaround?"
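The pairing of a change with a probing task could look something like the sketch below. The `make_task` function and its wording are purely hypothetical, meant only to show the idea: each evolution step spawns a task that tests whether the agent has noticed that step.

```python
# Hypothetical sketch: turn an environment change into a test task
# that probes whether an agent has noticed the change.

def make_task(change):
    kind, tool = change
    if kind == "deprecation":
        return (f"A customer asks for '{tool}', which no longer exists. "
                f"Decline gracefully or find a workaround.")
    if kind == "completion":
        return f"Handle a request that requires the new '{tool}' tool."
    # saturation: a shortcut now exists
    return f"Complete the task using the new shortcut '{tool}'."

print(make_task(("deprecation", "SoupStation")))
```

The point of the sandbox is exactly this coupling: the test scenario is derived from the change itself, so the benchmark never drifts out of sync with the evolving environment.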
What They Found (The Experiments)
The authors took a single online store (like Amazon) and used ProEvolve to turn it into 200 different versions of that store, creating 3,000 unique tasks. They then tested top AI models (like GPT-5, Claude, and others) in these changing worlds.
Here is what they discovered:
- AI is Fragile: When the environment changed (e.g., a tool was removed), many AI agents got confused and failed, even if they were smart in the static version. They couldn't adapt to the "broken oven."
- No One-Size-Fits-All: Some AIs improved when new tools were added (Completion), while others struggled most when tools were removed (Deprecation). No single agent was "best" across every kind of change.
- Memory Helps (But Not Always): They tested if letting the AI remember its past conversations helped.
- Result: Sometimes remembering the past helped the AI adapt. Other times, it made the AI overthink or get stuck in old habits. It depended on the specific AI model.
- Cost vs. Success: To succeed in a changing world, the AI often had to ask more questions and use more tools (spending more "money" or computing power). The most "efficient" AI (the one that used the fewest tools) often failed the hardest tasks because it didn't explore enough.
Why This Matters
This paper is a wake-up call. We can't just test AI in a perfect, unchanging world. If we want AI to work in the real world (where APIs break, data changes, and new features appear), we need to test them in environments that evolve.
The Takeaway:
The world won't stay still. If we want our AI agents to be truly robust, we need to stop testing them in a museum (static) and start testing them in a living, breathing city (dynamic). ProEvolve is the tool that lets us build that city.