The Big Idea: "You Are What You Eat" (But for AI)
Imagine you are teaching a robot chef how to cook a complex meal.
- Scenario A: You teach the chef in a kitchen where ingredients stay on the counter. If they chop an onion and put it in a bowl, the bowl stays there for the next step.
- Scenario B: You teach the chef in a kitchen where, after every single step, a magical vacuum cleaner sucks up everything off the counter. The chef has to write down "I have a bowl of onions" on a piece of paper, read it, and then rebuild the bowl from scratch for the next step.
This paper asks: Does it matter which kitchen the chef was trained in?
The answer is a loud YES.
The researchers found that AI agents (robots that use tools like code) don't just learn how to solve a problem; they learn how to use the kitchen they are in. If you train them in one type of kitchen but send them to a different one, they get confused, waste energy, or crash completely.
The Experiment: The "Opaque Knapsack" Game
To test this, the researchers invented a game called Opaque Knapsack.
- The Game: Imagine you have a backpack with a weight limit. You have a pile of mystery boxes. You don't know what's inside (how heavy or valuable they are). You have a limited number of "looks" (budget) to peek inside a box before deciding to pack it.
- The Goal: Pack the most valuable items without breaking the bag.
- The Catch: You can't see the boxes all at once. You have to peek, decide, pack, peek again, and adjust your plan. This requires memory.
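The rules above can be sketched in a few lines of Python. Everything here (item counts, the peek budget, the greedy strategy) is an illustrative assumption, not the paper's exact setup:

```python
import random

def play_opaque_knapsack(num_items=10, capacity=15, peek_budget=5, seed=0):
    """Toy version of the game: item weights and values are hidden until
    'peeked', and only a limited number of peeks is allowed."""
    rng = random.Random(seed)
    items = [(rng.randint(1, 8), rng.randint(1, 10)) for _ in range(num_items)]

    known = {}              # the agent's memory: index -> (weight, value)
    peeks_left = peek_budget

    def peek(i):
        nonlocal peeks_left
        if peeks_left == 0:
            raise RuntimeError("peek budget exhausted")
        peeks_left -= 1
        known[i] = items[i]

    # A simple greedy agent: spend all peeks, then pack best value/weight first.
    for i in range(peek_budget):
        peek(i)
    order = sorted(known, key=lambda i: known[i][1] / known[i][0], reverse=True)

    packed, weight, value = [], 0, 0
    for i in order:
        w, v = known[i]
        if weight + w <= capacity:
            packed.append(i)
            weight += w
            value += v
    return packed, weight, value
```

The interesting part for the paper is not the greedy strategy but the `known` dictionary: the agent must carry that memory from step to step somehow, and where it lives depends on the kitchen.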
They ran this game with AI agents under four different conditions, crossing two types of Training with two types of Testing:
- Persistent Training: The AI was trained in a "Magic Kitchen" where variables (like `my_list_of_items`) stayed alive between steps.
- Stateless Training: The AI was trained in a "Reset Kitchen" where everything vanished after every step, forcing the AI to write everything down in text to remember it.
Then, they tested these AIs in both kitchens.
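The difference between the two kitchens can be sketched as a tiny code runner. This is an assumed minimal harness, not the paper's actual infrastructure: a persistent runtime keeps one namespace across steps, while a stateless one wipes it before every step.

```python
def run_steps(steps, persistent=True):
    """Execute code snippets as if they were an agent's successive tool calls."""
    namespace = {}
    results = []
    for code in steps:
        if not persistent:
            namespace = {}          # the "magical vacuum cleaner": wipe all state
        try:
            exec(code, namespace)
            results.append(namespace.get("out"))
        except NameError as e:
            results.append(f"crash: {e}")
    return results

steps = [
    "my_list = [3, 1, 2]",          # step 1: build some state
    "out = sorted(my_list)",        # step 2: reuse it
]
print(run_steps(steps, persistent=True))   # state survives: [None, [1, 2, 3]]
print(run_steps(steps, persistent=False))  # state is wiped: step 2 crashes
```

In the stateless kitchen, the only way to make step 2 work is to re-emit the data in step 2's code, which is exactly the "write it down on paper" behavior the paper describes.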
The Results: What Happened?
1. The "Amnesia Tax" (Training Stateless, Testing Persistent)
- The Scenario: You trained the AI in the "Reset Kitchen" (where it had to write everything down), but you sent it to the "Magic Kitchen" (where things stay on the counter).
- The Result: The AI didn't realize the counter was safe. It kept writing everything down on paper anyway, even though it could have just left the bowl on the counter.
- The Metaphor: It's like a student who was taught to write every single math step on a scrap of paper because their teacher erased the whiteboard every minute. Even when you give them a permanent whiteboard, they still write on the scrap paper.
- The Cost: The AI used roughly 3.5 times as many tokens (its "energy") as necessary. It solved the problem, but incredibly inefficiently. The researchers call this overhead the "Amnesia Tax."
2. The "Cascading Crash" (Training Persistent, Testing Stateless)
- The Scenario: You trained the AI in the "Magic Kitchen" (where it learned to trust that variables stay alive), but you sent it to the "Reset Kitchen."
- The Result: Disaster. The AI tried to grab a variable (like `my_list`) that it thought was on the counter, but the vacuum cleaner had already sucked it away.
- The Metaphor: Imagine a chef trained in a kitchen where the stove stays hot. You send them to a kitchen where the stove turns off automatically after every minute. They try to cook on a cold stove, get confused, try to fix it, fail again, and get stuck in a loop of panic.
- The Cost: The AI crashed in 80% of the attempts. It entered a loop of errors, trying to "remember" things that didn't exist, burning through its energy budget without making progress.
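That error loop is easy to reproduce in a toy sketch (again an assumption, not the paper's agent): each "fix" the agent writes is itself wiped before the next step, so every retry fails the same way until the budget runs out.

```python
def stateless_step(code):
    exec(code, {})  # fresh namespace on every call: nothing survives

budget, errors = 6, 0
for attempt in range(budget):
    try:
        stateless_step("out = sorted(my_list)")   # assumes my_list survived
        break
    except NameError:
        errors += 1
        stateless_step("my_list = [3, 1, 2]")     # the "fix" vanishes immediately
print(f"{errors} errors in {budget} attempts")    # → "6 errors in 6 attempts"
```

No single step is wrong in the persistent world the agent learned; the loop only fails because the agent's model of which world it is in no longer matches reality.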
3. The "Happy Path" (Matched Training and Testing)
- The Scenario: Training and testing in the same kitchen.
- The Result: The AI performed well.
- The Surprise: Interestingly, the quality of the final solution (did they get the best backpack?) was roughly the same in all cases. The difference wasn't if they solved it, but how much effort it took and how stable the process was.
The Key Takeaway: "Runtime" is a Design Choice, Not a Bug
For a long time, developers thought the "runtime" (the computer environment where the code runs) was just a boring technical detail, like the color of the walls in a classroom. They thought, "The AI learns the math; the walls don't matter."
This paper proves that is wrong.
The "runtime" is part of the lesson.
- If you want an AI to be efficient and use the computer's memory, you must train it in an environment where the memory works.
- If you train it to rely on writing things down in text, it will do that forever, even if it's wasteful.
- If you train it to rely on memory, it will crash if you suddenly take that memory away.
The Bottom Line for Humans
Think of an AI agent like a new employee.
- If you train them using a specific software tool (like a persistent database), they will learn to rely on that tool.
- If you suddenly switch them to a different system (like a stateless text log) without retraining them, they won't just be "slower"; they might make catastrophic mistakes because their mental model of how the world works is broken.
Conclusion: When building AI agents, you cannot treat the environment they run in as an afterthought. You must design the training data to match the real-world environment exactly, or the AI will pay a heavy "tax" in wasted energy or suffer from "amnesia" and crashes.