Reducing Cost of LLM Agents with Trajectory Reduction

Imagine you are hiring a brilliant but very expensive detective (the LLM Agent) to solve a complex mystery, like fixing a bug in a massive software codebase.

Every time the detective takes a step (checking a file, running a command, or reading a log), they write it down in a giant notebook called the Trajectory. The whole notebook is handed back to the detective before the next step so they remember what has happened so far.

The Problem: The "Cluttered Notebook"

The paper points out a huge inefficiency: The notebook gets too heavy.

As the detective solves the mystery, the notebook fills up with:

Useless Info: "I opened the folder __pycache__." (Who cares? It's just temporary junk.)
Redundant Info: "To change line 50, here is the entire old line 50 again, followed by the new one." (We have already seen the old line!)
Expired Info: "I checked 50 files to find the one with the bug." (Now that we found the bug, we don't need to remember the other 49 files we looked at.)

Because the detective has to read the entire notebook every single time they take a new step, the cost (in money and time) explodes. It's like hiring a detective who has to re-read a 500-page history book every time they decide to open a drawer, even though the book is mostly about irrelevant history.

The Solution: "AgentDiet"

The authors propose a solution called AgentDiet. Think of this as a smart, frugal editor who sits next to the detective.

Here is how it works:

The Detective Works: The detective solves a step and writes it in the notebook.
The Editor Steps In: Before the detective starts the next step, the Editor (a cheaper, faster AI) looks back at an entry written a few steps earlier, leaving the most recent notes untouched.
The Diet: The Editor asks, "Do we really need all these words?"
- Editor: "You listed 73 test results, but only one failed. Let's delete the 72 'Passed' ones and just write '72 passed, 1 failed'."
- Editor: "You pasted a 10-page file, but only changed one line. Let's keep the context but summarize the rest."
The Result: The notebook is now much lighter. The detective can read it faster, and the company pays less for the detective's time.

Why This is a Big Deal

The paper tested this on real-world coding tasks (fixing bugs on GitHub) and found some amazing results:

Massive Savings: They cut the amount of "reading material" the detective had to process by 40% to 60%.
Cheaper Bills: Because the detective reads less, the total cost of the job dropped by 21% to 36%.
No Loss of Smarts: The most surprising part? The detective solved roughly the same number of problems correctly (the success rate shifted by only -1.0% to +2.0%). In some cases, the detective even needed fewer steps because they weren't getting confused by a cluttered notebook.

The "Secret Sauce"

The authors realized that asking the detective to clean their own notebook rarely works: the detective is so focused on cracking the case that they seldom stop to tidy up. So, they hired a separate, cheaper editor (a different, smaller AI model) to do the cleaning. This editor is so cheap that the savings from the detective's reduced reading time far outweigh the cost of hiring the editor.

The Bottom Line

This paper shows that less can be more. By trimming the fat from the "conversation history" of AI agents, we can make them significantly cheaper, and sometimes faster, without measurably hurting their success rate. It's like decluttering your workspace: you can work better when you aren't tripping over old papers.

1. Problem Statement

Large Language Model (LLM) agents are increasingly used for complex software engineering tasks (e.g., code generation, debugging, and repair). However, their adoption is hindered by high computational costs, primarily driven by the ever-growing trajectory of multi-turn interactions.
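
To see why an append-only trajectory drives cost, a back-of-the-envelope sketch may help. It assumes, purely for illustration, that every step adds the same number of tokens and that the full trajectory is re-sent as input on each step. The function name and the numbers are hypothetical and not taken from the paper, and prompt caching (which changes the price per token, not the token count) is ignored.

```python
def cumulative_input_tokens(num_steps: int, tokens_per_step: int) -> int:
    """Total input tokens when the whole trajectory is re-sent on every step.

    Illustrative assumption: each step appends `tokens_per_step` tokens,
    so step k reads k * tokens_per_step tokens of history.
    """
    return sum(k * tokens_per_step for k in range(1, num_steps + 1))


if __name__ == "__main__":
    for steps in (10, 20, 40):
        total = cumulative_input_tokens(steps, tokens_per_step=1_000)
        print(f"{steps:>3} steps -> {total:,} input tokens")
```

Under these assumptions, doubling the number of steps roughly quadruples the total input tokens, which is why shrinking one early step pays off on every later step that re-reads it.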

The Core Issue: In a typical agent workflow, every tool call and its output are appended to the trajectory. As the agent solves a task, the context window accumulates massive amounts of tokens.
The Waste: Analysis reveals that trajectories contain significant "waste" in three forms:
1. Useless Information: Irrelevant data (e.g., verbose build logs, cache directories, or successful test outputs that don't contribute to the solution).
2. Redundant Information: Repeated content (e.g., the old_str and new_str in file editing tools often repeat code already seen in previous steps).
3. Expired Information: Context relevant to a past step but no longer necessary (e.g., file listings from a directory search once the target file is identified).
Current Limitations: Existing token reduction methods focus on single-turn tasks or natural language prompts, failing to address the iterative, structured nature of agent trajectories. Furthermore, agents rarely reduce their own context autonomously due to training biases toward task completion.

2. Methodology: AgentDiet

The authors propose AgentDiet, an inference-time trajectory reduction approach that automatically identifies and removes waste during agent execution without modifying the underlying LLM.

Key Design Components

Reflection Module (External Control):
- Instead of relying on the agent to reduce its own context (which often fails), AgentDiet employs a separate Reflection Module.
- This module is triggered by the outer system at specific intervals to analyze and compress the trajectory.
- It uses a cost-efficient LLM (e.g., GPT-5 mini) distinct from the main agent LLM (e.g., Claude 4 Sonnet) to minimize overhead.
Sliding Window Mechanism:
- To manage overhead and preserve KV Cache efficiency, the system does not process the entire history at once.
- It uses a sliding window defined by hyperparameters:
  - $a$ (Delay): The number of steps that must elapse after a step before it becomes eligible for reduction (i.e., when the agent is at step $s$, the reduction target is step $s-a$). This prevents the reduction of the most recent, critical context.
  - $b$ (Context): The number of preceding steps provided as context to the reflection LLM to ensure it understands the flow.
  - $\theta$ (Threshold): A token length threshold. Reduction is only attempted if the target step exceeds this length, avoiding overhead for short steps.
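
The three hyperparameters above can be sketched as a small selection function. This is a minimal illustration rather than the authors' implementation: only $a$, $b$, and $\theta$ come from the text, while `Window`, `select_window`, and `step_lengths` are hypothetical names. The sketch passes only the preceding steps as context, following the description of $b$; whether steps after the target are also shown to the reflection LLM is not specified here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Window:
    target: int          # index of the step to reduce (s - a)
    context: List[int]   # indices of up to b preceding steps given as context


def select_window(step_lengths: List[int], s: int, a: int, b: int,
                  theta: int) -> Optional[Window]:
    """Return the reduction window at current step `s`, or None if nothing qualifies.

    `step_lengths[i]` is the token length of step i (0-indexed).
    """
    target = s - a
    if target < 0:
        return None  # not enough history yet
    if step_lengths[target] <= theta:
        return None  # too short to justify a reflection call
    start = max(0, target - b)
    return Window(target=target, context=list(range(start, target)))
```

For example, with `step_lengths = [120, 4000, 300, 90, 2500]`, `a=2`, `b=1`, and `theta=500`, the call at `s=3` targets step 1 with step 0 as context, while the call at `s=4` returns `None` because step 2 is only 300 tokens long.
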
Reduction Strategy:
- The Reflection LLM is prompted to identify and remove the three types of waste (useless, redundant, expired).
- It replaces removed content with concise placeholders (e.g., "individual test lines omitted; mostly PASSED") while preserving the structural integrity (XML tags, tool call formats) and critical information (e.g., specific error messages).
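
To make the shape of a reduced step concrete, the sketch below uses a deterministic, rule-based stand-in for the Reflection LLM. This is not how AgentDiet works internally (the paper prompts an LLM); it only illustrates the intended result, in which bulk lines become a short placeholder while tags and error messages survive verbatim. The log format, the regular expression, and the function name are all assumptions made for this example.

```python
import re

# Hypothetical log format: "<test id> PASSED" on a line by itself.
PASSED_LINE = re.compile(r"^\S+\s+PASSED\s*$")


def collapse_passed_tests(tool_output: str) -> str:
    """Replace each run of passing-test lines with a single placeholder line.

    Every other line (failures, error messages, XML tags) is kept unchanged.
    """
    kept, omitted = [], 0
    for line in tool_output.splitlines():
        if PASSED_LINE.match(line):
            omitted += 1
            continue
        if omitted:
            kept.append(f"[{omitted} PASSED test lines omitted]")
            omitted = 0
        kept.append(line)
    if omitted:  # flush a run that reaches the end of the output
        kept.append(f"[{omitted} PASSED test lines omitted]")
    return "\n".join(kept)
```

Given a tool result containing two `PASSED` lines followed by one `FAILED` line and its assertion message, the function keeps the surrounding tags, the failure, and the message, and replaces the two passing lines with `[2 PASSED test lines omitted]`.
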
Algorithm Integration:
- AgentDiet is integrated into the standard agent loop (Algorithm 1). After the agent executes a tool and appends the result to the trajectory, the system checks if the conditions for reflection are met. If so, it invokes the Reflection LLM, receives the reduced version, and updates the trajectory.
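
The integration described above can be sketched as a loop with two pluggable callables. This is a simplified illustration, not a reproduction of the paper's Algorithm 1: `run_agent`, `AgentFn`, `ReflectFn`, and the whitespace-based `count_tokens` are hypothetical, the loop runs for a fixed `max_steps` instead of stopping when the agent finishes, and the check that a reduction must actually shrink the step is a safeguard assumed for this sketch.

```python
from typing import Callable, List

Step = str
AgentFn = Callable[[List[Step]], Step]          # trajectory -> next step (tool call + result)
ReflectFn = Callable[[List[Step], Step], Step]  # (context steps, target step) -> reduced step


def count_tokens(text: str) -> int:
    """Crude whitespace count; a real system would use the model's tokenizer."""
    return len(text.split())


def run_agent(agent: AgentFn, reflect: ReflectFn, max_steps: int,
              a: int, b: int, theta: int) -> List[Step]:
    trajectory: List[Step] = []
    for s in range(max_steps):
        trajectory.append(agent(trajectory))  # execute a tool, append its result
        target = s - a                        # the step that just became eligible
        if target >= 0 and count_tokens(trajectory[target]) > theta:
            context = trajectory[max(0, target - b):target]
            reduced = reflect(context, trajectory[target])
            # Assumed safeguard: keep the original unless the step got shorter.
            if count_tokens(reduced) < count_tokens(trajectory[target]):
                trajectory[target] = reduced
    return trajectory
```

Because the target index is always `s - a`, each step is considered for reduction exactly once, and the most recent `a` steps are never touched.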

3. Key Contributions

Empirical Discovery: The paper provides the first large-scale analysis revealing that LLM agent trajectories contain widespread, removable waste (up to 70% of processed tokens in some steps).
AgentDiet Framework: A novel, open-source, inference-time reduction approach that is model-agnostic and easily integrable into existing agent systems.
Novel Mechanism: The introduction of an external "Reflection Module" with a sliding window strategy to balance cost, latency, and performance, overcoming the inability of agents to self-reduce context.
Comprehensive Evaluation: Extensive experiments across multiple benchmarks, LLMs, and programming languages.

4. Experimental Results

The authors evaluated AgentDiet on SWE-bench Verified and Multi-SWE-bench Flash using Claude 4 Sonnet and Gemini 2.5 Pro.

Efficiency Gains:
- Input Token Reduction: Reduced input tokens by 39.9% – 59.7%.
- Cost Reduction: Achieved a 21.1% – 35.9% reduction in total computational cost (accounting for the overhead of the reflection module).
- Token Retention: The system retained only ~22%–30% of the tokens in the steps it processed, effectively compressing the trajectory.
Performance Impact:
- Task Success Rate (Pass%): AgentDiet maintained performance parity with the original agent, with a variation of -1.0% to +2.0%.
- Step Count: In most cases, the number of steps required to solve a task did not increase. Notably, for Gemini 2.5 Pro on complex benchmarks, AgentDiet reduced the average steps (from 57.2 to 43.9) by preventing the model from entering "looping" states caused by excessive context length.
Generalization:
- The approach generalized well across different programming languages (Rust, TypeScript, Java, C++, etc.) and different LLM architectures.
- It worked effectively even when the reflection module used a significantly cheaper model (GPT-5 mini) compared to the agent model.
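
The cost figures above are net of the reflection module's overhead. A minimal sketch of that accounting follows; the function name, parameter names, and any prices plugged into it are hypothetical placeholders rather than vendor prices or values from the paper, and the model ignores cache discounts and the agent's own output tokens.

```python
def net_cost_saving(agent_tokens_saved: int, agent_price_per_mtok: float,
                    reflect_input_tokens: int, reflect_output_tokens: int,
                    reflect_in_price_per_mtok: float,
                    reflect_out_price_per_mtok: float) -> float:
    """Dollars saved on agent input minus dollars spent on the reflection model.

    Prices are expressed in dollars per million tokens.
    """
    saved = agent_tokens_saved / 1e6 * agent_price_per_mtok
    overhead = (reflect_input_tokens / 1e6 * reflect_in_price_per_mtok
                + reflect_output_tokens / 1e6 * reflect_out_price_per_mtok)
    return saved - overhead
```

A positive return value means the reduction paid for itself. The saving grows with the price gap between the agent model and the reflection model, which is one reason to use a cheaper model for reflection.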

5. Significance and Conclusion

Breaking the Efficiency-Performance Trade-off: Contrary to the common assumption that cutting context, and with it test-time compute, must harm performance, this paper demonstrates that removing noise can maintain or even improve agent robustness.
Practical Applicability: AgentDiet offers a "drop-in" solution for existing agent products (like Cursor or Claude Code) to drastically reduce operational costs without requiring model fine-tuning or white-box access.
Future Direction: The work establishes inference-time trajectory reduction as a critical research direction for making LLM agents economically viable for large-scale software engineering applications.

Artifact Availability: The implementation of AgentDiet and experimental data are publicly available at https://doi.org/10.6084/m9.figshare.30073654.