EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection

EvoTool is a self-evolving framework that optimizes modular LLM agent tool-use policies through a gradient-free evolutionary process featuring blame-aware mutation and diversity-aware selection, significantly outperforming existing baselines in accuracy, efficiency, and transferability.

Shuo Yang, Soyeon Caren Han, Xueqi Ma, Yan Li, Mohammad Reza Ghasemi Madani, Eduard Hovy

Published 2026-03-06

Imagine you hire a team of four specialists to solve a complex problem for you, like planning a cross-country road trip that involves booking flights, finding hotels, and reserving rental cars.

In the world of AI, this "team" is an LLM Agent (a smart computer program) that uses external tools (like APIs) to get things done. The paper introduces a new system called EVOTOOL to teach this team how to work better.

Here is the breakdown of the problem and the solution, using simple analogies.

The Problem: The "Blame Game" and the "Sledgehammer"

When the team fails to book the trip, it's usually because of a long chain of events. Maybe the planner picked the wrong dates, the selector chose the wrong airline, the caller typed the wrong code, or the synthesizer wrote a confusing summary.

Current methods of fixing AI agents have two big flaws:

  1. The Sledgehammer Approach (Monolithic): If the trip fails, the old way is to tell the entire team, "You all messed up! Change everything!" This is like firing the whole crew and hiring new ones because one person made a typo. It often breaks things that were working fine.
  2. The Guessing Game (Single-Aspect): Another way is to just tweak one person, say, the Planner, without knowing if they were actually the problem. It's like blaming the chef for a burnt steak when the real culprit was a broken oven thermostat.

Because the AI only gets a "Pass" or "Fail" at the very end, it doesn't know who is to blame. This is called the Credit Assignment Problem.

The Solution: EVOTOOL (The Smart Coach)

EVOTOOL is a self-improving system that acts like a smart coach who watches the game, finds the exact mistake, and gives specific advice to the right player. It breaks the AI agent into four distinct roles (modules):

  1. The Planner: Breaks the big task into small steps.
  2. The Selector: Decides which tool (like which API) to use.
  3. The Caller: Actually uses the tool and fills in the forms.
  4. The Synthesizer: Puts all the answers together into a final response.
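The four-module loop above can be sketched in a few lines. This is a self-contained toy, not EVOTOOL's real API: the stubbed `call_llm`, the `prompts` dict, and the fixed two-step plan are all illustrative assumptions. The one thing it does reflect from the paper is the structure: each module has its own prompt, and every module's output is logged into a trajectory for later review.

```python
def call_llm(prompt, *context):
    """Stand-in for a real LLM call; deterministic so the sketch runs offline."""
    return f"{prompt}:{'|'.join(map(str, context))}"

def run_agent(task, prompts, tools):
    """Run one task through the four modules, logging a trajectory."""
    trajectory = []
    # 1. Planner: break the task into steps (stubbed as two fixed steps).
    plan = [call_llm(prompts["planner"], task, i) for i in range(2)]
    trajectory += [("planner", s) for s in plan]
    results = []
    for step in plan:
        # 2. Selector: pick a tool for this step (stubbed: always the first tool).
        tool = tools[0]
        trajectory.append(("selector", tool))
        # 3. Caller: fill in the arguments and invoke the tool.
        results.append(call_llm(prompts["caller"], step, tool))
        trajectory.append(("caller", results[-1]))
    # 4. Synthesizer: combine intermediate results into a final answer.
    answer = call_llm(prompts["synthesizer"], task, *results)
    trajectory.append(("synthesizer", answer))
    return answer, trajectory
```

The trajectory is the key design choice: because every module's output is recorded with its author, a failure can later be traced back to a specific module rather than to the team as a whole.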

EVOTOOL improves this team using three clever tricks:

1. Trajectory-Grounded Blame Attribution (The Detective)

Instead of guessing who failed, EVOTOOL acts like a detective. It reviews the "replay" of the agent's actions (the trajectory).

  • Analogy: Imagine a coach reviewing the game film. If the team lost because the quarterback threw the ball to the wrong receiver, the coach doesn't blame the playbook or the kicker. They point specifically at the quarterback and say, "You made the wrong choice."
  • How it works: The system analyzes the error logs and assigns a "blame score" to each of the four modules. It identifies exactly which module caused the failure.
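A toy version of that blame scoring might look like the sketch below. EVOTOOL's actual attributor reasons over the full trajectory with an LLM; here we assume a simpler interface, where some trajectory entries have already been flagged as faulty, and we turn those flags into normalized per-module blame scores.

```python
from collections import Counter

MODULES = ("planner", "selector", "caller", "synthesizer")

def blame_scores(trajectory, error_indices):
    """Map each module to a blame score in [0, 1]; scores sum to 1.

    `trajectory` is a list of (module, output) pairs, and `error_indices` is
    the set of trajectory positions judged faulty (an assumed interface).
    """
    counts = Counter(trajectory[i][0] for i in error_indices)
    total = sum(counts.values()) or 1  # avoid division by zero on a clean run
    return {m: counts.get(m, 0) / total for m in MODULES}

def blamed_module(trajectory, error_indices):
    """Return the single module with the highest blame score."""
    scores = blame_scores(trajectory, error_indices)
    return max(scores, key=scores.get)
```

In this toy, two faulty Caller outputs would give the Caller a blame score of 1.0 and the other three modules 0.0, so the mutation step knows exactly where to aim.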

2. Feedback-Guided Targeted Mutation (The Personal Trainer)

Once the detective finds the culprit, the coach doesn't rewrite the whole playbook. They give specific, natural language feedback to just that one person.

  • Analogy: If the quarterback is the problem, the coach doesn't fire the whole team. They pull the quarterback aside and say, "Next time, check the receiver's position before throwing." The rest of the team (the kicker, the defense) keeps doing what they are good at.
  • How it works: The system edits only the instructions (the prompt) for the blamed module, leaving the others untouched. This prevents "regressions" (breaking things that were already working).
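The "edit one prompt, leave the rest alone" step can be sketched as below. The `revise_prompt` stub stands in for the LLM call that folds the natural-language feedback into the blamed module's instructions; its exact behavior is an assumption, not the paper's prompt-editing procedure. What the sketch does capture is the invariant: the child candidate differs from the parent in exactly one module.

```python
def revise_prompt(prompt, feedback):
    """Stub for the LLM rewrite: append the feedback as a new instruction."""
    return prompt + "\nAdditional guidance: " + feedback

def mutate(prompts, blamed, feedback):
    """Return a new candidate whose prompts differ only for `blamed`.

    `prompts` maps module name -> prompt text. The parent dict is copied,
    so candidates that were working are never modified in place.
    """
    child = dict(prompts)  # untouched modules keep their proven prompts
    child[blamed] = revise_prompt(prompts[blamed], feedback)
    return child
```

Copying the parent rather than editing it in place is what prevents regressions: the old candidate stays in the population unchanged, and if the mutation makes things worse, the selection step can simply discard the child.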

3. Diversity-Aware Population Selection (The Talent Scout)

In evolution, if you only pick the single "best" candidate, you might lose unique skills. Maybe one agent is great at booking flights but bad at hotels, while another is the opposite.

  • Analogy: A talent scout doesn't just hire the one person who won the last race. They keep a diverse team: a sprinter, a marathon runner, and a jumper. They know that different tasks need different strengths.
  • How it works: EVOTOOL keeps a "population" of different agent versions. It doesn't just pick the one with the highest average score; it keeps agents that are the best at specific types of tasks. This ensures the team doesn't become too narrow and can handle a wide variety of problems.
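A minimal sketch of that selection rule follows. The scoring interface (candidate id mapped to per-task-type accuracy) is an assumption for illustration; the idea it demonstrates is from the paper: keep the best generalist by average score, but also keep every per-task specialist, even if their average is lower.

```python
def select_population(scores, keep_overall=1):
    """Select survivors from a population of candidates.

    `scores` maps candidate id -> {task_type: accuracy}. Keeps the top
    `keep_overall` candidates by mean accuracy, plus the single best
    candidate on each individual task type.
    """
    survivors = set()
    # Generalists: top candidate(s) by mean score across all task types.
    by_mean = sorted(scores,
                     key=lambda c: sum(scores[c].values()) / len(scores[c]),
                     reverse=True)
    survivors.update(by_mean[:keep_overall])
    # Specialists: the best candidate on each task type, kept regardless
    # of their average -- this is what preserves diversity.
    task_types = {t for per_task in scores.values() for t in per_task}
    for t in task_types:
        survivors.add(max(scores, key=lambda c: scores[c].get(t, 0.0)))
    return survivors
```

With greedy top-1 selection, a flight-booking specialist with a mediocre average would be thrown away; here it survives as the flights champion, so its unique skill remains available for future mutations.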

The Results: Why It Matters

The paper tested EVOTOOL on four different "obstacle courses" (benchmarks) involving complex tool use.

  • The Outcome: EVOTOOL beat the best existing methods by a significant margin (over 5 points).
  • Efficiency: It learned faster and used fewer computer resources (tokens) because it didn't waste time retraining the whole team, just the specific part that needed fixing.
  • Transferability: The skills it learned on one type of task (like booking flights) transferred well to other tasks (like managing databases), proving it learned how to use tools, not just memorized answers.

Summary

EVOTOOL is like a self-improving robot coach. Instead of blindly guessing why an AI failed or rewriting its entire brain, it:

  1. Investigates the replay to find the exact mistake.
  2. Coaches only the specific part of the brain that made the mistake.
  3. Keeps a diverse team so it doesn't lose unique talents.

This makes AI agents more reliable, efficient, and better at solving complex, real-world problems.
