EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection

EvoTool is a self-evolving framework that optimizes modular LLM agent tool-use policies through a gradient-free evolutionary process featuring blame-aware mutation and diversity-aware selection, significantly outperforming existing baselines in accuracy, efficiency, and transferability.

Shuo Yang, Soyeon Caren Han, Xueqi Ma, Yan Li, Mohammad Reza Ghasemi Madani, Eduard Hovy

Published 2026-03-06

Imagine you hire a team of four specialists to solve a complex problem for you, like planning a cross-country road trip that involves booking flights, finding hotels, and reserving rental cars.

In the world of AI, this "team" is an LLM Agent (a smart computer program) that uses external tools (like APIs) to get things done. The paper introduces a new system called EVOTOOL to teach this team how to work better.

Here is the breakdown of the problem and the solution, using simple analogies.

The Problem: The "Blame Game" and the "Sledgehammer"

When the team fails to book the trip, it's usually because of a long chain of events. Maybe the planner picked the wrong dates, the selector chose the wrong airline, the caller typed the wrong code, or the synthesizer wrote a confusing summary.

Current methods of fixing AI agents have two big flaws:

  1. The Sledgehammer Approach (Monolithic): If the trip fails, the old way is to tell the entire team, "You all messed up! Change everything!" This is like firing the whole crew and hiring new ones because one person made a typo. It often breaks things that were working fine.
  2. The Guessing Game (Single-Aspect): Another way is to just tweak one person, say, the Planner, without knowing if they were actually the problem. It's like blaming the chef for a burnt steak when the real culprit was a broken oven thermostat.

Because the AI only gets a "Pass" or "Fail" at the very end, it doesn't know who is to blame. This is called the Credit Assignment Problem.

The Solution: EVOTOOL (The Smart Coach)

EVOTOOL is a self-improving system that acts like a smart coach who watches the game, finds the exact mistake, and gives specific advice to the right player. It breaks the AI agent into four distinct roles (modules):

  1. The Planner: Breaks the big task into small steps.
  2. The Selector: Decides which tool (like which API) to use.
  3. The Caller: Actually uses the tool and fills in the forms.
  4. The Synthesizer: Puts all the answers together into a final response.
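The four-module loop above can be sketched in a few lines. This is a self-contained toy, not EVOTOOL's real API: the stubbed `call_llm`, the `prompts` dict, and the fixed two-step plan are all illustrative assumptions. The one thing it does reflect from the paper is the structure: each module has its own prompt, and every module's output is logged into a trajectory for later review.

```python
def call_llm(prompt, *context):
    """Stand-in for a real LLM call; deterministic so the sketch runs offline."""
    return f"{prompt}:{'|'.join(map(str, context))}"

def run_agent(task, prompts, tools):
    """Run one task through the four modules, logging a trajectory."""
    trajectory = []
    # 1. Planner: break the task into steps (stubbed as two fixed steps).
    plan = [call_llm(prompts["planner"], task, i) for i in range(2)]
    trajectory += [("planner", s) for s in plan]
    results = []
    for step in plan:
        # 2. Selector: pick a tool for this step (stubbed: always the first tool).
        tool = tools[0]
        trajectory.append(("selector", tool))
        # 3. Caller: fill in the arguments and invoke the tool.
        results.append(call_llm(prompts["caller"], step, tool))
        trajectory.append(("caller", results[-1]))
    # 4. Synthesizer: combine intermediate results into a final answer.
    answer = call_llm(prompts["synthesizer"], task, *results)
    trajectory.append(("synthesizer", answer))
    return answer, trajectory
```

The trajectory is the key design choice: because every module's output is recorded with its author, a failure can later be traced back to a specific module rather than to the team as a whole.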

EVOTOOL improves this team using three clever tricks:

1. Trajectory-Grounded Blame Attribution (The Detective)

Instead of guessing who failed, EVOTOOL acts like a detective. It reviews the "replay" of the agent's actions (the trajectory).

  • Analogy: Imagine a coach reviewing the game film. If the team lost because the quarterback threw the ball to the wrong receiver, the coach doesn't blame the playbook or the kicker. They point specifically at the quarterback and say, "You made the wrong choice."
  • How it works: The system analyzes the error logs and assigns a "blame score" to each of the four modules. It identifies exactly which module caused the failure.
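A toy version of that blame scoring might look like the sketch below. EVOTOOL's actual attributor reasons over the full trajectory with an LLM; here we assume a simpler interface, where some trajectory entries have already been flagged as faulty, and we turn those flags into normalized per-module blame scores.

```python
from collections import Counter

MODULES = ("planner", "selector", "caller", "synthesizer")

def blame_scores(trajectory, error_indices):
    """Map each module to a blame score in [0, 1]; scores sum to 1.

    `trajectory` is a list of (module, output) pairs, and `error_indices` is
    the set of trajectory positions judged faulty (an assumed interface).
    """
    counts = Counter(trajectory[i][0] for i in error_indices)
    total = sum(counts.values()) or 1  # avoid division by zero on a clean run
    return {m: counts.get(m, 0) / total for m in MODULES}

def blamed_module(trajectory, error_indices):
    """Return the single module with the highest blame score."""
    scores = blame_scores(trajectory, error_indices)
    return max(scores, key=scores.get)
```

In this toy, two faulty Caller outputs would give the Caller a blame score of 1.0 and the other three modules 0.0, so the mutation step knows exactly where to aim.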

2. Feedback-Guided Targeted Mutation (The Personal Trainer)

Once the detective finds the culprit, the coach doesn't rewrite the whole playbook. They give specific, natural language feedback to just that one person.

  • Analogy: If the quarterback is the problem, the coach doesn't fire the whole team. They pull the quarterback aside and say, "Next time, check the receiver's position before throwing." The rest of the team (the kicker, the defense) keeps doing what they are good at.
  • How it works: The system edits only the instructions (the prompt) for the blamed module, leaving the others untouched. This prevents "regressions" (breaking things that were already working).
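The "edit one prompt, leave the rest alone" step can be sketched as below. The `revise_prompt` stub stands in for the LLM call that folds the natural-language feedback into the blamed module's instructions; its exact behavior is an assumption, not the paper's prompt-editing procedure. What the sketch does capture is the invariant: the child candidate differs from the parent in exactly one module.

```python
def revise_prompt(prompt, feedback):
    """Stub for the LLM rewrite: append the feedback as a new instruction."""
    return prompt + "\nAdditional guidance: " + feedback

def mutate(prompts, blamed, feedback):
    """Return a new candidate whose prompts differ only for `blamed`.

    `prompts` maps module name -> prompt text. The parent dict is copied,
    so candidates that were working are never modified in place.
    """
    child = dict(prompts)  # untouched modules keep their proven prompts
    child[blamed] = revise_prompt(prompts[blamed], feedback)
    return child
```

Copying the parent rather than editing it in place is what prevents regressions: the old candidate stays in the population unchanged, and if the mutation makes things worse, the selection step can simply discard the child.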

3. Diversity-Aware Population Selection (The Talent Scout)

In evolution, if you only pick the single "best" candidate, you might lose unique skills. Maybe one agent is great at booking flights but bad at hotels, while another is the opposite.

  • Analogy: A talent scout doesn't just hire the one person who won the last race. They keep a diverse team: a sprinter, a marathon runner, and a jumper. They know that different tasks need different strengths.
  • How it works: EVOTOOL keeps a "population" of different agent versions. It doesn't just pick the one with the highest average score; it keeps agents that are the best at specific types of tasks. This ensures the team doesn't become too narrow and can handle a wide variety of problems.
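A minimal sketch of that selection rule follows. The scoring interface (candidate id mapped to per-task-type accuracy) is an assumption for illustration; the idea it demonstrates is from the paper: keep the best generalist by average score, but also keep every per-task specialist, even if their average is lower.

```python
def select_population(scores, keep_overall=1):
    """Select survivors from a population of candidates.

    `scores` maps candidate id -> {task_type: accuracy}. Keeps the top
    `keep_overall` candidates by mean accuracy, plus the single best
    candidate on each individual task type.
    """
    survivors = set()
    # Generalists: top candidate(s) by mean score across all task types.
    by_mean = sorted(scores,
                     key=lambda c: sum(scores[c].values()) / len(scores[c]),
                     reverse=True)
    survivors.update(by_mean[:keep_overall])
    # Specialists: the best candidate on each task type, kept regardless
    # of their average -- this is what preserves diversity.
    task_types = {t for per_task in scores.values() for t in per_task}
    for t in task_types:
        survivors.add(max(scores, key=lambda c: scores[c].get(t, 0.0)))
    return survivors
```

With greedy top-1 selection, a flight-booking specialist with a mediocre average would be thrown away; here it survives as the flights champion, so its unique skill remains available for future mutations.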

The Results: Why It Matters

The paper tested EVOTOOL on four different "obstacle courses" (benchmarks) involving complex tool use.

  • The Outcome: EVOTOOL beat the best existing methods by a significant margin (over 5 points).
  • Efficiency: It learned faster and used fewer computer resources (tokens) because it didn't waste time retraining the whole team, just the specific part that needed fixing.
  • Transferability: The skills it learned on one type of task (like booking flights) transferred well to other tasks (like managing databases), proving it learned how to use tools, not just memorized answers.

Summary

EVOTOOL is like a self-improving robot coach. Instead of blindly guessing why an AI failed or rewriting its entire brain, it:

  1. Investigates the replay to find the exact mistake.
  2. Coaches only the specific part of the brain that made the mistake.
  3. Keeps a diverse team so it doesn't lose unique talents.

This makes AI agents more reliable, efficient, and better at solving complex, real-world problems.
