Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent

This paper introduces Tool-Genesis, a diagnostic benchmark designed to evaluate and quantify how well self-evolving language agents can autonomously create and use tools from abstract requirements. It reveals that even state-of-the-art models struggle with interface precision and logic execution, leading to significant downstream performance degradation.

Bowei Xia, Mengkang Hu, Shijian Wang, Jiarui Jin, Wenxiang Jiao, Yuan Lu, Kexin Li, Ping Luo

Published Mon, 09 Ma

Imagine you have a brilliant, super-smart robot assistant (an AI) that can talk to you and solve problems. Right now, most people treat this robot like a calculator: you give it a specific button to press (a tool), and it presses it. If the button is broken or missing, the robot just stops working.

"Tool-Genesis" is a new research paper that asks a much harder question: What if the robot could build its own buttons, fix broken ones, and even invent new tools from scratch just by listening to your vague description of a problem?

Here is the paper broken down into simple concepts and analogies:

1. The Problem: The "Black Box" of Robot Tools

Currently, when we test these robots, we usually give them a pre-made list of tools (like a toolbox with a hammer and a screwdriver already inside). We ask them to use the hammer. If they succeed, we say they are smart. If they fail, we just know they failed, but we don't know why. Did they pick the wrong hammer? Did they hold it wrong? Or did the hammer break because the robot built it poorly?

This is like a Black Box. You put a task in, and you see a result, but you can't see the messy middle part where the robot is actually trying to build the tool.

2. The Solution: Tool-Genesis (The "Architect" Exam)

The researchers created a new test called Tool-Genesis. Instead of giving the robot a toolbox, they give it a blank piece of paper and a vague request.

  • The Request: "I need to book a train ticket from Shanghai to Beijing, but I don't know the train number yet."
  • The Task: The robot must:
    1. Design the Blueprint: Figure out what a "train booking tool" looks like (what information does it need? What does it return?).
    2. Build the Tool: Write the actual code to make that tool work.
    3. Test It: Make sure the tool actually works before using it.
    4. Solve the Problem: Use that new tool to book the ticket.
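The four steps above can be sketched in code. This is a toy illustration, not the paper's actual setup: the spec format, the `book_train_ticket` name, and the hard-coded timetable are all made up for the example. In the benchmark, the agent itself would generate both the blueprint and the implementation.

```python
# 1. Design the blueprint: an interface spec saying what the tool
#    needs as input and what it returns (hypothetical format).
BOOK_TRAIN_SPEC = {
    "name": "book_train_ticket",
    "parameters": {
        "origin": "str",       # departure city
        "destination": "str",  # arrival city
        "date": "str",         # travel date, YYYY-MM-DD
    },
    "returns": "dict",         # booking confirmation
}

# 2. Build the tool: write code that satisfies the spec.
def book_train_ticket(origin: str, destination: str, date: str) -> dict:
    # A stub standing in for a real booking API: look up a train, "book" it.
    trains = {("Shanghai", "Beijing"): "G2"}   # toy timetable
    number = trains.get((origin, destination))
    if number is None:
        raise ValueError(f"no train from {origin} to {destination}")
    return {"train": number, "date": date, "status": "booked"}

# 3. Test it: make sure the tool works before relying on it.
assert book_train_ticket("Shanghai", "Beijing", "2025-05-01")["status"] == "booked"

# 4. Solve the problem: use the freshly built tool on the original request.
ticket = book_train_ticket("Shanghai", "Beijing", "2025-05-01")
print(ticket["train"])  # → G2
```

Note that step 3 is where "one-shot" agents tend to skip ahead: if the blueprint and the code disagree even slightly, the failure only surfaces at step 4.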

It's like asking a carpenter to invent a new type of saw just because you said, "I need to cut this weirdly shaped wood," and then immediately using that saw to cut the wood.

3. The Big Discovery: "One-Shot" is Hard

The researchers found something surprising: Even the smartest AI models today struggle with this.

  • The Analogy: Imagine asking a genius architect to draw a house blueprint and build the house in one single try, without any mistakes.
  • The Reality: The AI often draws a blueprint with a missing door or a window in the wrong place. Because the blueprint is slightly wrong, the house (the tool) collapses.
  • The Domino Effect: A tiny mistake in the beginning (like a typo in the tool's instructions) gets amplified. By the time the robot tries to use the tool to solve your problem, the whole thing fails. The paper calls this a "precipitous drop."

4. The Fix: The "Code-Agent" (The Iterative Builder)

The paper tested a new way of working called Code-Agent. Instead of trying to build the tool in one perfect shot, the robot is allowed to:

  1. Build a draft.
  2. Try to run it.
  3. See it crash.
  4. Read the error message.
  5. Fix the mistake.
  6. Try again.

The Result: This "try, fail, fix" loop worked wonders. It's like a human programmer debugging their code. When the robot was allowed to see its own mistakes and fix them, its success rate skyrocketed. It went from being a clumsy builder to a competent engineer.

5. Why This Matters (The "Self-Evolving" Future)

The ultimate goal of this research is Self-Evolving Agents.

  • Old Way: Humans build a tool, give it to the robot, and the robot uses it.
  • New Way (Tool-Genesis): The robot learns from its failures. It builds a tool, realizes it's flawed, fixes it, and saves the improved version for next time.

Think of it like a video game character leveling up.

  • In the old days, the character just had a sword.
  • In the Tool-Genesis world, the character finds a broken sword, fixes it, sharpens it, and eventually forges a legendary sword that they keep in their inventory for future battles.
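The "inventory" idea can be sketched as a persistent tool library: once a tool passes its tests, the agent banks it under a name so later tasks can reuse it instead of rebuilding from scratch. The `ToolLibrary` class and the `add` tool below are invented for illustration, not taken from the paper.

```python
class ToolLibrary:
    """A hypothetical inventory mapping tool names to working implementations."""

    def __init__(self):
        self._tools = {}              # name -> callable

    def save(self, name, fn):
        self._tools[name] = fn        # keep the improved, tested version

    def get(self, name):
        return self._tools.get(name)  # reuse it in a future "battle"

library = ToolLibrary()

# First task: the agent forges a tool and banks it once it works.
def add(a, b):
    return a + b

library.save("add", add)

# Later task: the tool is already in the inventory, no rebuilding needed.
tool = library.get("add")
print(tool(2, 3))  # → 5
```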

Summary

Tool-Genesis is a diagnostic test that stops treating AI as a simple button-pusher and starts treating it as a tool-maker. It reveals that while AI is good at using tools, it is currently terrible at building them from scratch without help. However, if we let the AI "debug" its own creations (like a human programmer), it can learn to build reliable, reusable tools that solve real-world problems.

The paper provides the "exam" (the benchmark) and the "study guide" (the data) to help AI researchers teach their robots to become better builders, not just better users.