SkillCraft: Teaching AI Agents to Stop Repeating Themselves

Imagine you are hiring a very smart, but slightly obsessive, personal assistant to help you with a massive project.

The Old Way (Current AI Agents):
Every time you ask your assistant to "Find the price of a red shirt, then a blue shirt, then a green shirt," they treat each request as a brand-new mystery.

They read the instructions.
They open the website for the red shirt.
They read the price.
They close the tab.
They read the instructions again for the blue shirt.
They open the website for the blue shirt... and so on.

They are incredibly smart, but they are inefficient. They are like a student who solves every math problem from scratch, even if the last 10 problems used the exact same formula. They take longer, they make more mistakes, and they use up a lot of paper (or in the AI world, "tokens" and money).

The New Way (SkillCraft):
The SkillCraft paper introduces a new way for AI agents to work. Instead of solving every single step from scratch, the agent learns to create its own shortcuts.

Think of it like this:
After the assistant finds the price of the red shirt, they realize, "Hey, I just did this whole process: Open site -> Search 'red' -> Read price -> Close tab."

Instead of doing it again for the blue shirt, they write this process down on a sticky note and label it "Find Price."
Now, when you ask for the blue shirt, they just grab the "Find Price" sticky note, plug in "blue," and hit execute. They don't need to re-read the whole manual.

What is SkillCraft?

SkillCraft is a giant gym and a scoreboard for these AI assistants: it does not train them, it tests them. Its goal is to measure whether AI can learn to:

Notice patterns: "I'm doing the same thing over and over."
Create a "Skill": Turn that pattern into a reusable tool (like a macro or a script).
Reuse the Skill: Use that tool for future tasks to save time and energy.

The researchers built 126 different "workouts" (tasks) that get progressively harder. Some are simple (finding info on 3 cats), and some are massive (analyzing 5 different software projects with dozens of steps each).

The Three Stages of Learning

The paper describes how the AI learns these skills in three phases:

The Exploration Phase (The "Try Everything" Stage):
The AI is given a task and has to figure out how to do it using basic tools. It might stumble around, trying different buttons and websites.
- Analogy: You are in a new kitchen trying to make a sandwich. You open every drawer, look at every ingredient, and figure out where the bread and knife are.
The Composition Phase (The "Write the Recipe" Stage):
Once the AI successfully makes the sandwich, it looks at what it just did. It realizes, "I did Step A, then Step B, then Step C." It writes this down as a Skill (a reusable recipe).
- Analogy: You write down the recipe: "1. Get bread, 2. Get cheese, 3. Assemble." Now you have a "Sandwich Skill."
The Reuse Phase (The "Chef Mode" Stage):
When you ask for a second sandwich (or a third, or a hundredth), the AI doesn't open the drawers again. It just grabs the "Sandwich Skill" recipe and runs it.
- Analogy: You can now make 50 sandwiches in the time it used to take to make one.

What Did They Find?

The researchers tested the smartest AI models available (like GPT-5, Claude, and Gemini) in this gym. Here is what happened:

Huge Savings: When the AI started using its own "Skills," it cut token usage (the AI equivalent of computing effort) by up to 80% and cost by up to 75%. It was like switching from driving a gas-guzzling truck to an electric scooter.
Smarter Models Learn Faster: The most intelligent models were the best at spotting patterns and creating good skills. They knew when to make a shortcut and when to just do the work manually.
The "Too Deep" Trap: The researchers tried to see if AI could make "Skills inside of Skills" (like a recipe that calls another recipe). They found this often backfired. If the inner recipe had a tiny mistake, the whole chain broke.
- Lesson: Simple, well-tested shortcuts are better than complex, nested ones.
Sharing is Caring: If a really smart AI (like Claude) created a perfect "Skill" for a task, a slightly less smart AI could use that same skill and do the job almost as well. The quality of the skill mattered more than the model using it.

Why Does This Matter?

Right now, AI is getting better at doing one thing at a time. But the real world is messy and repetitive. We don't want AI to just answer questions; we want it to work with us over long periods.

SkillCraft shows that for AI to be truly useful in the real world, it needs to stop being a robot that follows orders line-by-line and start being a smart worker who organizes its own workflow.

Without Skills: AI is a tireless but clumsy worker who forgets what they did five minutes ago and has to re-read the manual for every single task.
With Skills: AI becomes a seasoned professional who builds a toolbox of shortcuts, gets the job done faster, makes fewer mistakes, and saves everyone money.

In short: SkillCraft teaches AI to stop reinventing the wheel and start driving the car.

Here is a detailed technical summary of the paper "SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?"

1. Problem Statement

Current benchmarks for Large Language Model (LLM) agents primarily evaluate instance-level success using static sets of atomic tools. They typically ask: "Can the agent solve this specific task with the given tools?"

The Gap: These benchmarks fail to measure an agent's ability to abstract, accumulate, and reuse higher-level tool compositions (skills) across multiple tasks or within long-horizon workflows.
The Challenge: Real-world agent operations involve recurring substructures (e.g., repeated search-analyze-summarize patterns). Effective behavior requires not just executing isolated actions but forming compositional skills—reusable tool chains that capture shared structure. Existing agents often solve tasks from scratch every time, leading to inefficiency (high token usage, context window saturation) and an inability to generalize procedural knowledge.

2. Methodology: The SkillCraft Framework

The authors introduce SkillCraft, a benchmark and evaluation protocol designed to stress-test an agent's ability to form and reuse skills.

A. Benchmark Construction (Three-Stage Pipeline)

Exploratory Phase: Analyzed existing benchmarks (Toolathlon, WebArena, AgentCompany) to identify design principles for long-horizon, repetitive tasks.
Seed Task Creation: Constructed 21 seed tasks from three sources:
- Adapted high-quality tasks from existing benchmarks.
- Handcrafted web API tasks (e.g., GitLab, Open-Meteo, TVMaze).
- Local file/data processing tasks.
Systematic Scaling: Scaled task difficulty along two orthogonal dimensions to force skill abstraction:
- Quantitative Scaling: Increasing the number of entities (e.g., analyzing 3 repos vs. 5 repos).
- Complexity Scaling: Increasing the number of tool calls per subtask (e.g., fetching commits + identifying contributors + correlating data).
- Result: A final pool of 126 tasks across 6 difficulty levels and 6 domains (Entertainment, Reference, Education, Developer, Science, Food).
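The two scaling dimensions can be pictured as a small grid over each seed task. The sketch below is only an illustration: the `TaskVariant` class, the seed name, and the per-subtask call counts are hypothetical, and the source does not spell out the exact grid used. Only the "3 repos vs. 5 repos" contrast and the totals (21 seeds, 6 levels, 126 tasks) come from the summary above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TaskVariant:
    """One scaled version of a seed task (hypothetical structure)."""
    seed: str
    num_entities: int      # quantitative scaling: how many repos, cities, shows...
    calls_per_entity: int  # complexity scaling: tool calls needed per subtask

    @property
    def min_tool_calls(self) -> int:
        # Lower bound on atomic tool calls if nothing is reused.
        return self.num_entities * self.calls_per_entity


def scale_seed(seed: str, entity_counts: list[int], call_counts: list[int]) -> list[TaskVariant]:
    """Cross the two scaling dimensions to produce variants of one seed."""
    return [TaskVariant(seed, n, c) for n, c in product(entity_counts, call_counts)]


# Illustrative grid: 2 entity counts x 3 call counts = 6 variants per seed.
variants = scale_seed("gitlab-repo-analysis", entity_counts=[3, 5], call_counts=[2, 3, 4])

for v in variants:
    print(v.num_entities, v.calls_per_entity, v.min_tool_calls)

# The stated totals are consistent with 6 variants for each of the 21 seeds.
print(21 * len(variants))
```

The `min_tool_calls` product shows why both dimensions push toward abstraction: the repeated work grows multiplicatively, so a reusable skill saves more as either dimension increases.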

B. The Skill Mode Protocol

The core innovation is a lightweight evaluation protocol called Skill Mode, which enables agents to evolve their action space at test time.

Mechanism: Agents interact with a Skill Library via four minimal Model Context Protocol (MCP) primitives:
1. save_skill: Persist a successful tool sequence as executable code.
2. get_skill: Retrieve code and metadata.
3. list_skills: Discover available skills.
4. execute_skill: Run a saved skill as a higher-level tool.
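To make the four primitives concrete, here is a minimal in-memory sketch. It is not the paper's implementation and does not use a real MCP server: the `SkillLibrary` class, the `Skill` record, and the convention that every skill defines a `run(...)` function are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Skill:
    name: str
    code: str              # Python source that defines run(**params) (assumed convention)
    description: str = ""


class SkillLibrary:
    """Toy stand-in for the Skill Library behind the four primitives."""

    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}

    def save_skill(self, name: str, code: str, description: str = "") -> None:
        # compile() raises SyntaxError, so malformed code is never stored.
        compile(code, f"<skill:{name}>", "exec")
        self._skills[name] = Skill(name, code, description)

    def get_skill(self, name: str) -> dict[str, str]:
        skill = self._skills[name]
        return {"name": skill.name, "code": skill.code, "description": skill.description}

    def list_skills(self) -> list[str]:
        return sorted(self._skills)

    def execute_skill(self, name: str, **params: Any) -> Any:
        # exec() is used for brevity only; untrusted code needs a real sandbox.
        namespace: dict[str, Any] = {}
        exec(self._skills[name].code, namespace)
        return namespace["run"](**params)


library = SkillLibrary()
library.save_skill(
    "find_price",
    "def run(color, catalog):\n    return catalog[color]\n",
    description="Look up a shirt price by color",
)

catalog = {"red": 19.99, "blue": 21.5}
print(library.list_skills())
print(library.execute_skill("find_price", color="blue", catalog=catalog))
```

Storing skills as source text rather than as live function objects mirrors the idea of persisting "executable code" that can be listed, inspected, and handed to a different agent later.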
Workflow:
1. Reuse Attempt: The agent checks if an existing skill solves the current task.
2. Exploration: If no skill exists, the agent solves the task using atomic tools.
3. Composition: The successful atomic sequence is abstracted into a parameterized Python script (the "Skill").
4. Verification: A Coding Verifier validates the skill via syntax checks, runtime error reporting, and post-execution quality detection (filtering out silent failures).
5. Accumulation: Validated skills are cached and reused in subsequent tasks.
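The verification step names three checks: syntax, runtime errors, and post-execution quality. The sketch below shows one plausible way to chain them. It is an assumption-laden illustration, not the paper's Coding Verifier: the `verify_skill` and `find_empty_fields` helpers, the `run(...)` entry point, and the rule that "silent failure" means a `None` or empty value in the result are all hypothetical.

```python
import ast
from typing import Any


def find_empty_fields(value: Any, path: str = "result") -> list[str]:
    """Return paths to None, empty-string, empty-list, or empty-dict values."""
    if value is None or value == "" or value == [] or value == {}:
        return [path]
    problems: list[str] = []
    if isinstance(value, dict):
        for key, item in value.items():
            problems.extend(find_empty_fields(item, f"{path}.{key}"))
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            problems.extend(find_empty_fields(item, f"{path}[{index}]"))
    return problems


def verify_skill(code: str, test_params: dict[str, Any]) -> dict[str, Any]:
    """Run the three checks in order and report the first stage that fails."""
    # 1. Syntax check: parse without executing anything.
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return {"ok": False, "stage": "syntax", "detail": str(exc)}

    # 2. Runtime check: execute on sample parameters and report any exception.
    #    exec() is for illustration only; real skills need sandboxing.
    namespace: dict[str, Any] = {}
    try:
        exec(code, namespace)
        result = namespace["run"](**test_params)
    except Exception as exc:
        return {"ok": False, "stage": "runtime", "detail": f"{type(exc).__name__}: {exc}"}

    # 3. Quality check: code that "succeeds" but returns empty data is rejected.
    empty = find_empty_fields(result)
    if empty:
        return {"ok": False, "stage": "quality", "detail": "empty values at: " + ", ".join(empty)}
    return {"ok": True, "stage": "passed", "detail": ""}


good = "def run(repo):\n    return {'repo': repo, 'commits': 12}\n"
silent = "def run(repo):\n    return {'repo': repo, 'commits': None}\n"
broken = "def run(repo)\n    return repo\n"
crash = "def run(repo):\n    return {}['missing']\n"

for label, source in [("good", good), ("silent", silent), ("broken", broken), ("crash", crash)]:
    print(label, verify_skill(source, {"repo": "demo"})["stage"])
```

The third check is the one that matters most for reuse: a skill that runs cleanly but returns `None` would otherwise be cached and silently corrupt every later task that calls it.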

3. Key Contributions

SkillCraft Benchmark: The first benchmark explicitly designed to measure compositional skill acquisition and cross-task reuse rather than single-instance performance.
Skill Mode Protocol: A plug-and-play mechanism allowing agents to auto-compose, cache, and reuse tool chains as executable code, simulating human-like skill accumulation.
Evaluation of Hierarchical vs. Flat Skills: The paper investigates whether deep, nested skill hierarchies (skills calling other skills) are beneficial.
Cross-Model Generalization Analysis: An experiment testing if skills created by one model can be effectively executed by others.

4. Key Results

The authors evaluated state-of-the-art models (GPT-5.2, Claude 4.5 Sonnet, Gemini 3 Pro, DeepSeek, etc.) on SkillCraft.

A. Efficiency and Success Gains

Token Reduction: Skill Mode reduced token usage by up to 80% (e.g., GPT-5.2 dropped from 1.23M to 0.26M tokens per task).
Cost Reduction: Corresponding cost reductions of up to 75%.
Success Rate: Stronger models saw success rate improvements (e.g., GPT-5.2 from 87% to 90%; DeepSeek-V3.2 from 60% to 69%).
Correlation: There is a strong positive correlation ($r = 0.65$) between a model's ability to execute generated skills and its overall task success, indicating that coding ability is tightly coupled with compositional intelligence.
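As a quick arithmetic check on the figures above, the snippet below recomputes the GPT-5.2 token reduction and the success-rate gains from the numbers quoted in this section. The helper name `percent_reduction` is just for illustration; no data beyond what is stated above is used.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# GPT-5.2 tokens per task, in millions, as quoted above.
token_cut = percent_reduction(1.23, 0.26)
print(f"Token reduction: {token_cut:.1f}%")  # prints "Token reduction: 78.9%"

# Success rates (percent) without and with Skill Mode, as quoted above.
success = {"GPT-5.2": (87, 90), "DeepSeek-V3.2": (60, 69)}
gains = {model: after - before for model, (before, after) in success.items()}
for model, gain in gains.items():
    print(f"{model}: +{gain} percentage points")
```

The computed 78.9% is in line with the "up to 80%" headline figure.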

B. Insights on Skill Composition

Depth vs. Robustness: Flat (shallow) skill libraries are more reliable than deep, hierarchical compositions.
- Reason: Hierarchical modes suffer from error propagation. A null value or bug in a low-level skill cascades up, causing high-level failures. While hierarchical modes showed high execution rates, they often resulted in lower overall task success compared to flat modes.
Cross-Task Generalization: Skills learned at one difficulty level (e.g., Easy) transfer effectively to harder levels (Hard), improving both success rates and efficiency.
Cross-Model Generalization:
- Creator Quality > Executor Capability: Skills created by high-capability models (e.g., Claude) achieved 100% success across all executor models (including weaker ones).
- Inefficiency of Poor Skills: Skills created by weaker models often increased computational costs (negative token savings) when executed by other models, highlighting that skill quality is the bottleneck, not just the executor's capability.
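The error-propagation point under "Depth vs. Robustness" can be illustrated with a toy three-level chain. Everything here is hypothetical (the `FAKE_API` data, the skill names, and the flat alternative); it only demonstrates the general failure mode of a `None` from a low-level skill surfacing as an unrelated error at the top, not the paper's actual skills.

```python
from typing import Optional

# Hypothetical stand-in for a remote API: repo name -> commit count.
FAKE_API = {"repo-a": 5, "repo-b": 12}


def fetch_commit_count(repo: str) -> Optional[int]:
    """Low-level skill: silently returns None for an unknown repo."""
    return FAKE_API.get(repo)


def summarize_repo(repo: str) -> dict:
    """Mid-level skill: wraps the low-level skill without checking its output."""
    return {"repo": repo, "commits": fetch_commit_count(repo)}


def rank_repos(repos: list[str]) -> list[dict]:
    """Top-level skill: sorting breaks if any nested call produced None."""
    summaries = [summarize_repo(repo) for repo in repos]
    return sorted(summaries, key=lambda row: row["commits"], reverse=True)


def rank_repos_flat(repos: list[str]) -> list[dict]:
    """Flat alternative: one skill that validates each value where it is fetched."""
    rows = []
    for repo in repos:
        count = FAKE_API.get(repo)
        if count is None:
            raise ValueError(f"no commit data for {repo!r}")
        rows.append({"repo": repo, "commits": count})
    return sorted(rows, key=lambda row: row["commits"], reverse=True)


def describe_outcome(skill, repos: list[str]) -> str:
    """Run a skill and summarize how it ended."""
    try:
        skill(repos)
        return "ok"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"


repos = ["repo-a", "repo-b", "repo-c"]  # repo-c is missing from FAKE_API
nested_outcome = describe_outcome(rank_repos, repos)
flat_outcome = describe_outcome(rank_repos_flat, repos)
print("nested:", nested_outcome)
print("flat:  ", flat_outcome)
```

In the nested version the failure appears as a `TypeError` inside the sort, two levels away from the missing data, which is harder for an agent to diagnose. The flat version fails at the point of origin with a message that names the problem.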

C. Model Behavior Analysis

Strong Models: Exhibit "judgment," creating skills only when the abstraction overhead is justified by the task complexity (e.g., Claude avoided skills for simple tasks but used them effectively for complex ones).
Weaker Models: Tend to follow prompts rigidly, attempting to create skills even for simple tasks or persisting through multiple failed skill creation attempts, leading to inefficiency.
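The "judgment" described above can be framed as a break-even question: creating a skill has an up-front cost that only pays off after enough reuse. The cost model below is a simplifying assumption, not something taken from the paper, and all token figures are made up for illustration. It assumes the first subtask is solved manually, the skill is then written once, and every remaining subtask runs through the skill.

```python
from typing import Optional


def tokens_without_skill(n_subtasks: int, manual_tokens: int) -> int:
    """Every subtask is solved from scratch with atomic tools."""
    return n_subtasks * manual_tokens


def tokens_with_skill(n_subtasks: int, manual_tokens: int,
                      skill_tokens: int, creation_overhead: int) -> int:
    """First subtask manual, then one-time skill creation, then reuse."""
    return manual_tokens + creation_overhead + (n_subtasks - 1) * skill_tokens


def break_even_subtasks(manual_tokens: int, skill_tokens: int,
                        creation_overhead: int, limit: int = 1000) -> Optional[int]:
    """Smallest number of subtasks at which the skill route is cheaper, if any."""
    for n in range(1, limit + 1):
        with_skill = tokens_with_skill(n, manual_tokens, skill_tokens, creation_overhead)
        if with_skill < tokens_without_skill(n, manual_tokens):
            return n
    return None


# Illustrative numbers only (hypothetical, not measured).
MANUAL, SKILL, OVERHEAD = 10_000, 1_000, 12_000

for n in (1, 2, 3, 5):
    print(n, tokens_without_skill(n, MANUAL), tokens_with_skill(n, MANUAL, SKILL, OVERHEAD))

print("break-even at", break_even_subtasks(MANUAL, SKILL, OVERHEAD), "subtasks")
```

Under these made-up numbers a skill only pays off from the third subtask onward, which matches the observed pattern in spirit: skipping skill creation on short, simple tasks is the efficient choice, while repeated failed creation attempts raise the overhead and push the break-even point further out.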

5. Significance

Redefining Agent Evaluation: Shifts the focus from "Can the agent solve this?" to "Can the agent learn to solve this better over time?" This aligns with cognitive science definitions of intelligence as the efficiency of skill acquisition.
Practical Efficiency: Demonstrates that code-based abstraction is a viable path to drastically reducing the cost and latency of LLM agents in real-world, long-horizon workflows.
Design Guidance: Suggests that for robust agent systems, flat, well-tested skill libraries are currently superior to complex hierarchical nesting due to error propagation risks.
Future Direction: Highlights that the next frontier in agent intelligence is not just better reasoning, but better procedural abstraction and library management.

In summary, SkillCraft shows that LLM agents can learn to use tools skillfully by treating tool chains as reusable code artifacts. This approach yields large efficiency gains and, for stronger models, higher success rates, provided the agents have the judgment to abstract correctly and the coding ability to generate robust, error-free skills.
