SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, Xuanqing Liu, Haoran Lyu, Ze Ma, Bowei Wang, Runhui Wang, Tianyu Wang, Wengao Ye, Yue Zhang, Hanwen Xing, Yiqi Xue, Steven Dillmann, Han-chung Lee

Published 2026-03-10

Imagine you have a brilliant, super-smart intern. This intern (the AI Agent) knows a little bit about everything—history, math, coding, and cooking—because they read the entire internet. But, if you ask them to fix a specific type of industrial machine or file a complex tax return for a hedge fund, they might freeze. They know the concepts, but they don't know the step-by-step recipe for that specific job.

"Skills" are like giving that intern a specialized, pre-written cookbook or a "cheat sheet" for that specific job.

The paper SkillsBench is essentially a giant report card that asks: "Do these cheat sheets actually help the intern do their job better?"

Here is the breakdown of their findings, using some everyday analogies:

1. The Setup: The "Cheat Sheet" Experiment

The researchers created a massive testing ground called SkillsBench.

The Test: They gave 84 different tasks to 7 different "interns" (AI models).
The Conditions:
1. No Cheat Sheet: The intern tries to figure it out from scratch.
2. The Perfect Cheat Sheet: A human expert wrote a clear, step-by-step guide (a "Skill") and gave it to the intern.
3. The "Make Your Own" Cheat Sheet: The intern was told, "You don't have a guide, so write your own guide first, then do the job."

2. The Big Wins: Human Guides Work Wonders

The Result: When the intern was given a human-written cheat sheet, they got much better at their jobs.

The Analogy: Imagine a chef who knows how to cook generally. If you give them a specific, well-written recipe for "Sourdough Bread," they can bake perfect bread. Without it, they might guess and burn the loaf.
The Stats: On average, the cheat sheets raised the success rate by about 16 percentage points.
The Surprise: The improvement wasn't the same for everyone.
- Healthcare & Manufacturing: The cheat sheets were magic here. Success rates jumped by roughly 52 percentage points in healthcare and 42 in manufacturing. It's like giving a mechanic a specific manual for a new car model they've never seen before.
- Software Engineering: The improvement was smaller. Why? Because the intern already read a lot of code online, so they didn't need the manual as much.

3. The Big Fail: AI Can't Write Its Own Manuals

The Result: When the AI was told to write its own cheat sheet before doing the task, it didn't help at all. In fact, it sometimes made things worse.

The Analogy: Imagine asking a student to write their own study guide for a physics exam, and then taking the exam using only that guide. They might write down the wrong formulas or miss a key step because they don't actually know the material deeply enough to teach it.
The Lesson: AI is great at using knowledge, but it's currently terrible at creating the precise, structured instructions it needs to succeed. It needs a human to curate the "Skills."

4. The "Less is More" Rule

The Result: The researchers found that short, focused cheat sheets worked better than massive, 100-page manuals.

The Analogy: If you are trying to fix a leaky faucet, you don't want a 500-page book on "The History of Plumbing." You want a 3-step card that says: "1. Turn off water. 2. Replace washer. 3. Turn on."
The Finding: Giving the intern two or three focused cheat sheets per task was the sweet spot. With four or more, or with guides that were long and complicated, the gains shrank sharply.

5. The "Small Intern" vs. The "Big Intern"

The Result: A smaller, cheaper AI model with a good cheat sheet could often beat a massive, expensive AI model that had no cheat sheet.

The Analogy: A junior employee with a perfect, detailed checklist can often do a specific task better than a senior executive who is trying to wing it without notes. The checklist bridges the gap in experience.

Summary: What Does This Mean for the Future?

The paper tells us that AI isn't just about making the brain bigger; it's about giving it better tools.

Don't just rely on the AI's memory: It needs human-curated "Skills" (procedural guides) to handle complex, real-world jobs.
Keep it simple: Don't write long manuals for the AI. Give it short, clear, step-by-step instructions.
Human expertise is still king: The AI cannot write its own instructions yet. Humans need to be the "authors" of these skills to make the AI truly useful.

In short: AI is a powerful engine, but "Skills" are the GPS and the instruction manual. Without them, the engine is just spinning its wheels.

Here is a detailed technical summary of the paper "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks."

1. Problem Statement

Large Language Models (LLMs) have evolved into autonomous agents capable of executing complex, multi-step tasks. However, a fundamental tension exists: foundation models possess broad capabilities but lack the specific procedural knowledge required for domain-specific workflows (e.g., clinical data harmonization, specific manufacturing protocols). While fine-tuning is an option, it is expensive and sacrifices generality.

Agent Skills have emerged as a solution: structured packages of instructions, code templates, and resources that augment agents at inference time without model modification. Despite rapid adoption (e.g., in Claude Code, Gemini CLI), there is no standardized benchmark to evaluate:

Whether Skills actually improve agent performance.
Which domains benefit most.
What design principles make a Skill effective.
Whether models can reliably self-generate the procedural knowledge they need.

2. Methodology: The SkillsBench Framework

The authors introduce SkillsBench, the first benchmark treating Skills as first-class evaluation artifacts.

A. Dataset Construction

Scale: 84 tasks across 11 diverse domains (Healthcare, Manufacturing, Cybersecurity, Finance, Software Engineering, etc.).
Sources: 105 contributors (academia/industry) submitted 322 candidates; 84 were curated after rigorous filtering.
Task Structure: Each task is a containerized Docker environment containing:
- instruction.md: Human-written task description.
- environment/: Data files and a skills/ directory.
- solution/: An oracle solution (100% pass rate).
- tests/: Deterministic pytest verifiers (no LLM-as-a-judge).
Skill Definition: A modular package (SKILL.md + resources) providing procedural guidance (SOPs, workflows) rather than factual retrieval or specific solutions.
Quality Control: Strict leakage audits ensure Skills do not contain task-specific answers (e.g., exact filenames or magic numbers).
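
The task layout and leakage audit described above can be sketched as two small checks. This is a minimal illustration, not the benchmark's actual tooling: the function names, the demo task, and the forbidden strings are hypothetical, and treating every listed entry as mandatory is an assumption.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Entries named in the task structure above. Treating all of them as
# required is an assumption of this sketch.
REQUIRED_ENTRIES = ("instruction.md", "environment", "solution", "tests")


def missing_task_parts(task_dir: Path) -> list[str]:
    """Return the expected entries that are absent from task_dir."""
    missing = [name for name in REQUIRED_ENTRIES if not (task_dir / name).exists()]
    # Skills live in a skills/ directory inside environment/.
    if not (task_dir / "environment" / "skills").is_dir():
        missing.append("environment/skills")
    return missing


def leaked_terms(skill_text: str, forbidden: list[str]) -> list[str]:
    """Return task-specific strings (hypothetical list) found in a Skill's text."""
    lowered = skill_text.lower()
    return [term for term in forbidden if term.lower() in lowered]


# Build a throwaway task that deliberately lacks a solution/ directory.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "demo-task"
    (task / "environment" / "skills").mkdir(parents=True)
    (task / "tests").mkdir()
    (task / "instruction.md").write_text("Describe the task here.\n")
    demo_missing = missing_task_parts(task)

# A Skill that names an exact task file would fail the audit.
demo_leaks = leaked_terms(
    "Load results_final.csv and apply the threshold.",
    ["results_final.csv", "patient_ids.json"],
)

print(demo_missing)
print(demo_leaks)
```

A real audit would likely need more than substring matching (for example, numeric constants written in different formats), so this only shows the shape of the check.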

B. Experimental Setup

Agents & Models: Evaluated 7 configurations across 3 commercial harnesses:
- Claude Code (Opus 4.5, Opus 4.6, Sonnet 4.5, Haiku 4.5).
- Gemini CLI (Gemini 3 Pro, Gemini 3 Flash).
- Codex CLI (GPT-5.2).
Conditions: Each task was run under three conditions:
1. No Skills: Baseline (vanilla agent).
2. Curated Skills: Full set of human-authored procedural guides provided.
3. Self-Generated Skills: Agent prompted to generate its own procedural knowledge before solving (isolating latent knowledge).
Metrics: The primary metric is Pass Rate (averaged over 5 trials). The authors also report Normalized Gain ($g$), which measures proportional improvement relative to the ceiling.
Total Trajectories: 7,308 valid trajectories.
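
The two metrics above can be illustrated with a short sketch. The summary does not give the exact formula for $g$, so the definition used here, the gain divided by the remaining headroom to a 100% pass rate, is an assumption based on the phrase "relative to the ceiling". The trial outcomes are made up for illustration.

```python
from __future__ import annotations


def pass_rate(trial_outcomes: list[bool]) -> float:
    """Fraction of trials that passed the deterministic verifier."""
    if not trial_outcomes:
        raise ValueError("at least one trial is required")
    return sum(trial_outcomes) / len(trial_outcomes)


def normalized_gain(baseline: float, with_skills: float) -> float:
    """Assumed definition: (with_skills - baseline) / (1 - baseline)."""
    if not 0.0 <= baseline < 1.0:
        # A baseline of 1.0 leaves no headroom, so the ratio is undefined.
        raise ValueError("baseline must be in [0, 1)")
    return (with_skills - baseline) / (1.0 - baseline)


# Hypothetical outcomes for one task, five trials per condition.
baseline = pass_rate([True, False, False, False, False])  # 1 of 5 passed
skilled = pass_rate([True, True, True, False, False])     # 3 of 5 passed
gain = normalized_gain(baseline, skilled)

print(f"baseline={baseline:.2f} skilled={skilled:.2f} g={gain:.2f}")
```

Under this definition, a task that moves from 20% to 60% closes half of the available headroom, whereas the raw difference is 40 percentage points.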

3. Key Contributions

First Skills-Centric Benchmark: Establishes a framework to measure the efficacy of augmentation rather than just raw model capability.
Large-Scale Empirical Evaluation: Provides the first systematic evidence on how Skills impact performance across different models, harnesses, and domains.
Design Principles for Skills: Identifies that "less is more" (focused, 2–3 module Skills outperform comprehensive documentation).
Self-Generation Analysis: Demonstrates that current models cannot reliably author the procedural knowledge they need to succeed.

4. Key Results

A. Efficacy of Curated Skills

Overall Improvement: Curated Skills increased the average pass rate by +16.2 percentage points (pp).
Domain Variance: Benefits vary wildly by domain:
- High Impact: Healthcare (+51.9pp), Manufacturing (+41.9pp). These domains rely on specialized procedures underrepresented in pretraining.
- Low Impact: Software Engineering (+4.5pp), Mathematics (+6.0pp). Models already have strong priors here; Skills sometimes add overhead.
Negative Cases: 16 of 84 tasks showed negative deltas, indicating Skills can introduce conflicting guidance for tasks models already handle well.
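
The deltas above are differences in percentage points, not relative percentages. A minimal sketch of how per-task deltas and negative cases could be tallied follows; the task names and pass rates are hypothetical placeholders, not results from the paper.

```python
from __future__ import annotations


def delta_pp(baseline_rate: float, skills_rate: float) -> float:
    """Difference between two pass rates, in percentage points."""
    return round((skills_rate - baseline_rate) * 100, 1)


# Hypothetical (baseline, with curated Skills) pass rates per task.
results = {
    "task_a": (0.20, 0.80),
    "task_b": (0.60, 0.40),
    "task_c": (0.40, 0.40),
}

deltas = {name: delta_pp(base, skilled) for name, (base, skilled) in results.items()}

# Tasks where adding Skills lowered the pass rate.
negative_tasks = sorted(name for name, d in deltas.items() if d < 0)

# Unweighted mean over tasks; whether the paper weights tasks equally
# is an assumption here.
mean_delta = round(sum(deltas.values()) / len(deltas), 1)

print(deltas, negative_tasks, mean_delta)
```

Reporting the count of negative tasks alongside the mean keeps a few large wins from hiding the cases where Skills hurt.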

B. Self-Generated Skills Failure

Result: Self-generated Skills provided negligible or negative benefit (average -1.3pp).
Implication: Models cannot reliably synthesize the specific, error-free procedural steps required for complex tasks. They often generate imprecise advice or fail to recognize the need for specialized knowledge.

C. Design Factors

Quantity: 2–3 Skills per task yielded optimal results (+18.6pp). Providing 4+ Skills caused diminishing returns (+5.9pp) due to cognitive overhead.
Complexity: Focused/Compact Skills outperformed comprehensive documentation. Exhaustive docs often consumed context budget without providing actionable guidance.
Model Scaling: Smaller models with Skills (e.g., Claude Haiku 4.5 + Skills) could outperform larger models without Skills (e.g., Claude Opus 4.5 without Skills), suggesting Skills can partially compensate for model capacity.

D. Harness Reliability

Claude Code: Showed the highest utilization of Skills and the largest absolute gains (+23.3pp for Opus 4.5), likely due to native integration.
Gemini CLI: Achieved the highest raw pass rate (48.7% with Flash) but gained less from Skills than Claude Code did.
Codex: Often acknowledged Skills but frequently ignored them to solve tasks independently.

5. Significance and Implications

Context-Dependent Value: Skills are not a universal fix. Their value is highest in domains requiring specialized, brittle procedural knowledge (Healthcare, Manufacturing) and lowest in domains with strong pretraining coverage (General Software Engineering).
Human Curation is Critical: The failure of self-generated Skills indicates that, for current models, effective augmentation depends on human-curated domain expertise.
Efficiency: Focused, modular Skills are more effective than long documentation. This guides developers to write concise, step-by-step guides rather than encyclopedic manuals.
Future Research: The benchmark provides a standard for evaluating agent augmentation, moving beyond "can the model do X?" to "how much does context Y help the model do X?"

Conclusion: SkillsBench establishes that while Agent Skills are a powerful tool for bridging the gap between general LLM capabilities and specialized workflows, their effectiveness is highly contingent on domain specificity, Skill design quality, and the agent harness's ability to utilize them. The dataset and evaluation harness are open-sourced at skillsbench.ai.