SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

The paper introduces SkillsBench, a benchmark showing that curated agent skills significantly boost LLM performance across diverse domains, often allowing smaller models to match larger ones, while self-generated skills offer no benefit and the effects vary widely by task.

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, Xuanqing Liu, Haoran Lyu, Ze Ma, Bowei Wang, Runhui Wang, Tianyu Wang, Wengao Ye, Yue Zhang, Hanwen Xing, Yiqi Xue, Steven Dillmann, Han-chung Lee

Published 2026-03-10

Imagine you have a brilliant, super-smart intern. This intern (the AI Agent) knows a little bit about everything (history, math, coding, and cooking) because they have read the entire internet. But if you ask them to fix a specific type of industrial machine or file a complex tax return for a hedge fund, they might freeze. They know the concepts, but they don't know the step-by-step recipe for that specific job.

"Skills" are like giving that intern a specialized, pre-written cookbook or a "cheat sheet" for that specific job.

The paper SkillsBench is essentially a giant report card that asks: "Do these cheat sheets actually help the intern do their job better?"

Here is the breakdown of their findings, using some everyday analogies:

1. The Setup: The "Cheat Sheet" Experiment

The researchers created a massive testing ground called SkillsBench.

  • The Test: They gave 84 different tasks to 7 different "interns" (AI models).
  • The Conditions:
    1. No Cheat Sheet: The intern tries to figure it out from scratch.
    2. The Perfect Cheat Sheet: A human expert wrote a clear, step-by-step guide (a "Skill") and gave it to the intern.
    3. The "Make Your Own" Cheat Sheet: The intern was told, "You don't have a guide, so write your own guide first, then do the job."
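
The three conditions above can be sketched as a tiny test harness. This is a minimal illustration, not the paper's actual code: the `Agent` type, `run_conditions`, `pass_rates`, and the `write_skill` helper are all hypothetical names invented for this example.

```python
from typing import Callable, Optional

# Hypothetical type: an "agent" is any callable that takes a task prompt and an
# optional cheat sheet, and returns True if the task's check passed.
Agent = Callable[[str, Optional[str]], bool]

CONDITIONS = ("no_skill", "curated_skill", "self_generated_skill")


def run_conditions(
    agent: Agent,
    task: str,
    curated_skill: str,
    write_skill: Callable[[str], str],
) -> dict[str, bool]:
    """Run one task under the three conditions and report pass/fail for each."""
    return {
        # 1. No cheat sheet: the agent works from scratch.
        "no_skill": agent(task, None),
        # 2. Human-written cheat sheet.
        "curated_skill": agent(task, curated_skill),
        # 3. The agent drafts its own guide first, then attempts the task with it.
        "self_generated_skill": agent(task, write_skill(task)),
    }


def pass_rates(results: list[dict[str, bool]]) -> dict[str, float]:
    """Fraction of tasks passed under each condition."""
    return {c: sum(r[c] for r in results) / len(results) for c in CONDITIONS}
```

Calling `run_conditions` once per task and feeding the list of results to `pass_rates` gives one success rate per condition, which is the kind of number the rest of this summary compares.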

2. The Big Wins: Human Guides Work Wonders

The Result: When the intern was given a human-written cheat sheet, they got much better at their jobs.

  • The Analogy: Imagine a chef who knows how to cook generally. If you give them a specific, well-written recipe for "Sourdough Bread," they can bake perfect bread. Without it, they might guess and burn the loaf.
  • The Stats: On average, the cheat sheets raised success rates by about 16 percentage points.
  • The Surprise: The improvement varied a lot from one domain to another.
    • Healthcare & Manufacturing: The cheat sheets were magic here. Success jumped by over 50%. It's like giving a mechanic a specific manual for a new car model they've never seen before.
    • Software Engineering: The improvement was smaller. Why? Because the intern already read a lot of code online, so they didn't need the manual as much.
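
When comparing success rates with and without a cheat sheet, it helps to keep two readings apart: the gain in percentage points and the relative gain. The sketch below shows both. The function name `uplift` is hypothetical, and the pass rates are made up for illustration; they are not figures from the paper.

```python
from typing import Optional


def uplift(pass_without: float, pass_with: float) -> tuple[float, Optional[float]]:
    """Return (gain in percentage points, relative gain in percent).

    The relative gain is None when the baseline is zero, because a ratio
    against zero is undefined.
    """
    points = (pass_with - pass_without) * 100
    relative = None if pass_without == 0 else points / pass_without
    return points, relative


# Made-up pass rates (without, with) for two imaginary domains.
example = {"domain_a": (0.20, 0.70), "domain_b": (0.60, 0.65)}

for name, (before, after) in example.items():
    points, relative = uplift(before, after)
    print(f"{name}: +{points:.0f} points, +{relative:.0f}% relative")
```

A domain that starts from a low baseline can show a modest gain in points and a huge relative gain, so it is worth checking which of the two a headline number refers to.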

3. The Big Fail: AI Can't Write Its Own Manuals

The Result: When the AI was told to write its own cheat sheet before doing the task, it didn't help at all. In fact, it sometimes made things worse.

  • The Analogy: Imagine asking a student to write their own study guide for a physics exam, and then taking the exam using only that guide. They might write down the wrong formulas or miss a key step because they don't actually know the material deeply enough to teach it.
  • The Lesson: AI is good at following a well-written guide, but it cannot yet reliably write the precise, structured instructions it needs to succeed. It needs a human to curate the "Skills."

4. The "Less is More" Rule

The Result: The researchers found that short, focused cheat sheets worked better than exhaustive, cover-everything documentation.

  • The Analogy: If you are trying to fix a leaky faucet, you don't want a 500-page book on "The History of Plumbing." You want a 3-step card that says: "1. Turn off water. 2. Replace washer. 3. Turn on."
  • The Finding: Cheat sheets with just 2 or 3 steps were the sweet spot. If the guide was too long and complicated, the AI got confused and ignored it.
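
To make "short and focused" concrete, the faucet card from the analogy can be written down as a small data structure and rendered as plain-text instructions. This is only a sketch: the `Skill` class and its `render` method are hypothetical and are not the file format used in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical container for a cheat sheet; not the paper's format."""

    name: str
    steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the skill as numbered plain-text instructions."""
        lines = [f"Skill: {self.name}"]
        lines += [f"{i}. {step}" for i, step in enumerate(self.steps, start=1)]
        return "\n".join(lines)


faucet = Skill(
    name="Fix a leaky faucet",
    steps=["Turn off water.", "Replace washer.", "Turn on."],
)
print(faucet.render())
```

Keeping the steps in a list makes it easy to see at a glance how long a guide has grown before handing it to the agent.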

5. The "Small Intern" vs. The "Big Intern"

The Result: A smaller, cheaper AI model with a good cheat sheet could often match or even beat a massive, expensive AI model that had no cheat sheet.

  • The Analogy: A junior employee with a perfect, detailed checklist can often do a specific task better than a senior executive who is trying to wing it without notes. The checklist bridges the gap in experience.

Summary: What Does This Mean for the Future?

The paper tells us that AI isn't just about making the brain bigger; it's about giving it better tools.

  • Don't just rely on the AI's memory: It needs human-curated "Skills" (procedural guides) to handle complex, real-world jobs.
  • Keep it simple: Don't write long manuals for the AI. Give it short, clear, step-by-step instructions.
  • Human expertise is still king: The AI cannot write its own instructions yet. Humans need to be the "authors" of these skills to make the AI truly useful.

In short: AI is a powerful engine, but "Skills" are the GPS and the instruction manual. Without them, the engine is just spinning its wheels.