XpertBench: Expert-Level Tasks with Rubrics-Based Evaluation

The paper introduces XpertBench, a high-fidelity benchmark comprising 1,346 expert-verified tasks across diverse professional domains, evaluated via detailed rubrics and a novel ShotJudge paradigm. The evaluation reveals that even state-of-the-art LLMs struggle with genuine expert-level cognition, achieving a peak success rate of only ~66%.

Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Yueyan Qiu, Yanle Ren, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu

Published 2026-04-06

Imagine you've been training a brilliant student, let's call him "AI," for years. You've tested him on everything from math quizzes to history trivia. He's aced every test, getting perfect scores on the standard exams everyone uses to measure intelligence.

But then, you decide to give him a real-world job. You ask him to:

  • Diagnose a patient with a rare, confusing set of symptoms.
  • Draft a complex legal contract for a multi-million dollar merger.
  • Design a new school curriculum that actually helps struggling kids.

Suddenly, the AI starts stumbling. It gives generic answers, misses critical details, or gets confused by the messy, open-ended nature of real life. It turns out that being good at a multiple-choice test doesn't mean you're ready to be a doctor, a lawyer, or a teacher.

This is exactly the problem XpertBench is trying to solve.

The Problem: The "Exam Trap"

For a long time, we've measured AI intelligence using "exam-style" benchmarks. Think of these like high school standardized tests (SATs). They have clear questions and one right answer.

  • The Issue: AI has gotten so good at these tests that it's hitting a "ceiling." It's getting 99% on the SATs, but that doesn't tell us if it can actually do a job.
  • The Analogy: It's like judging a chef solely on how well they can recite a recipe from memory, without ever letting them cook a meal in a real kitchen with a messy stove and missing ingredients.

The Solution: XpertBench (The "Real World Internship")

The researchers at ByteDance Seed built XpertBench, which is less like a test and more like a rigorous, real-world internship.

Instead of asking the AI, "What is the capital of France?" they ask it: "Here is a messy financial report for two aerospace companies. Analyze their cash flow, compare their profit margins, and tell us which one is a safer investment for the next year, citing specific data."

Here is how they built it:

  1. The "Real Bosses" (The Experts): They didn't just write these questions themselves. They hired over 1,000 actual experts—real doctors, lawyers, finance pros, and researchers. These are the people who actually do these jobs every day.
  2. The "Job Description" (The Tasks): These experts wrote 1,346 complex tasks based on things they actually do. There are no "right answers" in a simple sense; there are only "good professional outcomes" and "bad ones."
  3. The "Rubric" (The Grading Sheet): This is the secret sauce. In a normal test, you get a point for the right answer. In XpertBench, every task has a detailed checklist (a rubric) with 15 to 40 specific checkpoints (see the sketch after this list).
    • Example: Did the AI use the right legal clause? Did it calculate the tax correctly? Did it avoid making up facts?
    • It's like a teacher grading an essay not just on "A or F," but on specific things like "Did you cite three sources?" "Is your grammar perfect?" "Did you address the counter-argument?"
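
To make the rubric idea concrete, here is a minimal sketch of how a checkpoint-based score could be computed. The class names, weights, and the example task are illustrative assumptions, not the paper's actual data format:

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """One binary criterion on the grading sheet, e.g. 'cites specific cash-flow figures'."""
    description: str
    weight: float = 1.0  # some checkpoints may count more than others

@dataclass
class Task:
    prompt: str
    rubric: list[Checkpoint]  # XpertBench tasks reportedly carry 15 to 40 of these

def score_response(task: Task, checkpoint_passed: list[bool]) -> float:
    """Weighted fraction of rubric checkpoints the response satisfied (0.0 to 1.0)."""
    assert len(checkpoint_passed) == len(task.rubric)
    total = sum(cp.weight for cp in task.rubric)
    earned = sum(cp.weight for cp, ok in zip(task.rubric, checkpoint_passed) if ok)
    return earned / total

# Hypothetical mini finance-analysis task with three checkpoints.
task = Task(
    prompt="Compare the cash flow of two aerospace companies and recommend the safer investment.",
    rubric=[
        Checkpoint("Cites specific figures from both reports"),
        Checkpoint("Computes profit margins correctly"),
        Checkpoint("Avoids fabricating numbers absent from the source data", weight=2.0),
    ],
)
print(score_response(task, [True, True, False]))  # -> 0.5
```

The point is that a response is never just "right" or "wrong": it earns partial credit for each professional behavior it gets right, much as a partially correct legal draft or diagnosis would be judged in practice.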

The "ShotJudge" (The Smart Grader)

How do you grade 1,346 complex essays without hiring 1,000 human teachers? That would take forever and cost a fortune.

They invented ShotJudge.

  • The Analogy: Imagine a robot grader. If you just ask the robot to grade an essay, it might be biased or lazy. But if you show the robot a few worked examples of a human expert grading similar essays first, the robot learns exactly how to think like a human expert.
  • They use a "few-shot" method: Show the AI judge a few examples of how a human expert scored a task, and then let the AI judge the rest. This keeps the grading consistent and fair, without needing a human for every single task.
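
Here is a rough sketch of what such a few-shot judging loop could look like. Everything in it (the prompt wording, the `call_llm` placeholder, the example format) is an illustrative assumption, not the paper's actual ShotJudge implementation:

```python
# A minimal few-shot LLM-as-judge sketch in the spirit of ShotJudge.
# `call_llm` is a hypothetical stand-in for whatever model API you use.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError

def build_judge_prompt(checkpoint: str, response: str,
                       expert_examples: list[dict]) -> str:
    """Prepend a few human-expert-graded examples so the judge mimics expert grading."""
    parts = ["You are grading one rubric checkpoint. Answer PASS or FAIL with a one-line reason.\n"]
    for ex in expert_examples:  # e.g. {"checkpoint": ..., "response": ..., "verdict": ..., "reason": ...}
        parts.append(
            f"Checkpoint: {ex['checkpoint']}\nResponse: {ex['response']}\n"
            f"Expert verdict: {ex['verdict']} ({ex['reason']})\n"
        )
    parts.append(f"Checkpoint: {checkpoint}\nResponse: {response}\nVerdict:")
    return "\n".join(parts)

def judge_checkpoint(checkpoint: str, response: str,
                     expert_examples: list[dict]) -> bool:
    """Return True if the few-shot judge says the checkpoint is satisfied."""
    reply = call_llm(build_judge_prompt(checkpoint, response, expert_examples))
    return reply.strip().upper().startswith("PASS")
```

The expert-graded examples are what anchor the judge: without them, an automated grader can drift between lenient and strict from one task to the next, which is the consistency problem ShotJudge is designed to address.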

What Did They Find? (The Results)

They ran the world's best AI models through this "Real World Internship." The results were a wake-up call:

  1. The Ceiling is Low: Even the smartest AIs only scored about 55% to 66%. In a real-world job, that's barely a passing grade. They are far from being true "experts."
  2. Specialization is Key: No single AI is good at everything.
    • One AI was a Finance Wizard (great at money stuff) but terrible at Science.
    • Another was a Lawyer (great at rules and logic) but struggled with creative writing.
    • The Metaphor: It's like hiring a person who is a world-class swimmer but asking them to climb a mountain. They might be the best swimmer in the world, but they'll fail the mountain climb. We need to pick the right tool for the right job.
  3. The "Hallucination" Trap: When the AI gets stuck, it doesn't just say "I don't know." It often starts making things up (hallucinating) or gets distracted by irrelevant information it found on the internet, leading to a complete breakdown in logic.

Why This Matters

This paper tells us that we are at a turning point. We can't just keep making AI smarter at trivia. To make AI truly useful as a professional partner (a "co-pilot" for doctors, lawyers, and engineers), we need to test it on real, messy, complex work.

XpertBench is the new standard. It's the difference between saying, "This AI is smart because it passed the test," and saying, "This AI is smart because it can actually do the job."
