XpertBench: Expert-Level Tasks with Rubrics-Based Evaluation

The paper introduces XpertBench, a high-fidelity benchmark comprising 1,346 expert-verified tasks across diverse professional domains, evaluated via detailed rubrics and a novel ShotJudge paradigm. The evaluation reveals that even state-of-the-art LLMs struggle with genuine expert-level cognition, achieving a peak success rate of only ~66%.

Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Yueyan Qiu, Yanle Ren, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu

Published 2026-04-06

Imagine you've been training a brilliant student, let's call him "AI," for years. You've tested him on everything from math quizzes to history trivia. He's aced every test, getting perfect scores on the standard exams everyone uses to measure intelligence.

But then, you decide to give him a real-world job. You ask him to:

  • Diagnose a patient with a rare, confusing set of symptoms.
  • Draft a complex legal contract for a multi-million dollar merger.
  • Design a new school curriculum that actually helps struggling kids.

Suddenly, the AI starts stumbling. It gives generic answers, misses critical details, or gets confused by the messy, open-ended nature of real life. It turns out that being good at a multiple-choice test doesn't mean you're ready to be a doctor, a lawyer, or a teacher.

This is exactly the problem XpertBench is trying to solve.

The Problem: The "Exam Trap"

For a long time, we've measured AI intelligence using "exam-style" benchmarks. Think of these like high school standardized tests (SATs). They have clear questions and one right answer.

  • The Issue: AI has gotten so good at these tests that it's hitting a "ceiling." It's getting 99% on the SATs, but that doesn't tell us if it can actually do a job.
  • The Analogy: It's like judging a chef solely on how well they can recite a recipe from memory, without ever letting them cook a meal in a real kitchen with a messy stove and missing ingredients.

The Solution: XpertBench (The "Real World Internship")

The researchers at ByteDance Seed built XpertBench, which is less like a test and more like a rigorous, real-world internship.

Instead of asking the AI, "What is the capital of France?" they ask it: "Here is a messy financial report for two aerospace companies. Analyze their cash flow, compare their profit margins, and tell us which one is a safer investment for the next year, citing specific data."

Here is how they built it:

  1. The "Real Bosses" (The Experts): They didn't just write these questions themselves. They hired over 1,000 actual experts—real doctors, lawyers, finance pros, and researchers. These are the people who actually do these jobs every day.
  2. The "Job Description" (The Tasks): These experts wrote 1,346 complex tasks based on things they actually do. There are no "right answers" in a simple sense; there are only "good professional outcomes" and "bad ones."
  3. The "Rubric" (The Grading Sheet): This is the secret sauce. In a normal test, you get a point for the right answer. In XpertBench, every task has a detailed checklist (a rubric) with 15 to 40 specific checkpoints (see the sketch after this list).
    • Example: Did the AI use the right legal clause? Did it calculate the tax correctly? Did it avoid making up facts?
    • It's like a teacher grading an essay not just on "A or F," but on specific things like "Did you cite three sources?" "Is your grammar perfect?" "Did you address the counter-argument?"
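
To make the rubric idea concrete, here is a minimal sketch of how a checkpoint-based score could be computed. The class names, weights, and the example task are illustrative assumptions, not the paper's actual data format:

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """One binary criterion on the grading sheet, e.g. 'cites specific cash-flow figures'."""
    description: str
    weight: float = 1.0  # some checkpoints may count more than others

@dataclass
class Task:
    prompt: str
    rubric: list[Checkpoint]  # XpertBench tasks reportedly carry 15 to 40 of these

def score_response(task: Task, checkpoint_passed: list[bool]) -> float:
    """Weighted fraction of rubric checkpoints the response satisfied (0.0 to 1.0)."""
    assert len(checkpoint_passed) == len(task.rubric)
    total = sum(cp.weight for cp in task.rubric)
    earned = sum(cp.weight for cp, ok in zip(task.rubric, checkpoint_passed) if ok)
    return earned / total

# Hypothetical mini finance-analysis task with three checkpoints.
task = Task(
    prompt="Compare the cash flow of two aerospace companies and recommend the safer investment.",
    rubric=[
        Checkpoint("Cites specific figures from both reports"),
        Checkpoint("Computes profit margins correctly"),
        Checkpoint("Avoids fabricating numbers absent from the source data", weight=2.0),
    ],
)
print(score_response(task, [True, True, False]))  # -> 0.5
```

The point is that a response is never just "right" or "wrong": it earns partial credit for each professional behavior it gets right, much as a partially correct legal draft or diagnosis would be judged in practice.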

The "ShotJudge" (The Smart Grader)

How do you grade 1,346 complex essays without hiring 1,000 human teachers? That would take forever and cost a fortune.

They invented ShotJudge.

  • The Analogy: Imagine a robot grader. If you just ask the robot to grade an essay, it might be biased or lazy. But if you show the robot a few worked examples of a human expert grading similar essays first, the robot learns exactly how to think like a human expert.
  • They use a "few-shot" method: Show the AI judge a few examples of how a human expert scored a task, and then let the AI judge the rest. This keeps the grading consistent and fair, without needing a human for every single task.
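
Here is a rough sketch of what such a few-shot judging loop could look like. Everything in it (the prompt wording, the `call_llm` placeholder, the example format) is an illustrative assumption, not the paper's actual ShotJudge implementation:

```python
# A minimal few-shot LLM-as-judge sketch in the spirit of ShotJudge.
# `call_llm` is a hypothetical stand-in for whatever model API you use.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError

def build_judge_prompt(checkpoint: str, response: str,
                       expert_examples: list[dict]) -> str:
    """Prepend a few human-expert-graded examples so the judge mimics expert grading."""
    parts = ["You are grading one rubric checkpoint. Answer PASS or FAIL with a one-line reason.\n"]
    for ex in expert_examples:  # e.g. {"checkpoint": ..., "response": ..., "verdict": ..., "reason": ...}
        parts.append(
            f"Checkpoint: {ex['checkpoint']}\nResponse: {ex['response']}\n"
            f"Expert verdict: {ex['verdict']} ({ex['reason']})\n"
        )
    parts.append(f"Checkpoint: {checkpoint}\nResponse: {response}\nVerdict:")
    return "\n".join(parts)

def judge_checkpoint(checkpoint: str, response: str,
                     expert_examples: list[dict]) -> bool:
    """Return True if the few-shot judge says the checkpoint is satisfied."""
    reply = call_llm(build_judge_prompt(checkpoint, response, expert_examples))
    return reply.strip().upper().startswith("PASS")
```

The expert-graded examples are what anchor the judge: without them, an automated grader can drift between lenient and strict from one task to the next, which is the consistency problem ShotJudge is designed to address.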

What Did They Find? (The Results)

They ran the world's best AI models through this "Real World Internship." The results were a wake-up call:

  1. The Ceiling is Low: Even the smartest AIs only scored about 55% to 66%. In a real-world job, that's barely a passing grade. They are far from being true "experts."
  2. Specialization is Key: No single AI is good at everything.
    • One AI was a Finance Wizard (great at money stuff) but terrible at Science.
    • Another was a Lawyer (great at rules and logic) but struggled with creative writing.
    • The Metaphor: It's like hiring a person who is a world-class swimmer but asking them to climb a mountain. They might be the best swimmer in the world, but they'll fail the mountain climb. We need to pick the right tool for the right job.
  3. The "Hallucination" Trap: When the AI gets stuck, it doesn't just say "I don't know." It often starts making things up (hallucinating) or gets distracted by irrelevant information it found on the internet, leading to a complete breakdown in logic.

Why This Matters

This paper tells us that we are at a turning point. We can't just keep making AI smarter at trivia. To make AI truly useful as a professional partner (a "co-pilot" for doctors, lawyers, and engineers), we need to test it on real, messy, complex work.

XpertBench is the new standard. It's the difference between saying, "This AI is smart because it passed the test," and saying, "This AI is smart because it can actually do the job."
