$OneMillion-Bench: How Far are Language Agents from Human Experts?

The paper introduces $OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning five professional domains. It is designed to rigorously evaluate the reliability, reasoning depth, and practical readiness of language agents in complex, real-world scenarios that existing benchmarks fail to address.

Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen, Tiliang Duan, Jiayun Dong, Xiaobo Hu, Zixia Jia, Yang Liu, Tao Peng, Yixin Ren, Ran Tian, Zaiyuan Wang, Yanglihong Xiao, Gang Yao, Lingyue Yin, Ge Zhang, Chun Zhang, Jianpeng Jiao, Zilong Zheng, Yuan Gong

Published Tue, 10 Ma

Imagine you've spent years teaching a robot to take tests. It can ace a multiple-choice quiz on history, solve a math problem, or write a poem about a cat. You think, "Wow, this robot is a genius! It's ready for the real world!"

But then, you hand it a real job: "Go fix a leaking pipe in a 100-year-old building while the owner is watching, and don't break the antique tiles." Suddenly, the robot freezes. It knows what a pipe is, but it doesn't know how to be a plumber in a messy, high-stakes situation.

That is exactly the problem the paper $OneMillion-Bench is trying to solve.

Here is the story of the paper, broken down into simple concepts and analogies.

1. The Problem: The "Driving Test" vs. The "Taxi Driver"

For a long time, we tested AI (Language Models) like we test drivers in a driving school. We put them in a simulator with perfect weather, clear lines, and no other cars. They pass the test easily.

But in the real world, being a professional (like a lawyer, a doctor, or a financial analyst) is like driving a taxi in a chaotic city during a storm. You have to:

  • Find the right map (search for information).
  • Ignore fake road signs (spot conflicting evidence).
  • Follow strict traffic laws (comply with regulations).
  • Make decisions that cost money if you get it wrong.

Current AI benchmarks are like the driving school simulator. They don't tell us if the AI can actually handle the chaos of a real job.

2. The Solution: The "$1 Million Job Interview"

The authors created a new test called $OneMillion-Bench.

Instead of asking the AI to solve a math equation, they gave it 400 real-world professional tasks. These aren't made-up questions; they are things a senior expert would actually do.

  • The Lawyer: "Check this contract for a loophole that could cost us millions."
  • The Doctor: "Design a treatment plan for a patient with a rare condition, considering their specific insurance rules."
  • The Engineer: "Fix this code that is crashing our server."

Why is it called "$1 Million"?
The researchers calculated how much it would cost to hire a human expert to do these tasks. They added up the hours and the hourly wage of top professionals in the US and China. The total value of all the tasks in the test is over $1 million.
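The valuation described above is simple arithmetic: sum, over all tasks, the expert hours multiplied by the expert's hourly wage. A minimal sketch, using hypothetical hours and wages (not the paper's actual figures):

```python
# Hypothetical illustration of the benchmark's valuation method:
# total value = sum over tasks of (expert hours * expert hourly wage).
# The domains, hours, and wages below are made up for illustration.
tasks = [
    {"domain": "law",      "hours": 12, "hourly_wage": 450},
    {"domain": "medicine", "hours": 20, "hourly_wage": 300},
    {"domain": "finance",  "hours": 8,  "hourly_wage": 250},
]

total_value = sum(t["hours"] * t["hourly_wage"] for t in tasks)
print(f"Total value of {len(tasks)} tasks: ${total_value:,}")
```

Scale this up to 400 senior-expert tasks and the total crosses the $1 million mark.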

Think of it this way: If an AI can do these tasks, it's not just "smart"; it's economically valuable. It's like saying, "This robot can earn you a million dollars in work."

3. How They Grade the AI: The "Rubric" (The Scorecard)

In school, you get a grade based on the final answer (Right or Wrong). In the real world, how you get the answer matters just as much as the answer itself.

If a lawyer gives you the right verdict but cites a law that was repealed 20 years ago, they are still in trouble.

So, the researchers created a Rubric (a detailed scorecard) for every task. It's like a judge's checklist:

  • Did you find the right source? (Factual Accuracy)
  • Did you follow the rules? (Professional Compliance)
  • Is your logic sound? (Reasoning)
  • Did you make a dangerous mistake? (Negative Penalties)

They even gave negative points for bad behavior. If the AI hallucinates (makes things up) or ignores a safety rule, it gets docked points, just like a driver getting a ticket for speeding.
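The checklist-plus-penalties idea can be sketched as a scoring function. The criterion names and weights below are hypothetical, not taken from the paper; the point is that deductions mean a fluent but unsafe answer can score near zero:

```python
# Hypothetical sketch of rubric-based grading: positive criteria add
# weighted credit, and violations (hallucination, safety breaches)
# subtract points. Weights and criterion names are illustrative.
def score_response(checks: dict, penalties: dict) -> float:
    weights = {  # positive criteria
        "factual_accuracy": 0.4,
        "professional_compliance": 0.3,
        "sound_reasoning": 0.3,
    }
    deductions = {  # negative criteria
        "hallucination": 0.5,
        "safety_violation": 1.0,
    }
    score = sum(w for name, w in weights.items() if checks.get(name))
    score -= sum(d for name, d in penalties.get and deductions.items() if penalties.get(name))
    return max(score, 0.0)  # floor the score at zero
```

Under this toy rubric, an answer with sound reasoning and accurate facts that nonetheless hallucinates a source loses most of its credit, mirroring how the real lawyer citing a repealed law is "still in trouble."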

4. The Results: Who Passed the Test?

The researchers tested 35 different AI models (the "contestants") on this million-dollar job interview. Here's what they found:

  • The "Search" Superpower: The best performers were the ones allowed to use Web Search. It's like giving the driver a GPS and a radio to check traffic. However, for some weaker models, the GPS just confused them, and they crashed.
  • The "Deep Research" Trap: Some models were built specifically to do long, deep research. They did okay, but they didn't beat the smartest general-purpose models that were just given a good search tool. It turns out, being a "generalist with a good tool" is often better than being a "specialist with a bad tool."
  • The Gap is Huge: Even the best AI (Claude-Opus-4.6) only passed about 43% of the tasks. That means for more than half the jobs, the AI wasn't ready to work alone yet. It needs a human supervisor.
  • The "Near-Miss" Problem: Many models got part of the answer right. They looked smart, but they missed one critical detail that made the whole solution useless. It's like building a house with perfect walls but forgetting the roof.

5. The Big Takeaway

This paper is a wake-up call.

We have been celebrating AI for being able to write poems and pass trivia quizzes. But $OneMillion-Bench shows us that when it comes to doing actual, paid professional work, AI is still a bit like a bright intern who hasn't finished their training.

It has the knowledge, but it lacks the reliability, the caution, and the deep understanding required to be trusted with a million-dollar job.

The Bottom Line:
We are moving from the era of "Chatbots that talk" to "Agents that work." This benchmark is the first real ruler to measure if they are ready to get a paycheck. The answer? They are getting there, but they aren't quite ready to replace the experts just yet.