$OneMillion-Bench: How Far are Language Agents from Human Experts?

The paper introduces $OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning five professional domains. It is designed to rigorously evaluate the reliability, reasoning depth, and practical readiness of language agents in complex, real-world scenarios that existing benchmarks fail to address.

Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen, Tiliang Duan, Jiayun Dong, Xiaobo Hu, Zixia Jia, Yang Liu, Tao Peng, Yixin Ren, Ran Tian, Zaiyuan Wang, Yanglihong Xiao, Gang Yao, Lingyue Yin, Ge Zhang, Chun Zhang, Jianpeng Jiao, Zilong Zheng, Yuan Gong

Published Tue, 10 Ma

Imagine you've spent years teaching a robot to take tests. It can ace a multiple-choice quiz on history, solve a math problem, or write a poem about a cat. You think, "Wow, this robot is a genius! It's ready for the real world!"

But then, you hand it a real job: "Go fix a leaking pipe in a 100-year-old building while the owner is watching, and don't break the antique tiles." Suddenly, the robot freezes. It knows what a pipe is, but it doesn't know how to be a plumber in a messy, high-stakes situation.

That is exactly the problem the paper $OneMillion-Bench is trying to solve.

Here is the story of the paper, broken down into simple concepts and analogies.

1. The Problem: The "Driving Test" vs. The "Taxi Driver"

For a long time, we tested AI (Language Models) like we test drivers in a driving school. We put them in a simulator with perfect weather, clear lines, and no other cars. They pass the test easily.

But in the real world, being a professional (like a lawyer, a doctor, or a financial analyst) is like driving a taxi in a chaotic city during a storm. You have to:

  • Find the right map (search for information).
  • Ignore fake road signs (spot conflicting evidence).
  • Follow strict traffic laws (comply with regulations).
  • Make decisions that cost money if you get it wrong.

Current AI benchmarks are like the driving school simulator. They don't tell us if the AI can actually handle the chaos of a real job.

2. The Solution: The "$1 Million Job Interview"

The authors created a new test called $OneMillion-Bench.

Instead of asking the AI to solve a math equation, they gave it 400 real-world professional tasks. These aren't made-up questions; they are things a senior expert would actually do.

  • The Lawyer: "Check this contract for a loophole that could cost us millions."
  • The Doctor: "Design a treatment plan for a patient with a rare condition, considering their specific insurance rules."
  • The Engineer: "Fix this code that is crashing our server."

Why is it called "$1 Million"?
The researchers calculated how much it would cost to hire a human expert to do these tasks. They added up the hours and the hourly wage of top professionals in the US and China. The total value of all the tasks in the test is over $1 million.
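The valuation described above is simple arithmetic: sum, over all tasks, the expert hours multiplied by the expert's hourly wage. A minimal sketch, using hypothetical hours and wages (not the paper's actual figures):

```python
# Hypothetical illustration of the benchmark's valuation method:
# total value = sum over tasks of (expert hours * expert hourly wage).
# The domains, hours, and wages below are made up for illustration.
tasks = [
    {"domain": "law",      "hours": 12, "hourly_wage": 450},
    {"domain": "medicine", "hours": 20, "hourly_wage": 300},
    {"domain": "finance",  "hours": 8,  "hourly_wage": 250},
]

total_value = sum(t["hours"] * t["hourly_wage"] for t in tasks)
print(f"Total value of {len(tasks)} tasks: ${total_value:,}")
```

Scale this up to 400 senior-expert tasks and the total crosses the $1 million mark.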

Think of it this way: If an AI can do these tasks, it's not just "smart"; it's economically valuable. It's like saying, "This robot can earn you a million dollars in work."

3. How They Grade the AI: The "Rubric" (The Scorecard)

In school, you get a grade based on the final answer (Right or Wrong). In the real world, how you get the answer matters just as much as the answer itself.

If a lawyer gives you the right verdict but cites a law that was repealed 20 years ago, they are still in trouble.

So, the researchers created a Rubric (a detailed scorecard) for every task. It's like a judge's checklist:

  • Did you find the right source? (Factual Accuracy)
  • Did you follow the rules? (Professional Compliance)
  • Is your logic sound? (Reasoning)
  • Did you make a dangerous mistake? (Negative Penalties)

They even gave negative points for bad behavior. If the AI hallucinates (makes things up) or ignores a safety rule, it gets docked points, just like a driver getting a ticket for speeding.
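The checklist-plus-penalties idea can be sketched as a scoring function. The criterion names and weights below are hypothetical, not taken from the paper; the point is that deductions mean a fluent but unsafe answer can score near zero:

```python
# Hypothetical sketch of rubric-based grading: positive criteria add
# weighted credit, and violations (hallucination, safety breaches)
# subtract points. Weights and criterion names are illustrative.
def score_response(checks: dict, penalties: dict) -> float:
    weights = {  # positive criteria
        "factual_accuracy": 0.4,
        "professional_compliance": 0.3,
        "sound_reasoning": 0.3,
    }
    deductions = {  # negative criteria
        "hallucination": 0.5,
        "safety_violation": 1.0,
    }
    score = sum(w for name, w in weights.items() if checks.get(name))
    score -= sum(d for name, d in penalties.get and deductions.items() if penalties.get(name))
    return max(score, 0.0)  # floor the score at zero
```

Under this toy rubric, an answer with sound reasoning and accurate facts that nonetheless hallucinates a source loses most of its credit, mirroring how the real lawyer citing a repealed law is "still in trouble."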

4. The Results: Who Passed the Test?

The researchers tested 35 different AI models (the "contestants") on this million-dollar job interview. Here's what they found:

  • The "Search" Superpower: The best performers were the ones allowed to use Web Search. It's like giving the driver a GPS and a radio to check traffic. However, for some weaker models, the GPS just confused them, and they crashed.
  • The "Deep Research" Trap: Some models were built specifically to do long, deep research. They did okay, but they didn't beat the smartest general-purpose models that were just given a good search tool. It turns out, being a "generalist with a good tool" is often better than being a "specialist with a bad tool."
  • The Gap is Huge: Even the best AI (Claude-Opus-4.6) only passed about 43% of the tasks. That means for more than half the jobs, the AI wasn't ready to work alone yet. It needs a human supervisor.
  • The "Near-Miss" Problem: Many models got part of the answer right. They looked smart, but they missed one critical detail that made the whole solution useless. It's like building a house with perfect walls but forgetting the roof.

5. The Big Takeaway

This paper is a wake-up call.

We have been celebrating AI for being able to write poems and pass trivia quizzes. But $OneMillion-Bench shows us that when it comes to doing actual, paid professional work, AI is still a bit like a bright intern who hasn't finished their training.

It has the knowledge, but it lacks the reliability, the caution, and the deep understanding required to be trusted with a million-dollar job.

The Bottom Line:
We are moving from the era of "Chatbots that talk" to "Agents that work." This benchmark is the first real ruler to measure if they are ready to get a paycheck. The answer? They are getting there, but they aren't quite ready to replace the experts just yet.