OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

This paper introduces OSWorld-Human, a manually annotated benchmark revealing that current computer-use agents suffer from prohibitive latency and inefficiency, primarily due to excessive model calls for planning and reflection, resulting in them taking 2.7 to 4.3 times more steps than humans to complete tasks.

Original authors: Reyna Abhyankar, Qi Qi, Yiying Zhang

Published 2026-05-19

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you hire a very smart, but incredibly slow, personal assistant to do a simple task on your computer, like changing the font size in a document. You expect it to take 30 seconds. Instead, your assistant takes 12 minutes. Why?

This paper, OSWorld-Human, investigates exactly that problem. It looks at "Computer-Use Agents" (AI programs that control your mouse and keyboard) and asks: "Why are they so slow, and how can we make them faster?"

Here is a breakdown of the authors' findings, using simple analogies.

1. The Problem: The "Over-Thinker" Assistant

The researchers found that while these AI agents are getting better at completing tasks correctly (accuracy), they are still very poor at completing them quickly (efficiency).

  • The Human vs. The Robot: A human expert can change line spacing in a document in under 30 seconds. The AI agent takes 12 minutes.
  • The Bottleneck: The delay isn't because the AI is clicking the mouse slowly. It's because the AI stops to think constantly.
    • The Analogy: Imagine you are driving to the grocery store. A human driver just drives. This AI driver stops at every single intersection to call a super-intelligent consultant on the phone to ask, "Is this the right turn? Should I turn left? Did I turn left correctly? What about the next turn?"
    • The Reality: The AI calls a massive "brain" (a Large Language Model) to plan the next move, judge if the move worked, and reflect on what happened. These phone calls take the most time. In fact, just the "planning" and "judging" steps take up about 75% to 96% of the total time!

2. The Investigation: Breaking Down the Steps

The researchers watched two popular AI agents (Agent S2 and GTA1) try to solve 39 different computer tasks. They broke down every second of the process.

  • The "Thinking" Tax: Every time the AI takes a step, it has to send a huge message to its brain. As the task gets longer, the message gets bigger because it has to include the history of everything it did before.
    • Analogy: It's like writing a diary entry. On day 1, you write one sentence. On day 50, you have to rewrite the whole book from day 1 just to add one new sentence. This makes the later steps take 3 times longer than the early steps.
  • The "Wrong Turn" Loop: Sometimes, the AI knows what to do (e.g., "Click the Open button"), but it can't find where to click on the screen. It clicks the wrong spot, realizes it's wrong, and tries again.
    • The Cost: This "getting stuck" happens often. In one case, an AI tried to open a folder and got stuck in a loop for 27 minutes, wasting $8.47 in computing costs, just because it couldn't find the right spot on the screen.
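
To make the "Thinking" Tax concrete, here is a minimal sketch of how the message sent to the model grows when the full history is resent at every step. The function name `prompt_sizes` and the token counts (`base_tokens`, `tokens_per_step`) are made-up values for illustration, not measurements from the paper.

```python
def prompt_sizes(num_steps, base_tokens=1000, tokens_per_step=500):
    """Return the prompt size (in tokens) sent to the model at each step.

    Assumption for this sketch: every step appends a fixed-size record to
    the history, and the whole history is resent with each new call.
    All numbers are illustrative, not taken from the paper.
    """
    return [base_tokens + step * tokens_per_step for step in range(num_steps)]


sizes = prompt_sizes(num_steps=10)
total = sum(sizes)

print(f"first prompt: {sizes[0]} tokens")
print(f"last prompt:  {sizes[-1]} tokens")
print(f"total tokens sent over the whole task: {total}")
```

With these illustrative numbers, the last prompt is several times larger than the first, and the total grows roughly with the square of the number of steps. That is the diary problem: each new sentence costs more than the one before it.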

3. The Solution: The "Human Benchmark" (OSWorld-Human)

To prove how inefficient these robots are, the researchers created a new benchmark called OSWorld-Human.

  • What is it? The researchers manually went through all 369 tasks in the original OSWorld benchmark and wrote down the shortest, most logical path a human would take to finish each one.
  • The "Grouping" Trick: They noticed that humans often do several things in one go without stopping to think.
    • Analogy: A human sees a text box, types a name, and hits "Enter" in one smooth motion. The AI, however, stops after seeing the box, calls its brain to plan typing, stops again to plan hitting "Enter," and calls its brain again.
    • The Finding: The researchers found that many actions could be "grouped" together. If the AI could do three clicks in one "thought," it would save massive amounts of time.
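
The "Grouping" Trick above can be sketched as a simple count of model calls. This is only an illustration under stated assumptions: `count_model_calls` and the `needs_observation` flag are hypothetical names, and the rule "a new call is needed only when the agent must look at the screen again" is a simplification, not the paper's annotation procedure.

```python
def count_model_calls(actions, group=False):
    """Count planning calls for a list of (action, needs_observation) pairs.

    needs_observation (a hypothetical flag for this sketch) is True when the
    agent must look at the screen again before it can act. Without grouping,
    every action costs one call. With grouping, a call is made only for the
    first action and for each action that needs a fresh observation.
    """
    if not group:
        return len(actions)
    calls = 0
    for index, (_, needs_observation) in enumerate(actions):
        if index == 0 or needs_observation:
            calls += 1
    return calls


trajectory = [
    ("click text box", True),
    ("type name", False),
    ("press Enter", False),
    ("click Save in the new dialog", True),
]

ungrouped = count_model_calls(trajectory)
grouped = count_model_calls(trajectory, group=True)
print(f"calls without grouping: {ungrouped}")
print(f"calls with grouping:    {grouped}")
```

In this toy trajectory, typing the name and pressing "Enter" ride along with the first click, so four calls shrink to two.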

4. The Verdict: The Robots Are Wasting Steps

The researchers tested 16 different AI agents against their new "Human Benchmark."

  • The Result: Even the best AI agents took 2.7 to 4.3 times more steps than a human needed to finish the same task.
  • The Score: They created a new score called the Weighted Efficiency Score (WES).
    • If an AI gets the job done but takes 4 times as many steps as necessary, its score drops drastically.
    • The top AI on the leaderboard, which looked great on old tests, scored only 15.6% on this new efficiency test. This means it is doing a lot of unnecessary work.
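
To see why an efficiency-weighted score can sit far below a plain success rate, here is a minimal sketch. It is not the paper's exact WES formula. The rule used here is an assumption for illustration: a successful task earns the ratio of human steps to agent steps (capped at 1), a failed task earns nothing, and the score is the average over all tasks. The function name and the sample runs are hypothetical.

```python
def efficiency_weighted_score(results):
    """Average per-task credit over (succeeded, agent_steps, human_steps) tuples.

    Hypothetical scoring rule for this sketch (not the paper's definition):
    a successful task earns human_steps / agent_steps, capped at 1.0;
    a failed task earns 0.
    """
    if not results:
        return 0.0
    total = 0.0
    for succeeded, agent_steps, human_steps in results:
        if succeeded and agent_steps > 0:
            total += min(1.0, human_steps / agent_steps)
    return total / len(results)


# Made-up runs: (succeeded, agent_steps, human_steps)
runs = [
    (True, 20, 5),   # success, but 4 times the human step count
    (True, 5, 5),    # success, matches the human path
    (False, 30, 6),  # failure earns no credit
    (True, 12, 4),   # success, 3 times the human step count
]

success_rate = sum(1 for ok, _, _ in runs if ok) / len(runs)
score = efficiency_weighted_score(runs)
print(f"plain success rate:        {success_rate:.1%}")
print(f"efficiency-weighted score: {score:.1%}")
```

In this toy data the agent succeeds on three of four tasks, yet the efficiency-weighted score is much lower, because two of those successes used several times more steps than the human path.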

Summary

The paper concludes that current AI computer agents are like brilliant but clumsy students. They know the answer, but they spend so much time raising their hand, waiting for the teacher, and re-reading their notes that they never finish the test on time.

To fix this, the paper suggests three main things:

  1. Stop over-thinking: Reduce the number of times the AI calls its "brain" to plan and judge.
  2. Group actions: Let the AI do multiple steps (like typing and hitting enter) in a single thought.
  3. Get better at finding things: Improve the AI's ability to see exactly where to click so it doesn't get stuck in loops.

The goal isn't just to make AI smarter, but to make it faster and more practical for real-world use.
