Imagine you hire a very smart, but incredibly slow, personal assistant to do a simple task on your computer, like changing the line spacing in a document. You expect it to take 30 seconds. Instead, your assistant takes 12 minutes. Why?

This paper, OSWorld-Human, investigates exactly that problem. It looks at "Computer-Use Agents" (AI programs that control your mouse and keyboard) and asks: "Why are they so slow, and how can we make them faster?"

Here is the breakdown of their findings using simple analogies.

1. The Problem: The "Over-Thinker" Assistant

The researchers found that while these AI agents are getting better at getting the job done (accuracy), they are terrible at how fast they do it (efficiency).

The Human vs. The Robot: A human expert can change line spacing in a document in under 30 seconds. The AI agent takes 12 minutes.
The Bottleneck: The delay isn't because the AI is clicking the mouse slowly. It's because the AI stops to think constantly.
- The Analogy: Imagine you are driving to the grocery store. A human driver just drives. This AI driver stops at every single intersection to call a super-intelligent consultant on the phone to ask, "Is this the right turn? Should I turn left? Did I turn left correctly? What about the next turn?"
- The Reality: The AI calls a massive "brain" (a Large Language Model) to plan the next move, judge if the move worked, and reflect on what happened. These phone calls take the most time. Depending on the agent, the planning, judging, and reflection steps take up roughly 76% to 96% of the total time!

2. The Investigation: Breaking Down the Steps

The researchers watched two popular AI agents (Agent S2 and GTA1) try to solve 39 different computer tasks. They broke down every second of the process.

The "Thinking" Tax: Every time the AI takes a step, it has to send a huge message to its brain. As the task gets longer, the message gets bigger because it has to include the history of everything it did before.
- Analogy: It's like writing a diary entry. On day 1, you write one sentence. On day 50, you have to rewrite the whole book from day 1 just to add one new sentence. This makes the later steps take 3 times longer than the early steps.
The "Wrong Turn" Loop: Sometimes, the AI knows what to do (e.g., "Click the Open button"), but it can't find where to click on the screen. It clicks the wrong spot, realizes it's wrong, and tries again.
- The Cost: This "getting stuck" happens often. In one case, an AI tried to open a folder and got stuck in a loop for 27 minutes, wasting $8.47 in computing costs, just because it couldn't find the right spot on the screen.

3. The Solution: The "Human Benchmark" (OSWorld-Human)

To prove how inefficient these robots are, the researchers created a new benchmark called OSWorld-Human.

What is it? They manually went through all 369 tasks in the OSWorld benchmark and wrote down the shortest, most logical path a human would take to finish them.
The "Grouping" Trick: They noticed that humans often do several things in one go without stopping to think.
- Analogy: A human sees a text box, types a name, and hits "Enter" in one smooth motion. The AI, however, stops after seeing the box, calls its brain to plan typing, stops again to plan hitting "Enter," and calls its brain again.
- The Finding: The researchers found that many actions could be "grouped" together. If the AI could do three clicks in one "thought," it would save massive amounts of time.

4. The Verdict: The Robots Are Wasting Steps

The researchers tested 16 different AI agents against their new "Human Benchmark."

The Result: Even the best AI agents took 2.7 to 4.3 times more steps than a human needed to finish the same task.
The Score: They created a new score called the Weighted Efficiency Score (WES).
- If an AI gets the job done but takes 4 times more steps than necessary, its score drops drastically.
- The top AI on the leaderboard, which looked great on old tests, only scored 15.6% on this new efficiency test. This means it is doing a lot of unnecessary work.

Summary

The paper concludes that current AI computer agents are like brilliant but clumsy students. They know the answer, but they spend so much time raising their hand, waiting for the teacher, and re-reading their notes that they never finish the test on time.

To fix this, the paper suggests three main things:

Stop over-thinking: Reduce the number of times the AI calls its "brain" to plan and judge.
Group actions: Let the AI do multiple steps (like typing and hitting enter) in a single thought.
Get better at finding things: Improve the AI's ability to see exactly where to click so it doesn't get stuck in loops.

The goal isn't just to make AI smarter, but to make it faster and more practical for real-world use.

Technical Summary: OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

1. Problem Statement

While generative AI-driven Computer-Use Agents (CUAs) have demonstrated increasing accuracy on benchmarks like OSWorld, they suffer from a critical bottleneck: extreme end-to-end latency. State-of-the-art systems often require tens of minutes to complete tasks that human experts can finish in seconds or minutes. This disparity renders current agents practically unusable for interactive or time-sensitive workflows. Prior research has focused predominantly on improving task success rates, largely neglecting the temporal efficiency required for real-world deployment. There is a lack of systematic understanding regarding where this latency originates (e.g., specific agent steps, model calls) and how many steps are truly necessary compared to human execution.

2. Methodology

The authors conducted the first systematic study on the temporal performance of CUAs using the OSWorld benchmark, which covers 369 tasks across 9 applications (e.g., Chromium, GIMP, LibreOffice, VS Code) on Ubuntu, Windows, and macOS.

2.1 Latency and Step Analysis

The study performed a detailed breakdown of agent trajectories using two leading open-source frameworks: Agent S2 and GTA1.

Instrumentation: The authors profiled specific steps including information retrieval, step planning, step grounding (finding coordinates), action execution, screenshotting, judging, and reflection.
Model Configuration: Agent S2 utilized GPT-4.1 for planning and reflection, with UI-TARS-7B-DPO for grounding. GTA1 utilized o3 for planning and judging, with GTA1-7B for grounding.
Observation Modalities: The study analyzed the impact of different perception inputs: raw screenshots, Accessibility (A11y) trees, and Set-of-Marks (SoM).
Cost and Failure Analysis: The authors analyzed token counts, financial costs, and failure modes (specifically grounding errors leading to loops) for GTA1.
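
As an illustration of this kind of per-step profiling, here is a minimal Python sketch. It is not the authors' instrumentation: the `PhaseProfiler` class, the phase names, and the `time.sleep` calls standing in for model calls are all assumptions made for the example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseProfiler:
    """Accumulates wall-clock time per named agent phase (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        # Time everything executed inside the `with` block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each phase's fraction of the total recorded time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


profiler = PhaseProfiler()
with profiler.phase("planning"):
    time.sleep(0.05)   # stand-in for an LLM planning call
with profiler.phase("grounding"):
    time.sleep(0.01)   # stand-in for a grounding-model call
with profiler.phase("execution"):
    time.sleep(0.005)  # stand-in for the click or keystroke itself

shares = profiler.shares()
for name, share in shares.items():
    print(f"{name}: {share:.0%} of recorded time")
```

Wrapping each phase in its own timer is what makes it possible to say which phase dominates, rather than only reporting end-to-end task time.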

2.2 Construction of OSWorld-Human

To establish a ground truth for efficiency, the authors constructed OSWorld-Human, a manually annotated dataset containing:

Minimal Human Trajectories: For all 369 OSWorld tasks, human annotators determined the minimal number of steps required for successful completion based on verified ground-truth sources.
Action Grouping: The dataset includes "grouped-action" trajectories where multiple consecutive actions (e.g., click, type, press enter) that can be executed from a single visual observation are consolidated into a single step. This identifies opportunities to reduce Large Language Model (LLM) calls.
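
The grouping idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Action` type, the `needs_new_observation` flag, and the sample trajectory are hypothetical and do not reflect the dataset's actual annotation format.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    # Hypothetical flag: True if the screen must be re-inspected before this action.
    needs_new_observation: bool


def group_actions(actions):
    """Split a trajectory into groups that each need only one observation."""
    groups = []
    for action in actions:
        if not groups or action.needs_new_observation:
            groups.append([action])      # start a new group (new screenshot, new LLM call)
        else:
            groups[-1].append(action)    # can be issued from the previous observation
    return groups


trajectory = [
    Action("click search box", True),
    Action("type 'report.docx'", False),
    Action("press Enter", False),
    Action("click first result", True),
]
groups = group_actions(trajectory)
print(len(trajectory), "single actions ->", len(groups), "grouped steps")
```

In this made-up trajectory, four single actions collapse into two grouped steps, so an agent that plans per group would need two planning calls instead of four.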

2.3 Evaluation Metric: Weighted Efficiency Score (WES)

The authors proposed a new metric, Weighted Efficiency Score (WES), to evaluate agents against human trajectories.

Formula: $WES = WES^+ \cdot (1 - \frac{\bar{t}_{fail}}{S})$
- $WES^+$ : The average ratio of human steps to agent steps for successful tasks ( $t_{human} / t_{agent}$ ).
- Penalty Multiplier: Penalizes agents that fail tasks after consuming a large portion of their step budget ( $\bar{t}_{fail}$ is the average steps taken on failed tasks; $S$ is the max steps allowed).
Purpose: This metric rewards agents that succeed with fewer steps and penalizes those that fail inefficiently, providing a holistic view of accuracy and temporal efficiency.
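
A small worked example may make the formula concrete. The sketch below follows the formula as stated in this summary. One point is an assumption: $WES^+$ is averaged over all tasks, with failed tasks contributing zero, so that the score also reflects accuracy. The `TaskResult` type and the numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    success: bool
    agent_steps: int   # steps the agent actually took
    human_steps: int   # minimal steps in the human trajectory


def weighted_efficiency_score(results, max_steps):
    """WES = WES+ * (1 - mean_failed_steps / max_steps), per the summary's formula."""
    n = len(results)
    if n == 0:
        raise ValueError("no results")
    # Assumption: failed tasks contribute 0 to WES+, and the average is over all tasks.
    wes_plus = sum(r.human_steps / r.agent_steps for r in results if r.success) / n
    failed_steps = [r.agent_steps for r in results if not r.success]
    mean_fail_steps = sum(failed_steps) / len(failed_steps) if failed_steps else 0.0
    return wes_plus * (1 - mean_fail_steps / max_steps)


results = [
    TaskResult(success=True, agent_steps=8, human_steps=4),    # ratio 0.5
    TaskResult(success=True, agent_steps=10, human_steps=10),  # ratio 1.0
    TaskResult(success=False, agent_steps=15, human_steps=5),
    TaskResult(success=False, agent_steps=5, human_steps=3),
]
score = weighted_efficiency_score(results, max_steps=20)
print(score)  # 0.1875
```

Here $WES^+ = (0.5 + 1.0) / 4 = 0.375$, the failed tasks average 10 of 20 allowed steps, and the penalty multiplier of 0.5 brings the score to 0.1875, even though half the tasks succeeded.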

3. Key Findings and Results

3.1 Sources of Latency

LLM Calls are the Bottleneck: Large model calls for planning, reflection, and judging account for the vast majority of latency.
- In Agent S2, planning and reflection account for 76%–96% of total task latency.
- In GTA1, planning and judging account for 91%–96% of total latency.
Step-Dependent Latency: As an agent takes more steps to complete a task, the latency of each successive step increases. This is due to the "history accumulation" mechanism where prompts include all previous steps, leading to longer context windows and higher token counts.
Observation Impact: Including A11y trees drastically increases latency (due to tree generation time and token volume) and often increases the number of steps required, though it can reduce steps for specific applications like GIMP.
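
The history-accumulation effect can be illustrated with a toy calculation. The token counts below are hypothetical, not measurements from the paper. The sketch assumes every step resends a fixed base prompt plus one fixed-size entry for each earlier step.

```python
def prompt_tokens_per_step(num_steps, base_tokens, tokens_per_history_entry):
    """Prompt size at each step when the full history is resent every time."""
    return [base_tokens + k * tokens_per_history_entry for k in range(num_steps)]


per_step = prompt_tokens_per_step(num_steps=5, base_tokens=1000, tokens_per_history_entry=500)
total = sum(per_step)
print(per_step)  # [1000, 1500, 2000, 2500, 3000]
print(total)     # 10000
```

Under these assumptions the prompt for each step grows linearly with the step index, so the total number of tokens processed over a task grows roughly quadratically with the number of steps.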

3.2 Efficiency Gaps

Inefficiency of Current Agents: Even the best-performing agents take 2.7× to 4.3× more steps than necessary to complete a task compared to human trajectories.
WES Scores: The top-performing agent on the OSWorld leaderboard (Agent S2 w/ Gemini 2.5) achieved a success rate of 41.4% but only a 15.6% WES on single-action trajectories and 9.6% on grouped-action trajectories.
Failure Analysis: For GTA1, 23% of failures were due to grounding errors where the planning was correct, but the grounding model generated incorrect coordinates, causing the agent to repeat steps or enter loops. In some cases, agents wasted 72+ steps in a single loop.
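
To make the loop failure mode concrete, here is a minimal Python sketch of how repeated identical actions could be spotted in an action log. It is a hypothetical illustration, not a mechanism described in the paper: the `find_repeated_run` function, the log format, and the threshold of three repeats are all assumptions.

```python
def find_repeated_run(actions, threshold=3):
    """Return the start index of the first run of `threshold` identical
    consecutive actions, or None if there is no such run."""
    run_length = 1
    for i in range(1, len(actions)):
        if actions[i] == actions[i - 1]:
            run_length += 1
            if run_length >= threshold:
                return i - threshold + 1
        else:
            run_length = 1
    return None


# Hypothetical log: the agent keeps clicking the same wrong coordinates.
log = [
    ("click", 120, 48),
    ("click", 310, 222),
    ("click", 310, 222),
    ("click", 310, 222),
    ("type", "notes.txt"),
]
start = find_repeated_run(log)
print(start)  # 1
```

A check like this only catches exact repeats. A grounding model that misses the target by a few different pixels each time would need a looser comparison.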

3.3 Cost Implications

The computational cost is heavily skewed toward planning. For GTA1, planning accounts for 87% of the total cost, while judging accounts for 13%. The cost scales quadratically with the number of steps due to parallel planning rollouts and retry mechanisms.

4. Key Contributions

First Systematic Latency Study: A detailed analysis of the temporal performance of CUAs (Agent S2 and GTA1) on the OSWorld benchmark, identifying LLM calls as the primary bottleneck.
OSWorld-Human Dataset: A manually curated, cross-verified benchmark containing optimal human trajectories and grouped-action trajectories for all 369 OSWorld tasks.
New Metric (WES): The proposal of the Weighted Efficiency Score to evaluate agents based on both success rate and step efficiency, penalizing inefficient failures.
Action Grouping Analysis: Demonstration that grouping actions executable from a single observation can significantly reduce the number of required LLM calls and steps.
Comprehensive Evaluation: An evaluation of 16 state-of-the-art agents using OSWorld-Human, revealing that current systems are significantly less efficient than human benchmarks.
Failure and Cost Analysis: Identification of grounding errors as a major cause of wasted steps and loops, and a breakdown of the financial costs associated with current agent architectures.

5. Significance and Claims

The paper claims to establish the first unified framework for benchmarking, analyzing, and improving the temporal efficiency of computer-use agents. The authors argue that while accuracy is important, temporal efficiency is equally crucial for real-world usability.

The study highlights that current agents are practically unusable in interactive scenarios due to high latency and excessive step counts. By providing OSWorld-Human and the WES metric, the authors aim to guide future developments toward:

Reducing the latency of planning, judging, and reflection calls.
Minimizing the number of steps per task through action grouping.
Improving grounding mechanisms to prevent loops and coordinate errors.
Compressing history to manage token costs and latency.

The authors conclude that their work and the open-sourced dataset will foster new research directions to bridge the gap between current agent capabilities and the efficiency required for practical deployment.
