Efficient Agent Training for Computer Use

The paper introduces PC Agent-E, an efficient training framework that synthesizes diverse action decisions using Claude 3.7 Sonnet to augment a small set of 312 human trajectories, resulting in a model that significantly outperforms both human-only training and direct distillation on the WindowsAgentArena-V2 benchmark.

Yanheng He, Jiahe Jin, Pengfei Liu

Published 2026-03-04

Imagine you want to teach a robot to use a computer just like a human does—clicking buttons, typing in forms, and navigating menus. The biggest problem has always been that teaching a robot requires showing it thousands of examples of humans doing these tasks. It's like trying to teach a child to drive by only letting them watch one person drive for a few minutes; they won't get the hang of it.

This paper introduces PC Agent-E, a new way to train these computer-using robots that is incredibly efficient. Here is the story of how they did it, explained simply.

The Problem: The "Data Famine"

Usually, to make a smart AI, you need a massive library of "human demonstrations" (videos or logs of people using computers). But getting thousands of high-quality examples is expensive, slow, and hard to do. The authors started with a tiny library: just 312 examples recorded by two people in a single day.

The Solution: The "Smart Tutor" Method

Instead of just memorizing those 312 examples, the authors used a clever trick they call Trajectory Boost. Think of it like this:

  1. The Raw Footage (Human Data): They recorded 312 times when humans successfully used a computer. This is the "trunk" of the tree.
  2. The Inner Monologue (Thought Completion): Humans don't usually say out loud why they clicked a button. So the authors fed the raw recordings to a powerful AI model (Claude 3.7 Sonnet), which reconstructed the "inner thoughts" the human must have had at each step. Now, the robot knows not just what the human did, but why.
  3. The "What If" Game (Trajectory Boost): This is the magic step. The authors asked the super-smart AI: "Okay, we know the human clicked 'Save' here. But what are 9 other valid ways to solve this problem? Maybe they could have clicked 'File' then 'Save', or used a keyboard shortcut?"
    • The AI generated 9 alternative paths for every single step the human took.
    • Since each trajectory contains many steps, and every step now carries 10 action decisions (the human's original plus 9 synthesized alternatives), those 312 recordings ballooned into roughly 27,000 training examples.
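The augmentation loop above can be sketched in a few lines of Python. This is a minimal illustration of the Trajectory Boost idea, not the authors' actual code: `propose_alternatives` is a hypothetical stand-in for a call to a teacher model like Claude 3.7 Sonnet, and here it just fabricates placeholder actions so the sketch runs on its own.

```python
def propose_alternatives(task, state, human_action, n=9):
    """Hypothetical stand-in for querying a teacher model.
    Real use would prompt the model for n alternative valid actions."""
    return [f"alt_{i}_for_{human_action}" for i in range(n)]


def trajectory_boost(trajectories, n_alternatives=9):
    """Turn each recorded human step into 1 + n_alternatives training examples."""
    examples = []
    for task, steps in trajectories:
        for state, human_action in steps:
            # Keep the human's original decision...
            examples.append((task, state, human_action))
            # ...and add the teacher's synthesized alternatives for the same state.
            for alt in propose_alternatives(task, state, human_action, n_alternatives):
                examples.append((task, state, alt))
    return examples


# Toy usage: 2 trajectories of 3 steps each -> 2 * 3 * (1 + 9) = 60 examples.
demo = [
    ("rename a file", [(f"screen_{i}", f"click_{i}") for i in range(3)]),
    ("send an email", [(f"screen_{i}", f"type_{i}") for i in range(3)]),
]
boosted = trajectory_boost(demo)
print(len(boosted))  # 60
```

The key property is that the multiplication happens per step, not per trajectory, which is how a few hundred recordings can yield tens of thousands of training decisions.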

The Result: A Super-Student

They trained their robot (PC Agent-E) on this massive, enriched dataset. The results were shocking:

  • The Base Model: A standard AI model (Qwen2.5-VL-72B) could only solve about 15% of the computer tasks.
  • The New Robot: PC Agent-E solved 36% of the tasks.
  • Beating the Teacher: Even more impressively, this small, open-source robot performed better than the giant, expensive "teacher" AI (Claude 3.7 Sonnet) that was used to generate the extra data.

Why This is a Big Deal

The authors compared their method to two other ways of training:

  1. Just watching humans: Training only on the raw 312 human examples gave a modest improvement over the base model.
  2. Direct Distillation (Copying the Teacher): This is where you ask the smart AI to perform whole tasks itself and train on its answers. This is slow, expensive, and prone to errors, since the student faithfully copies the teacher's mistakes along with its successes.

The PC Agent-E method was 300 times faster than the "Direct Distillation" method. Instead of asking the smart AI to actually do the tasks on a real computer (which takes hours), they just asked it to imagine the steps offline. It's the difference between hiring a master chef to cook 3,000 meals for you (expensive and slow) versus hiring them to write down 3,000 variations of a recipe (fast and cheap), and then teaching your apprentice from those recipes.

The New "Driving Test" (WindowsAgentArena-V2)

The authors also realized that the existing tests for these robots were flawed. Some tests were impossible to pass (like trying to use a feature that doesn't exist), and robots could "cheat" by just saying "I failed" to get a perfect score. They built a new, fairer test called WindowsAgentArena-V2 to ensure robots are actually learning to use computers, not just gaming the system.
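The scoring flaw can be made concrete with a toy evaluator. This is an illustrative sketch, not the actual WindowsAgentArena-V2 code: if a benchmark includes broken or infeasible tasks and rewards an agent for declaring them impossible, then a "lazy" agent that always gives up scores surprisingly well. Removing those tasks, as the V2-style cleanup does here, makes the gaming strategy worthless.

```python
def score(tasks, agent):
    """Toy evaluator: credit a task if the agent reaches the goal state,
    or if it correctly declares a truly infeasible task impossible."""
    correct = 0
    for task in tasks:
        report, final_state = agent(task)
        if report == "infeasible":
            correct += int(task["infeasible"])   # credit only if truly infeasible
        else:
            correct += int(final_state == task["goal"])
    return correct / len(tasks)


# A "lazy" agent that never touches the computer and always gives up.
lazy_agent = lambda task: ("infeasible", None)

# A benchmark where 6 of 10 tasks are actually broken/infeasible.
mixed = [{"infeasible": True, "goal": None}] * 6 + \
        [{"infeasible": False, "goal": "done"}] * 4

# V2-style cleanup: drop the infeasible tasks entirely.
cleaned = [t for t in mixed if not t["infeasible"]]

print(score(mixed, lazy_agent))    # 0.6 -- giving up "earns" a high score
print(score(cleaned, lazy_agent))  # 0.0 -- the shortcut no longer pays
```

Verifying the actual environment state, rather than trusting the agent's self-report, is what forces the agent to genuinely perform the task.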

The Bottom Line

You don't need a million human videos to teach a robot to use a computer. If you have a small, high-quality set of human examples and a smart AI to help you imagine all the different ways to solve a problem, you can train a robot that is smarter than the AI you used to teach it.

It's like teaching a student not just by showing them one way to solve a math problem, but by having a genius tutor explain the logic and then brainstorming ten different ways to solve it, ensuring the student truly understands the concept rather than just memorizing the answer.