GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation

The Problem: The "Generalist" Who Doesn't Know the Neighborhood

Imagine you hire a brilliant personal assistant (the GUI agent) who has read every book in the library. They are great at logic, at following instructions, and at understanding general concepts.

Then you ask them to change a specific, slightly obscure setting in a piece of software they have never used before (say, adjusting the contrast in GIMP, a photo editor).

Because they only have "general" knowledge, they get stuck. They might:

Plan wrong: They know what "contrast" means, but they look for the button in the wrong menu (like looking in the "Image" menu instead of the "Colors" menu, which is where GIMP hides it).
Get lost: They see a bunch of sliders and don't know which one is the "Contrast" slider and which one is "Brightness."

This is called Domain Bias. The agent is smart, but it lacks the "local knowledge" of that particular software. Usually, to fix this, you'd have to pay a human to write a manual for every single app, or fine-tune the AI on each one (which is expensive and slow).

The Solution: GUIDE (The "On-Demand Librarian")

The authors created a framework called GUIDE. Instead of retraining the AI, GUIDE acts like a super-fast, on-demand librarian that runs alongside the agent.

When the agent is handed a task, GUIDE doesn't guess. It goes out to the internet (specifically YouTube), finds a real human tutorial video for a closely matching task, and translates that video into a cheat sheet for the agent.

Here is how GUIDE works in three simple steps:

1. The Detective (Retrieval Agent)

The Analogy: Imagine you ask a detective, "How do I change the brightness in GIMP?"
What GUIDE does: Instead of trusting video titles, which are often noisy or misleading, the detective reads the subtitles of candidate YouTube videos.
The Magic: It filters out vlogs and theoretical lectures. It finds the video where someone says, "Click on the Colors menu, then select Brightness-Contrast." It skips videos whose subtitles never describe concrete actions such as clicking or selecting.
Result: It picks the one or two videos that best match your specific problem.

2. The Translator (Annotation Agent)

The Analogy: Now that we have the video, we need to turn it into a recipe. But we can't just say "Click here" because the screen might look different today.
What GUIDE does: It uses a "before-and-after" method (called Inverse Dynamics). It looks at two frames of the video, one before the click and one after it, and works backward to the action that connects them.
- It asks: "What changed? What button was pressed? What text appeared?"
- It ignores the boring parts (like the video intro or the person talking).
- It translates the visual action into Natural Language instructions.
The Output: It creates two types of notes:
- Planning Notes: "First, go to the Colors menu. Then, look for the slider." (The Strategy)
- Grounding Notes: "The slider is a horizontal bar labeled 'Contrast' located right under the 'Brightness' slider." (The Visual Clues)

3. The Coach (Integration)

The Analogy: The agent is playing a video game. GUIDE whispers the "Walkthrough" into the agent's ear while they are playing.
What GUIDE does: It doesn't force the agent to follow the notes blindly. It says, "Hey, here is a hint from a tutorial video. Does this match what you see on your screen? If yes, use it. If no, ignore it and trust your own eyes."
Result: The agent now knows the app-specific menu paths and can find the right buttons more reliably.

Why This is a Big Deal

No Re-training: The AI's weights are never touched. You just give it a cheat sheet, so the same approach plugs into different agents, from single models to multi-agent systems.
Always Up-to-Date: Software changes all the time. If GIMP rearranges its menus, GUIDE can pick up a newer tutorial video as soon as one is published and rebuild the cheat sheet, whereas a model trained on old data would have to be retrained.
Cheap and Fast: Instead of paying humans to annotate thousands of screenshots, GUIDE automates the whole process using free internet videos.

The Bottom Line

GUIDE solves the problem of "Smart AI, but clueless about specific apps" by giving the AI a real-time tutor. It finds a video of a human doing the task, translates that video into a text-based strategy guide, and hands it to the AI just in time to help it succeed.

It turns the internet's tutorial videos into a massive, on-demand reference library for GUI agents, with no training required.

1. Problem Statement: Domain Bias in GUI Agents

While large Vision-Language Models (VLMs) have enabled GUI agents to understand interfaces and carry out general tasks, these agents suffer from significant domain bias when applied to specific software applications. This bias manifests in two critical ways:

Planning-Level Bias: The agent lacks familiarity with the operational workflows of a given application. For example, it may know the concept of "adjusting brightness" but not know that in GIMP this is done via Colors > Brightness-Contrast, whereas in Photoshop it is under Image > Adjustments.
Grounding-Level Bias: The agent struggles to locate UI elements within an application's particular layout. It may recognize a "menu" conceptually but fail to identify the "Colors" entry or the "brightness slider" within a given dialog box.

Existing solutions like manual annotation, rule-based systems, or domain-specific fine-tuning are costly, narrow in coverage, and cannot keep pace with rapidly evolving software interfaces.

2. Methodology: The GUIDE Framework

GUIDE (GUI Unbiasing via Instructional-Video Driven Expertise) is a training-free, plug-and-play framework that resolves domain bias by autonomously acquiring expertise from web tutorial videos (specifically YouTube) at inference time. It operates via three collaborating components:

A. Subtitle-Driven Video-RAG Pipeline (Retrieval Agent)

Instead of relying on noisy video titles, GUIDE leverages subtitles as a semantic bridge to find relevant tutorials. It employs a progressive three-stage filtering process:

Domain Classification: Filters out non-GUI content (e.g., vlogs, lectures) by analyzing subtitles for operational verbs (e.g., "click," "select") rather than just titles.
Topic Extraction: Generates a precise semantic descriptor of the task by analyzing the combination of the video title and subtitle content, correcting for misleading titles.
Relevance Matching: Uses a dual-anchored prompt (repeating the extracted topic) to match the task instruction against the video content, selecting the top-$K$ (usually $K \le 2$) most relevant videos.
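
The three filtering stages can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not GUIDE's implementation: keyword overlap stands in for the model-based classification, topic extraction, and matching, and the names `Video`, `OPERATIONAL_VERBS`, `is_gui_tutorial`, `extract_topic`, `relevance`, and `retrieve` are all hypothetical.

```python
from dataclasses import dataclass

# Assumption: a short verb list stands in for the subtitle-based domain classifier.
OPERATIONAL_VERBS = {"click", "select", "open", "drag", "type", "press", "choose"}


@dataclass
class Video:
    title: str
    subtitles: str


def _words(text):
    """Lowercase the text and strip surrounding punctuation from each token."""
    return {w.strip(".,!?:;\"'()").lower() for w in text.split()} - {""}


def is_gui_tutorial(video, min_verbs=2):
    """Stage 1 (domain classification): require operational verbs in the subtitles."""
    return len(_words(video.subtitles) & OPERATIONAL_VERBS) >= min_verbs


def extract_topic(video):
    """Stage 2 (topic extraction): combine title and subtitles, not the title alone."""
    return _words(video.title) | _words(video.subtitles)


def relevance(task, topic):
    """Stage 3 (relevance matching): fraction of task words covered by the topic."""
    task_words = _words(task)
    return len(task_words & topic) / len(task_words) if task_words else 0.0


def retrieve(task, videos, k=2):
    """Run the three stages in order and return the top-k surviving videos."""
    candidates = [v for v in videos if is_gui_tutorial(v)]
    ranked = sorted(candidates,
                    key=lambda v: relevance(task, extract_topic(v)),
                    reverse=True)
    return ranked[:k]
```

Running the cheap domain filter first means the more expensive matching step only sees videos that actually narrate GUI operations, which is the point of making the filtering progressive.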

B. Fully Automated Annotation Pipeline (Annotation Agent)

Once relevant videos are retrieved, GUIDE converts them into structured, transferable knowledge using an Inverse Dynamics paradigm:

Keyframe Extraction: Uses audio timestamps (Whisper) and visual change detection (MOG2 algorithm) to extract discrete state pairs $(s_t, s_{t+1})$.
UI Element Detection: Processes keyframes with OmniParser to generate structured graphs of UI elements (bounding boxes, types, labels).
Inverse Dynamics Inference ($f_{IDM}$): A VLM analyzes consecutive keyframe pairs, UI element graphs, and subtitle context to infer the action taken. Crucially, it filters out "meaningless" transitions (e.g., idle frames) and generates a Strategic Narrative ("Thought & Action NLP") that explains why an action was taken, not just what was done.
Knowledge Decomposition: The inferred trajectories are decomposed into two knowledge types:
- Planning Knowledge: A coordinate-free, step-by-step execution flow and expert insights (e.g., "In GIMP, contrast is under Colors, not Image").
- Grounding Knowledge: Descriptions of up to 15 key UI elements, focusing on visual appearance, relative position, and predicted function rather than absolute coordinates.
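
The last two steps, filtering idle transitions and splitting the result into planning and grounding knowledge, can be sketched as follows. This is a simplified illustration: the VLM's inferred thought and action are written by hand, "idle" is approximated as "the set of visible labels did not change" (an assumption that would miss, for example, a slider drag), and every class and function name is hypothetical.

```python
from dataclasses import dataclass, field

MAX_GROUNDING_ELEMENTS = 15  # cap on described UI elements, as stated in the text


@dataclass(frozen=True)
class UIElement:
    label: str
    kind: str         # e.g. "menu", "slider"
    description: str  # appearance and relative position, no absolute coordinates


@dataclass(frozen=True)
class Transition:
    before: frozenset  # labels visible in s_t
    after: frozenset   # labels visible in s_{t+1}
    thought: str       # why the action was taken (would come from the VLM)
    action: str        # what was done


@dataclass
class Knowledge:
    planning: list = field(default_factory=list)
    grounding: list = field(default_factory=list)


def is_meaningful(t):
    """Assumption: a transition is idle if the visible labels are unchanged."""
    return t.before != t.after


def decompose(transitions, elements):
    """Drop idle transitions, then split what remains into the two knowledge types."""
    kept = [t for t in transitions if is_meaningful(t)]
    # Planning knowledge: coordinate-free steps that keep the "why" with the "what".
    planning = [f"Step {i}: {t.action} ({t.thought})"
                for i, t in enumerate(kept, 1)]
    # Grounding knowledge: only elements involved in a kept transition, capped.
    touched = set().union(*(t.before | t.after for t in kept))
    grounding = [e for e in elements if e.label in touched]
    return Knowledge(planning, grounding[:MAX_GROUNDING_ELEMENTS])
```

Keeping the two outputs separate lets a downstream agent route planning notes and grounding notes to different consumers, or inject only one of them.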

C. Plug-and-Play Agent Integration

The extracted knowledge is injected into the target GUI agent without modifying its weights or architecture.

Mode A (Multi-Agent): Planning knowledge guides the "Worker" agent's task decomposition; Grounding knowledge assists the "Grounding Agent" in coordinate prediction.
Mode B (Single-Model): Both knowledge streams are injected into the system prompt as "External Knowledge," guiding the model's Chain-of-Thought (CoT) reasoning. The agent is instructed to treat this as a reference, verifying it against the current screenshot to handle version differences.
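
Mode B amounts to prompt construction, which a short sketch can make concrete. The function name, the section wording, and the default cap of five grounding notes are assumptions for illustration. The cap mirrors the element-count ablation reported in the results section; the actual prompt template is not given in this summary.

```python
def build_system_prompt(base_prompt, planning_steps, grounding_notes, max_elements=5):
    """Append retrieved knowledge to the system prompt as a reference block.

    The agent's weights and architecture are untouched; only the prompt changes.
    """
    if not planning_steps and not grounding_notes:
        return base_prompt  # nothing retrieved: the agent runs exactly as before

    lines = [base_prompt, "", "External Knowledge (reference, not directive):"]
    if planning_steps:
        lines.append("Planning:")
        lines += [f"{i}. {step}" for i, step in enumerate(planning_steps, 1)]
    if grounding_notes:
        lines.append("Grounding:")
        # Assumption: keep only the first few notes to limit prompt noise.
        lines += [f"- {note}" for note in grounding_notes[:max_elements]]
    lines.append("Verify each hint against the current screenshot; "
                 "if they disagree, trust the screenshot.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_system_prompt(
        "You are a GUI agent.",
        ["Open the Colors menu.", "Choose Brightness-Contrast."],
        ["'Contrast' is a horizontal slider directly below 'Brightness'."],
    ))
```

Placing the verification instruction after the knowledge block keeps the "reference, not directive" framing close to the hints it qualifies, which matters when the tutorial was recorded on a different software version.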

3. Key Contributions

Autonomous Learning Paradigm: A novel approach that bridges the gap between general VLM capabilities and domain-specific tasks using live web resources, eliminating the need for manual annotation or fine-tuning.
Subtitle-Driven Video-RAG: A retrieval pipeline that utilizes subtitles for progressive semantic filtering, achieving significantly higher precision than title-based keyword matching.
Inverse Dynamics Annotation: A fully automated pipeline that produces transferable, coordinate-free planning and grounding knowledge, addressing both manifestations of domain bias.
Architecture-Agnostic Design: The framework functions as a modular component for both single-model agents (e.g., Qwen3-VL, Seed-1.8) and multi-agent systems (e.g., AgentS3).

4. Experimental Results

The framework was evaluated on OSWorld, a benchmark of 369 real-world computer tasks across 10 application domains.

Performance Gains: GUIDE consistently improved performance across three different agent architectures:
- Seed-1.8: +7.48% overall improvement.
- Qwen3-VL-8B: +5.83% overall improvement.
- AgentS3 (Multi-Agent): +4.47% overall improvement.
Comparison: It outperformed the closest related work, Watch & Learn, which achieved only +2.2% under comparable settings.
Ablation Studies:
- Planning vs. Grounding: Planning knowledge contributed ~86–91% of the total gain, indicating that workflow reasoning is the primary bottleneck. Grounding knowledge provided complementary gains, particularly in domains with complex UIs (e.g., GIMP, Calc).
- Element Count: Performance peaked when injecting ~5 grounding elements per task, balancing information richness with noise.
Efficiency: While per-step latency increased slightly due to knowledge injection, the total number of execution steps decreased for successful tasks, leading to a net reduction in wall-clock time for complex tasks.
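
The efficiency trade-off can be written as a back-of-the-envelope condition. This is a simplification introduced here, not a formula from the evaluation: assume an episode takes $N$ steps at an average per-step latency $\tau$ without GUIDE, and $N'$ steps at latency $\tau'$ with it, ignoring any one-off retrieval and annotation overhead. Then

$$T = N\tau, \qquad T' = N'\tau', \qquad T' < T \iff \frac{N'}{N} < \frac{\tau}{\tau'}.$$

In words, wall-clock time drops whenever the relative reduction in steps outweighs the relative increase in per-step latency, which is consistent with the net reduction being reported for complex, many-step tasks.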

5. Significance and Conclusion

GUIDE changes how GUI agents acquire domain expertise. By treating the internet's vast repository of tutorial videos as a dynamic knowledge base, it addresses the "domain bias" problem without the prohibitive costs of data collection and model retraining.

Scalability: It offers a scalable solution that adapts to new software versions or applications in real-time.
Cost-Effectiveness: The API cost for the full pipeline is approximately $115 for the entire OSWorld benchmark (mostly driven by annotation), which is orders of magnitude cheaper than manual annotation and fine-tuning.
Robustness: While the system can fail if the retrieved video is procedurally mismatched (e.g., a tutorial for a different software version), the "Reference, not directive" integration strategy ensures the agent prioritizes its own visual observations, maintaining robustness against domain shifts.
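
To put the cost figure above in per-task terms, a quick calculation helps. It assumes the reported total is spread evenly over all 369 OSWorld tasks, which the summary does not state; the constant and function names are illustrative.

```python
TOTAL_API_COST_USD = 115.0  # approximate full-pipeline cost, from the text
NUM_TASKS = 369             # OSWorld benchmark size, from the text


def cost_per_task(total_cost, num_tasks):
    """Average API cost per task, assuming an even spread across tasks."""
    return total_cost / num_tasks


if __name__ == "__main__":
    # Prints 0.31, i.e. roughly 31 cents per task.
    print(f"{cost_per_task(TOTAL_API_COST_USD, NUM_TASKS):.2f}")
```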

In summary, GUIDE demonstrates that Retrieval-Augmented Generation (RAG) applied to video content, combined with inverse dynamics inference, is a powerful, training-free mechanism to align general-purpose AI agents with specific, real-world software tasks.