Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation

Imagine you are trying to teach a very smart but slightly forgetful robot how to complete a complex, multi-step chore, like "Organize a business meeting, send the invite, and set an alarm."

If you ask a standard AI (a "Single-Agent") to do this, it tries to be the CEO, the Manager, and the Janitor all at the same time. It has to figure out the big picture strategy while trying to find the exact pixel on the screen to click.

The Problem: It gets overwhelmed. It forgets where it was in the process (like, "Did I already copy the link?"), and it gets confused between high-level planning and low-level clicking. It's like trying to write a novel while simultaneously tying your shoelaces; you'll likely trip over your laces or forget the plot.

This paper introduces a new way to train AI called CES (Coordinator-Executor-State Tracker). Think of it as hiring a specialized team instead of one overworked superhero.

The Team: A Three-Person Orchestra

Instead of one brain doing everything, the authors split the job into three distinct roles:

The Coordinator (The General):
- Role: This is the strategist. It looks at the big goal ("Organize a meeting") and the current situation, then breaks it down into tiny, simple orders like "Click the Zoom icon" or "Type 'Meeting'."
- Analogy: It's like a military general looking at a map and saying, "We need to take the hill. First, send the scouts. Then, move the tanks." It doesn't touch the tanks; it just gives the orders.
The Executor (The Soldier):
- Role: This is the muscle. It only does one thing: look at the screen and click exactly where the General told it to. It doesn't need to know why it's clicking or what the final goal is.
- Analogy: It's the soldier who just follows the order "Move to coordinate X, Y." It's fast, precise, and doesn't get distracted by the big picture.
The State Tracker (The Scribe):
- Role: This is the memory keeper. Long tasks involve switching between many apps (Zoom, Email, Tumblr, Clock). Screenshots are bad at showing progress because they look the same (e.g., the "Home" screen looks the same whether you are at step 1 or step 10). The Scribe reads what the Soldier did and writes a short, clear summary: "We have copied the link from Zoom. Next, we need to open Tumblr."
- Analogy: Imagine a scribe in a medieval castle. Every time the King (Coordinator) makes a decision, the scribe updates the scroll. If the King gets distracted, the scribe can say, "Sire, we are currently in the kitchen, not the throne room, and we have already found the bread."

The Secret Sauce: "Learning by Doing" (Reinforcement Learning)

How do they teach this team? They don't just show them examples (which is expensive and slow). Instead, they use a method called Staged Execution-Feedback Reinforcement Learning.

Think of it like coaching a team around one player whose skills are already fixed:

Freeze the Soldier: They take a pre-trained, very good "Executor" (the Soldier) and lock its brain so it doesn't change. This Soldier acts as the "judge": whether its action matches the correct answer is what gets scored.
Train the General and Scribe: The General and Scribe try to give orders and write summaries.
The Feedback Loop: The frozen Soldier tries to follow the orders.
- If the Soldier succeeds, the General and Scribe get a "Good Job!" (Reward).
- If the Soldier fails (e.g., clicks the wrong button), the General and Scribe get a "Try again" (Penalty).
The Magic: Because the Soldier is frozen and reliable, the General and Scribe learn exactly what kind of instructions and summaries lead to success. They learn to speak the Soldier's language perfectly.

Why This Matters

The paper reports that, by separating the roles:

The General gets really good at planning without getting confused by the details.
The Scribe gets really good at remembering the context, so the team never loses its place in a long task.
The Soldier stays focused on just clicking.

The Result: The system can handle "Long-Horizon" tasks (complex, multi-step jobs) much better than previous methods. It's like upgrading from a one-person band trying to play a symphony to a full orchestra where every musician knows their part perfectly.

In short: Don't ask one AI to do everything. Give it a General to plan, a Scribe to remember, and a Soldier to act, and train them to work together.

1. Problem Statement

The paper addresses the critical limitations of current Graphical User Interface (GUI) agents when handling long-horizon tasks (complex tasks requiring many sequential steps across multiple applications). The authors identify two fundamental challenges in existing single-agent architectures:

Responsibility Coupling and Capability Conflict: Current end-to-end models attempt to unify high-level strategic planning (task decomposition, reasoning) and low-level execution (visual grounding, precise action) within a single policy network. This creates an optimization conflict where the model struggles to master both abstract reasoning and precise pixel-level control simultaneously, leading to "catastrophic collapse" as task complexity increases.
Lack of Task State Awareness: Long-horizon tasks require the agent to track progress over time. Existing methods rely on raw screenshots or low-level action histories (e.g., "Click x,y") as context. The authors demonstrate via preliminary experiments that screenshots are insufficient for state representation; agents fail to determine their position in a task timeline when facing repeated screens (e.g., Home screens) or out-of-distribution (OOD) interfaces, leading to "state loss" and task failure.

2. Methodology: The CES Framework

To resolve these issues, the authors propose the Coordinator-Executor-State Tracker (CES) framework, a multi-agent system that structurally decouples the automation process into three specialized roles, inspired by a modern operating system (CPU, I/O, and Memory).

A. Agent Roles

Coordinator (The "CPU"):
- Role: Strategic planning and task decomposition.
- Input: High-level user instruction, current screenshot, and a compressed state summary from the State Tracker.
- Output: Clear, executable "atomic instructions" (e.g., "Click the search bar") for the Executor. It does not perform actions itself.
Executor (The "I/O Device"):
- Role: Precise action execution.
- Function: A frozen, pre-trained Vision-Language Model (VLM) that takes the atomic instruction and current screenshot to generate specific actions (coordinates, text input). It is not trained in this framework, acting as a plug-and-play component to provide verifiable feedback.
State Tracker (The "Dynamic Memory"):
- Role: Context compression and state management.
- Function: A language model that observes the Executor's output and updates a high-semantic, natural language summary of the task progress. It filters visual noise and maintains a coherent record of the task state, solving the "state loss" problem.
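
The division of labor above can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: the three roles are modelled as plain Python callables, and the names `run_episode`, `Trace`, and the toy role functions are hypothetical. The exact inputs of the State Tracker are also an assumption (here: goal, previous summary, instruction, and the Executor's action).

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical signatures: each role is a plain callable for illustration only.
Coordinator = Callable[[str, str, str], str]        # (goal, screenshot, state) -> atomic instruction
Executor = Callable[[str, str], str]                # (instruction, screenshot) -> concrete action
StateTracker = Callable[[str, str, str, str], str]  # (goal, old state, instruction, action) -> new state


@dataclass
class Trace:
    instructions: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    states: List[str] = field(default_factory=list)


def run_episode(goal: str, screenshots: List[str], coordinator: Coordinator,
                executor: Executor, tracker: StateTracker,
                initial_state: str = "No progress yet.") -> Trace:
    """Run one Coordinator -> Executor -> State Tracker cycle per screenshot."""
    state = initial_state
    trace = Trace()
    for shot in screenshots:
        instruction = coordinator(goal, shot, state)       # plan: decide the next atomic step
        action = executor(instruction, shot)               # act: ground the step on the screen
        state = tracker(goal, state, instruction, action)  # remember: compress progress into text
        trace.instructions.append(instruction)
        trace.actions.append(action)
        trace.states.append(state)
    return trace


# Toy stand-ins for the three models (assumptions, not real VLM calls).
def toy_coordinator(goal: str, screenshot: str, state: str) -> str:
    return "Open Tumblr" if "copied" in state else "Copy the Zoom link"


def toy_executor(instruction: str, screenshot: str) -> str:
    return f"CLICK[{instruction}]"


def toy_tracker(goal: str, state: str, instruction: str, action: str) -> str:
    if "Copy" in instruction:
        return "Link copied from Zoom."
    return state + " Tumblr opened."


# The same "home" screenshot appears twice; only the state summary tells the steps apart.
trace = run_episode("Share the Zoom link on Tumblr", ["home", "home"],
                    toy_coordinator, toy_executor, toy_tracker)
```

In this toy run the two screenshots are identical, yet the Coordinator issues different instructions because the State Tracker's summary changed between steps, which is the "state loss" scenario the framework is meant to address.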

B. Training Strategy: Staged Execution-Feedback RL

Instead of training a unified policy, the authors introduce a Staged Execution-Feedback Reinforcement Learning (RL) algorithm. The core insight is to use the frozen Executor's verifiable results to train the high-level agents.

Warm-up SFT: Both Coordinator and State Tracker undergo Supervised Fine-Tuning (SFT) to learn basic roles and output formats.
Stage 1: Optimizing the Coordinator:
- The State Tracker is temporarily bypassed (using ground-truth states).
- The Coordinator generates atomic instructions.
- The frozen Executor executes these instructions.
- Reward: A rule-based "Execution-Feedback Reward" is calculated based on whether the Executor's action matches the ground truth. This reward is then used as the RL training signal to update the Coordinator's planning policy.
Stage 2: Optimizing the State Tracker:
- The Coordinator is now frozen.
- The State Tracker generates state summaries.
- These summaries are fed to the frozen Coordinator and Executor.
- Reward: The final Execution-Feedback Reward from the Executor is used to optimize the State Tracker. The goal is to teach the State Tracker to generate state summaries that enable the Coordinator to make the best possible decisions.
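
The staged schedule above can be written down as a small table of which role is trained, frozen, or bypassed at each step. The sketch below is only a bookkeeping aid under stated assumptions: the `Stage` dataclass, the stage names, and the `validate` helper are hypothetical, and listing the Executor as "frozen" during warm-up SFT simply reflects that it is never trained in this framework.

```python
from dataclasses import dataclass
from typing import Tuple

ROLES = ("coordinator", "state_tracker", "executor")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: Tuple[str, ...]  # roles whose weights are updated
    frozen: Tuple[str, ...]     # roles used with fixed weights
    bypassed: Tuple[str, ...]   # roles replaced by ground-truth data
    signal: str                 # what drives the update


SCHEDULE = (
    Stage("warmup_sft", ("coordinator", "state_tracker"), ("executor",), (),
          "supervised labels"),
    # Stage 1: the State Tracker is bypassed with ground-truth states.
    Stage("stage1_rl", ("coordinator",), ("executor",), ("state_tracker",),
          "execution-feedback reward"),
    # Stage 2: the Coordinator is frozen; only the State Tracker learns.
    Stage("stage2_rl", ("state_tracker",), ("coordinator", "executor"), (),
          "execution-feedback reward"),
)


def validate(schedule: Tuple[Stage, ...]) -> bool:
    """Check that every role has exactly one status per stage and the Executor never trains."""
    for stage in schedule:
        assigned = stage.trainable + stage.frozen + stage.bypassed
        if sorted(assigned) != sorted(ROLES):
            raise ValueError(f"{stage.name}: each role needs exactly one status")
        if "executor" in stage.trainable:
            raise ValueError(f"{stage.name}: the Executor must stay frozen")
    return True


schedule_ok = validate(SCHEDULE)
```

Keeping only one high-level role trainable per RL stage is what lets the reward be attributed to that role alone.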

Reward Function: The reward combines format compliance and execution success ($R = \alpha_1 R_{format} + \alpha_2 R_{executor}$), where $\alpha_1$ and $\alpha_2$ are weighting coefficients and $R_{executor}$ evaluates action type and parameter correctness.
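
As a rough illustration of how such a rule-based reward could be computed, the sketch below implements the weighted sum with toy scoring rules. Everything specific here is an assumption rather than the paper's definition: the `<instruction>` tag format, the half-credit split between action type and parameters, the 50-pixel click tolerance, and the default weights `alpha_format=0.1` and `alpha_exec=0.9` (standing in for the two weighting coefficients).

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Action:
    kind: str                               # e.g. "click", "type", "back"
    point: Optional[Tuple[int, int]] = None  # screen coordinates for clicks
    text: Optional[str] = None               # typed text


def format_reward(output: str) -> float:
    """1.0 if the output is wrapped in the (assumed) <instruction> tags, else 0.0."""
    pattern = r"\s*<instruction>.+?</instruction>\s*"
    return 1.0 if re.fullmatch(pattern, output, re.DOTALL) else 0.0


def executor_reward(pred: Action, gold: Action, tol: int = 50) -> float:
    """Score action type first, then parameters (assumed 0.5 + 0.5 split)."""
    if pred.kind != gold.kind:
        return 0.0
    score = 0.5  # action type matches
    if gold.kind == "click":
        if pred.point is not None and gold.point is not None:
            dx = pred.point[0] - gold.point[0]
            dy = pred.point[1] - gold.point[1]
            if dx * dx + dy * dy <= tol * tol:  # within tol pixels of the target
                score += 0.5
    elif gold.kind == "type":
        if pred.text is not None and gold.text is not None:
            if pred.text.strip().lower() == gold.text.strip().lower():
                score += 0.5
    else:
        score += 0.5  # parameter-free actions only need the right type
    return score


def total_reward(output: str, pred: Action, gold: Action,
                 alpha_format: float = 0.1, alpha_exec: float = 0.9) -> float:
    """Weighted sum: alpha_format * R_format + alpha_exec * R_executor."""
    return alpha_format * format_reward(output) + alpha_exec * executor_reward(pred, gold)


good = total_reward("<instruction>Click the search bar</instruction>",
                    Action("click", point=(100, 200)), Action("click", point=(110, 205)))
wrong_type = total_reward("<instruction>Click the search bar</instruction>",
                          Action("type", text="hello"), Action("click", point=(110, 205)))
```

A well-formed instruction that leads the Executor to the wrong action type still earns the small format term, which is why the execution term is given the larger weight in this sketch.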

3. Key Contributions

CES Multi-Agent Framework: A novel architecture that decouples high-level planning and state management from low-level execution, allowing specialized optimization for each role.
State Tracker Module: A dedicated agent for dynamic context compression and high-semantic state summarization, effectively solving the long-horizon state awareness bottleneck.
Staged Execution-Feedback RL: A training paradigm that freezes the Executor and uses its verifiable execution results as a reward signal to exclusively train high-level scheduling models, avoiding optimization conflicts.
Plug-and-Play Generalization: The framework is designed to integrate with any pre-trained Executor model, enhancing its long-horizon capabilities without retraining the Executor itself.

4. Experimental Results

The authors evaluated CES on three long-horizon benchmarks: AITZ, AMEX, and GUI-Odyssey.

Performance Gains: CES significantly outperformed existing baselines (including SFT and RL-based single agents like GUI-R1). On the GUI-Odyssey benchmark, CES improved the Success Rate (SR) by over 15% compared to the best baseline (GUI-R1-7B).
Generalization: The framework was tested with Executors of varying sizes (3B, 7B, 32B).
- Small Models (3B): Showed significant degradation when forced to act as a single agent (CES-P via prompting) due to capability conflicts. However, when integrated into CES, they saw massive improvements, proving the framework alleviates the burden on smaller models.
- Large Models (32B): Also improved, confirming that even powerful models benefit from the decoupled architecture.
Ablation Studies: Removing the Coordinator or State Tracker caused significant performance drops, validating the necessity of both components. Furthermore, using only SFT (without RL) resulted in suboptimal performance, highlighting the importance of execution-feedback RL.
Failure Analysis: The framework almost completely eliminated "State Loss" and "Planning Error" failures, shifting the primary bottleneck to the Executor's inherent perception limits.

5. Significance

This paper argues for a shift in how GUI automation agents are built and trained:

From Monolithic to Modular: It moves away from the "one-model-fits-all" approach, demonstrating that specialized division of labor yields superior results for complex tasks.
Solving the State Problem: It provides a concrete, trainable solution to the long-standing issue of state tracking in long-horizon tasks, which previous methods largely ignored or handled via brittle memory mechanisms.
Efficient Training: By freezing the expensive Executor and only training high-level schedulers, the method is computationally efficient and highly adaptable to new base models.
Practical Impact: The "plug-and-play" nature of CES means existing powerful VLMs can be immediately upgraded to handle complex, multi-step workflows without retraining their core visual perception capabilities.

In conclusion, the CES framework establishes a new standard for long-horizon GUI automation by leveraging structured multi-agent collaboration and execution-feedback reinforcement learning to overcome the limitations of single-agent architectures.
