CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning

This paper introduces CGL, a continual GUI learning framework that mitigates catastrophic forgetting by dynamically balancing Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), using an entropy-guided proportion-adjustment mechanism and a specialized gradient-surgery strategy. The approach is validated on a new benchmark, AndroidControl-CL.

Zhenquan Yao, Zitong Huang, Yihan Zeng, Jianhua Han, Hang Xu, Chun-Mei Feng, Jianwei Ma, Wangmeng Zuo

Published Tue, 10 Ma

Imagine you have a very smart, digital personal assistant named Alex. Alex is great at helping you use your phone: finding recipes, booking flights, or checking your email. But here's the problem: phone apps change all the time. Buttons move, menus get redesigned, and new features appear.

If you teach Alex how to use a brand-new version of an app, he might get so focused on the new instructions that he forgets how to use the old version. It's like if you learned to drive a new car with a touchscreen, and suddenly you forgot how to use the old car with physical knobs. This is called "catastrophic forgetting," and it's a huge headache for AI researchers.

This paper introduces a new method called CGL (Continual GUI Learning) to solve this problem. Here is how it works, explained through simple analogies:

The Problem: Two Bad Ways to Learn

The researchers found that current AI training methods are like two different types of students, both with flaws:

  1. The "Crammer" (Supervised Fine-Tuning / SFT):
    • How they learn: They memorize the new instructions perfectly and quickly.
    • The flaw: When they learn the new stuff, they overwrite their old notes. It's like a student who studies for a new math test so hard they forget how to do the old multiplication tables. They are fast at learning new things but terrible at remembering old ones.
  2. The "Slow Learner" (Reinforcement Learning / RL):
    • How they learn: They try to figure things out by trial and error, like a toddler learning to walk. They are very careful not to forget what they already know.
    • The flaw: They are incredibly slow. If the app changes, they might spend days just guessing, whereas the "Crammer" would have figured it out in minutes.
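The contrast between the two learners can be sketched with toy objectives. The paper's exact losses aren't given here, so this is a generic illustration: cross-entropy on a demonstrated action for SFT, and a REINFORCE-style reward-weighted log-probability for RL.

```python
import math

# Toy policy: probabilities over 3 GUI actions (click, scroll, type).
probs = [0.2, 0.5, 0.3]

# SFT ("Crammer"): cross-entropy against the demonstrated action.
# Even one demonstration gives a strong signal, so learning is fast.
demo_action = 0
sft_loss = -math.log(probs[demo_action])

# RL ("Slow Learner"): REINFORCE-style objective. The model samples an
# action, observes a reward, and scales the update by that reward, so
# a wrong guess produces no learning signal at all.
sampled_action, reward = 1, 0.0   # a failed guess earns zero reward
rl_loss = -reward * math.log(probs[sampled_action])

print(round(sft_loss, 3))   # strong signal from a single demonstration
print(rl_loss)              # zero signal until a guess succeeds
```

This is why the "Crammer" is fast (every example teaches) while the "Slow Learner" can stall for a long time when all its guesses fail.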

The Solution: The "Balanced Coach" (CGL)

The CGL framework acts like a wise coach who knows when to use the "Crammer" and when to use the "Slow Learner." It combines the best of both worlds using three clever tricks:

1. The "Emergency Brake" (Error-Aware Routing)

Imagine the AI is trying to learn a new app feature by guessing (RL). If it keeps guessing wrong and getting stuck, the coach steps in.

  • The Metaphor: It's like a driving instructor seeing a student spin their wheels in the mud. Instead of letting them keep spinning, the instructor says, "Okay, stop guessing. Here is the exact path (SFT). Follow this."
  • Result: The AI gets the speed of the "Crammer" only when it's truly stuck, saving time without losing its memory.
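The "emergency brake" can be sketched as a simple routing rule: keep using RL while rollouts still sometimes succeed, and switch to SFT once the last few all failed. The function name and the `max_failed_rollouts` threshold below are illustrative assumptions, not details from the paper.

```python
def choose_mode(recent_rewards, max_failed_rollouts=4):
    """Return 'RL' while trial-and-error is making progress, or 'SFT'
    once the last `max_failed_rollouts` rollouts all failed (reward 0)."""
    recent = recent_rewards[-max_failed_rollouts:]
    stuck = len(recent) == max_failed_rollouts and all(r == 0 for r in recent)
    return "SFT" if stuck else "RL"

print(choose_mode([1, 0, 1, 0]))   # "RL": still succeeding sometimes
print(choose_mode([0, 0, 0, 0]))   # "SFT": spinning its wheels in the mud
```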

2. The "Confidence Thermostat" (Entropy-Regulated Tuning)

The coach watches how "confused" the AI is.

  • The Metaphor: Think of the AI's brain like a room.
    • High Confusion (High Entropy): The room is chaotic. The coach turns up the heat (increases the "Crammer" mode) to shake things up and force the AI to learn the new rules.
    • Low Confusion (Low Entropy): The room is calm and organized. The coach turns down the heat (decreases the "Crammer" mode) and lets the AI settle into its routine so it doesn't accidentally erase its old memories.
  • Result: The AI learns fast when it's confused but stabilizes when it's getting the hang of things.
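One plausible way to turn the thermostat metaphor into code is to normalize the policy's entropy over candidate actions and use it as the SFT mixing weight. The clipping bounds and the normalization below are illustrative assumptions, not the paper's exact formula.

```python
import math

def sft_proportion(probs, floor=0.1, ceiling=0.9):
    """Map the policy's entropy to an SFT weight: confused (high
    entropy) -> more "Crammer", confident (low entropy) -> more RL."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))     # entropy of a uniform guess
    ratio = entropy / max_entropy          # normalize to [0, 1]
    return min(ceiling, max(floor, ratio))

confused  = [0.25, 0.25, 0.25, 0.25]   # no idea which button to press
confident = [0.97, 0.01, 0.01, 0.01]   # settled into a routine

print(sft_proportion(confused))    # near the ceiling: mostly SFT
print(sft_proportion(confident))   # near the floor: mostly RL
```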

3. The "Conflict Filter" (Gradient Surgery)

Sometimes, the instructions for the new task clash with the instructions for the old task.

  • The Metaphor: Imagine you are trying to paint a new picture on a canvas, but your brush strokes keep smudging the old masterpiece underneath.
    • The "Conflict Filter" acts like a smart stencil. It looks at the new paint strokes and says, "Okay, this part of the stroke helps us learn the new thing, but this part will ruin the old painting."
    • It cuts out the "ruining" part and only applies the helpful part of the new instruction.
  • Result: The AI learns the new app without smudging the old one.
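A common form of gradient surgery (in the style of PCGrad) matches the stencil metaphor: if the new-task gradient points against the old-task gradient, project out the conflicting component before applying the update. Whether CGL uses exactly this projection is an assumption made here for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def surgery(g_new, g_old):
    """Remove from g_new the component that would undo the old task."""
    conflict = dot(g_new, g_old)
    if conflict >= 0:          # no clash: apply the stroke as-is
        return g_new
    # Subtract g_new's projection onto g_old, keeping only the
    # direction that doesn't "smudge the old painting".
    scale = conflict / dot(g_old, g_old)
    return [gn - scale * go for gn, go in zip(g_new, g_old)]

g_new = [1.0, -1.0]   # new-task update that would damage the old skill
g_old = [0.0, 1.0]    # direction that preserves the old skill
cleaned = surgery(g_new, g_old)
print(cleaned)               # [1.0, 0.0]: helpful part kept
print(dot(cleaned, g_old))   # 0.0: no longer conflicts with the old task
```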

The New Playground: AndroidControl-CL

To prove this works, the researchers built a new test track called AndroidControl-CL.

  • The Metaphor: Instead of testing the AI on just one app, they created a "gym" with 7 different types of apps (Shopping, Travel, Office, etc.).
  • They made the AI learn them one by one, like a student taking classes in a sequence.
  • The Result: The CGL method was the only one that could learn the new classes quickly while still retaining what it had learned in the earlier ones.
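Continual-learning benchmarks like this one typically score a model by re-testing every earlier task after each new one. The sketch below shows the standard bookkeeping (average final accuracy, plus "forgetting" as each task's best-ever accuracy minus its final accuracy); the numbers are made up, not results from the paper.

```python
# acc[i][j] = accuracy on task j after training on task i (made-up data).
acc = [
    [0.90, 0.00, 0.00],   # after task 1
    [0.88, 0.85, 0.00],   # after task 2: task 1 barely dropped
    [0.87, 0.84, 0.83],   # after task 3
]

final = acc[-1]
avg_accuracy = sum(final) / len(final)

# Forgetting per earlier task: best accuracy it ever had, minus final.
forgetting = [max(row[j] for row in acc) - final[j]
              for j in range(len(final) - 1)]

print(round(avg_accuracy, 3))           # 0.847
print([round(f, 3) for f in forgetting])  # [0.03, 0.01]: small drops
```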

Why This Matters

In the real world, apps update constantly. If your AI assistant forgets how to use your banking app every time the bank updates its design, it's useless.

CGL is the key to building an AI that grows with you. It learns new tricks quickly without forgetting the old ones, making it a truly reliable, lifelong digital companion.