CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks

Imagine you are trying to teach a very smart, but slightly naive, robot how to use a smartphone. Your goal is for the robot to be able to tap, swipe, and type just like a human to order food, check bank balances, or book a doctor's appointment.

This paper introduces CRAFT-GUI, a new way of training these robots. Think of it as a personalized, step-by-step gym plan for your robot's brain, rather than just throwing it into a chaotic obstacle course.

Here is the breakdown using simple analogies:

1. The Problem: The "One-Size-Fits-All" Mistake

Previously, when training these AI agents, researchers treated every task the same.

The Old Way: Imagine a teacher trying to teach a child to read. They hand the child a picture book, a dictionary, and a PhD thesis all at once, saying, "Figure it out!" The child gets overwhelmed, confused, and learns nothing.
The Reality: Some phone tasks are easy (like "tap the 'back' button"). Others are hard (like "find a specific restaurant, check the menu, switch the delivery address to a specific floor, and pay with a specific card").
The Flaw: Old methods didn't notice the difference. They gave the robot the same "reward" (a pat on the head) for solving a simple puzzle as for solving a hard math problem. This made the robot's learning messy and inefficient.

2. The Solution: The "Curriculum" (Schooling)

The authors propose CRAFT-GUI, which stands for Curriculum-Reinforced Agent For GUI Tasks.

Think of this as a school system for the robot:

Kindergarten (Stage 1): The robot starts with very easy tasks (1–3 steps). "Tap the green button." It learns the basics of how to move its finger.
Middle School (Stage 2): Once it masters the basics, it moves to medium tasks (4–8 steps). "Open the app, find the settings, and change the volume."
University (Stage 3): Finally, it tackles complex, multi-step challenges (8+ steps) that require understanding context. "Find a pizza place, check if they deliver to my new address, and pay using my saved card."

By forcing the robot to master simple things before moving to hard things, it builds a solid foundation, just like a human student.
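The staging idea can be sketched in a few lines of Python. This is only an illustration: the function names `assign_stage` and `build_curriculum` are hypothetical, and because the step ranges above overlap at 8, the sketch assumes an 8-step task belongs to Stage 2.

```python
from collections import defaultdict


def assign_stage(num_steps: int) -> int:
    """Map a task's step count to a curriculum stage (1, 2, or 3).

    Boundaries follow the stage descriptions above.
    Assumption: an 8-step task counts as Stage 2, so Stage 3 starts at 9 steps.
    """
    if num_steps < 1:
        raise ValueError("a task needs at least one step")
    if num_steps <= 3:
        return 1
    if num_steps <= 8:
        return 2
    return 3


def build_curriculum(tasks):
    """Group (name, num_steps) pairs into three lists, easiest stage first."""
    stages = defaultdict(list)
    for name, num_steps in tasks:
        stages[assign_stage(num_steps)].append(name)
    return [stages[stage] for stage in (1, 2, 3)]


# Hypothetical tasks, listed out of order on purpose.
tasks = [("pay for pizza", 12), ("tap back", 1), ("change volume", 5)]
curriculum = build_curriculum(tasks)
```

Training would then walk through `curriculum` in order, moving to the next list only after the current one has been practiced.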

3. The Coach: The "Smart Reward System"

In training AI, the "reward" is like a coach telling the student, "Good job!" or "Try again."

The Old Coach: Was very blunt. "Did you finish the task? Yes? Here is a cookie. No? No cookie." It didn't care how you did it.
The CRAFT-GUI Coach: Is a detail-oriented mentor.
- Did you click the right button? Yes? +1 point.
- Did you swipe in the right direction? Yes? +1 point.
- Did you type the address correctly? Yes? +1 point.
- Did you talk too much? The coach gently says, "You're rambling; let's keep it concise," and gives a small penalty.

This "fine-grained" feedback helps the robot understand why it succeeded or failed, not just that it did.
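A rough sketch of what such a coach could look like in code is shown here. The field names (`target`, `direction`, `text`), the 256-token limit, and the 0.5 penalty are all assumptions made for illustration, not values from the paper.

```python
def operation_reward(predicted: dict, expected: dict,
                     response_tokens: int, max_tokens: int = 256,
                     verbosity_penalty: float = 0.5) -> float:
    """Score one action: +1 per matching field, minus a small penalty if too long.

    Only fields present in `expected` are checked, so a tap is not
    judged on swipe direction. All names and constants are illustrative.
    """
    score = 0.0
    for field in ("target", "direction", "text"):
        if field in expected and predicted.get(field) == expected[field]:
            score += 1.0
    if response_tokens > max_tokens:
        score -= verbosity_penalty  # the "you're rambling" penalty
    return score


expected = {"target": "address_box", "text": "12 Main St"}
predicted = {"target": "address_box", "text": "12 Main Street"}

# Right box (+1), wrong text (+0), too many tokens (-0.5).
partial = operation_reward(predicted, expected, response_tokens=300)
# Everything right and concise: +1 for the box, +1 for the text.
perfect = operation_reward(expected, expected, response_tokens=40)
```

The point of the partial score is that the robot still gets credit for finding the right box even though the typed address was wrong.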

4. The Results: From Novice to Pro

The researchers tested this method on two types of challenges:

Public Benchmarks: Standard tests everyone uses (like AndroidWorld).
Private "Real World" Tests: A custom dataset with 80,000 real-life scenarios (food delivery, banking, gaming, etc.).

The Outcome:

The new method (CRAFT-GUI) beat the previous best methods by a significant margin (about 7% to 10% better).
In the real-world tests, the robot went from being a clumsy beginner to a highly competent assistant, successfully completing complex tasks that previous robots failed at.

Summary

CRAFT-GUI is like upgrading from a chaotic "trial and error" approach to a structured, intelligent tutoring system.

It teaches the robot step-by-step (Curriculum).
It gives specific, helpful feedback on every move (Fine-grained Rewards).
It mixes doing (clicking) with thinking (understanding the screen) to create a truly smart agent.

The result is an AI that doesn't just blindly tap screens but actually understands how to navigate our digital world, one step at a time.

Format Penalty: Outputs must follow the prescribed tag structure to ensure reasoning transparency and operational consistency.

Length Penalty ($P_{length}$): An adaptive penalty that prevents "overthinking" (explosive token generation), inspired by DAPO. It applies graduated penalties to outputs that exceed a threshold $L_{max}$.
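The exact shape of the penalty is not given in this summary. One plausible reading of "graduated" is a linear ramp, sketched below. The `ramp` parameter (how many tokens past $L_{max}$ it takes to reach the full penalty) and the cap at 1.0 are assumptions.

```python
def length_penalty(num_tokens: int, l_max: int, ramp: int) -> float:
    """Return a penalty in [0, 1] for an output of `num_tokens` tokens.

    No penalty up to `l_max`; beyond it the penalty grows linearly and
    saturates at 1.0 once the output reaches `l_max + ramp` tokens.
    The linear ramp and the cap are illustrative assumptions.
    """
    if ramp <= 0:
        raise ValueError("ramp must be positive")
    if num_tokens <= l_max:
        return 0.0
    return min(1.0, (num_tokens - l_max) / ramp)


within_limit = length_penalty(100, l_max=200, ramp=100)   # no penalty
halfway = length_penalty(250, l_max=200, ramp=100)        # half penalty
far_over = length_penalty(400, l_max=200, ramp=100)       # capped
```

A ramp, unlike a hard cutoff, keeps the penalty proportional to how far the output overshoots, so slightly long answers are only mildly discouraged.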

2. Visual Understanding Tasks:
For tasks like VQA and element localization, the reward ($R_{understanding}$) combines:

Semantic Reward ($R_{sem}$): Evaluated via an LLM-as-a-judge approach to assess alignment with ground truth text.
Format and Length Penalties: Similar to operation tasks to maintain structure and efficiency.
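How the three terms are combined is not spelled out here, so the following is only a sketch that assumes a simple subtraction. The `judge` callable stands in for the LLM-as-a-judge call, and `toy_judge` is a word-overlap placeholder so the example runs without a model; both are hypothetical.

```python
from typing import Callable


def understanding_reward(answer: str, reference: str,
                         judge: Callable[[str, str], float],
                         format_ok: bool, length_penalty: float,
                         format_penalty: float = 1.0) -> float:
    """Combine a judged semantic score with format and length penalties.

    Assumption: the terms are combined by subtraction; the real
    weighting may differ.
    """
    r_sem = max(0.0, min(1.0, judge(answer, reference)))  # clamp to [0, 1]
    reward = r_sem - length_penalty
    if not format_ok:
        reward -= format_penalty
    return reward


def toy_judge(answer: str, reference: str) -> float:
    """Placeholder judge: fraction of reference words found in the answer."""
    ref_words = set(reference.lower().split())
    ans_words = set(answer.lower().split())
    return len(ref_words & ans_words) / len(ref_words) if ref_words else 0.0


good = understanding_reward("the blue submit button", "blue submit button",
                            toy_judge, format_ok=True, length_penalty=0.0)
malformed = understanding_reward("the blue submit button", "blue submit button",
                                 toy_judge, format_ok=False, length_penalty=0.0)
```

In practice the judge would be a model prompted to compare the answer with the ground truth text; the overlap heuristic is here only to keep the example self-contained.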

3. Key Contributions

Curriculum RL Strategy: A systematic approach that progresses from simple to complex GUI tasks based on trajectory characteristics, stabilizing training and improving sample efficiency.
Fine-Grained Hybrid Rewards: A novel reward mechanism combining rule-based verifiable signals (for operations) and model-judged evaluation (for semantics), providing rich, multi-dimensional feedback.
Joint Training of Operations and Understanding: The framework simultaneously trains agents on low-level action competence (clicking/swiping) and high-level task comprehension (reasoning/localization), creating a more versatile agent.

4. Experimental Results

The authors evaluated CRAFT-GUI using Qwen2.5-VL (7B and 32B parameters) on both public benchmarks and a private internal dataset.

Public Benchmarks (AndroidWorld):
- CRAFT-GUI (32B, Stage 3) achieved a 51.7% success rate on AndroidWorld.
- This represents a 7.1% improvement over the previous state-of-the-art (UI-TARS-72B at 46.6%).
Private Dataset (6 Mobile App Categories):
- Evaluated on food delivery, dining, medical, finance, insurance, and gaming apps.
- CRAFT-GUI (32B) achieved a 75.7% average success rate.
- This is a 10.3% improvement over the best baseline (Claude-3.7-Sonnet/GPT-4.1).
Ablation Studies:
- Curriculum vs. Vanilla RL: Curriculum learning yielded a 3.8% gain over standard RL and a 14.9% gain over Supervised Fine-Tuning (SFT).
- Data Mixture: Including visual understanding tasks in the curriculum improved operation success rates by 2.5% compared to using operation data alone, demonstrating the benefit of joint training.

5. Significance

CRAFT-GUI addresses the "difficulty gap" in GUI automation by mimicking human learning progression. By moving away from uniform data treatment and coarse rewards, the method achieves:

Stability: Prevents training collapse and over-generation through curriculum staging and length penalties.
Generalization: The agent learns to handle both simple atomic actions and complex, multi-step reasoning tasks effectively.
Scalability: The GRPO-based approach is computationally efficient, making it suitable for training large multimodal models on limited hardware resources.
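For context on the efficiency point: GRPO scores each sampled response relative to the other responses drawn for the same prompt, so no separate value network has to be trained. A minimal sketch of that group-relative normalization follows; the function name and the `eps` constant are illustrative, and using the population standard deviation is an implementation choice.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    responses are ranked against their siblings rather than against a
    learned value estimate. `eps` avoids division by zero when all
    rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: two succeeded, two failed.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Identical rewards carry no learning signal: every advantage is 0.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Successful responses get a positive advantage and failed ones a negative advantage of the same size, which is the signal the policy update then uses.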

This work establishes a new paradigm for training GUI agents, suggesting that difficulty-aware curriculum learning combined with fine-grained verification is essential for the next generation of autonomous software interaction.
