HiconAgent: History Context-aware Policy Optimization for GUI Agents

Imagine you are trying to teach a robot to navigate a smartphone or a computer screen to complete a task, like booking a flight or buying shoes. This robot is powered by a "brain" (an AI model) that looks at the screen, reads your instructions, and decides what to click next.

The problem is that these tasks often take many steps. To make good decisions, the robot needs to remember what happened in the past (the "history"). But here's the catch: remembering everything is a double-edged sword.

If the robot remembers too little: It forgets where it started or what it just did, leading to confusion (e.g., typing the destination city instead of the departure city).
If the robot remembers too much: It gets overwhelmed. It tries to look at every single screenshot from the last 10 minutes, which slows it down to a crawl and confuses it with irrelevant details (like a student trying to read a whole textbook to answer one specific question).

HiconAgent is a robot trained in a new, smarter way. The researchers call the training method History Context-aware Policy Optimization (HCPO). Think of it as teaching the robot how to remember, rather than just forcing it to remember everything.

Here is how it works, broken down into two simple tricks:

1. The "Dynamic Context Sampling" (The Flexible Memory)

The Analogy: Imagine a student taking a test.

Old Way: The teacher forces the student to look at the last 5 pages of their notes for every single question, even if the answer is right in front of them on the current page. This wastes time and causes distraction.
HiconAgent's Way: The teacher tells the student, "For some questions, just look at the current page. For others, glance back 1 page. For the really hard ones, look back 2 pages."
How it works: During training, the robot is randomly given different amounts of history (0 steps, 1 step, or 2 steps back). It learns to figure out: "Hey, for this specific step, I only need to remember what I did 10 seconds ago. For that other step, I need to remember what I did 2 minutes ago." It learns to be flexible, using just the right amount of memory for the job.

2. The "Anchor-Guided History Compression" (The Highlighter)

The Analogy: Imagine you are reading a long, boring transcript of a conversation.

Old Way: You try to read every single word spoken by everyone, including the "umms," "ahhs," and descriptions of the room. It's exhausting.
HiconAgent's Way: You use a highlighter. You realize that while the visuals (the screenshots) are huge and heavy, the actions (what the user clicked or typed) are the most important "anchors."
How it works: The robot is taught to keep the action history (e.g., "I clicked 'Login'") but to drop the visual history (the actual screenshots of the login screen) after a certain point.
- Think of the action as a bookmark. Even if you throw away the old pages of the book, if you keep the bookmark, you know exactly where you were.
- The robot keeps the "bookmarks" (actions) but deletes the heavy "pages" (screenshots) to save energy and speed, while still knowing the context.

The Result: A Smarter, Faster Robot

By combining these two tricks, the researchers created HiconAgent.

It's smaller but stronger: They trained a model with only 3 billion parameters (a relatively small brain).
It beats the giants: Despite being smaller, it outperformed a much larger 7-billion-parameter model (GUI-R1-7B) on difficult navigation tasks.
It's fast: Because it stops processing redundant old screenshots, it runs up to about 2.5 times faster (2.47×) and needs up to 60% fewer floating-point operations.

In a Nutshell

Previous AI agents were like a person trying to drive a car while reading the entire manual, the map, and the radio transcript all at once. HiconAgent is like a driver who knows exactly when to check the rearview mirror, when to ignore it, and how to keep just the essential notes in their pocket. It makes decisions faster, uses less fuel (computing power), and gets to the destination more reliably.

1. Problem Statement

Graphical User Interface (GUI) agents, powered by Multimodal Large Language Models (MLLMs), rely on historical context (past screenshots and actions) to perform sequential navigation tasks. However, existing approaches face a critical trade-off between decision quality and computational efficiency:

Naive Full History: Incorporating complete historical observations (screenshots) and actions leads to excessive computational overhead due to the quadratic complexity of attention mechanisms and the sheer volume of visual tokens. Furthermore, full history often introduces irrelevant information that distracts the model.
Simplified History: Many current Reinforcement Learning (RL) frameworks omit historical observations entirely, using only past actions as context. While efficient, this discards rich visual cues essential for resolving ambiguous instructions, grounding visually similar elements, and maintaining temporal consistency.

The core challenge is to develop a method that effectively utilizes the most informative parts of historical context while mitigating redundancy, without sacrificing decision quality or incurring high computational costs.
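To make the cost argument behind the full-history option concrete, here is a small back-of-the-envelope sketch. The token counts (`TOKENS_PER_SCREENSHOT`, `TOKENS_PER_ACTION`, `PROMPT_TOKENS`) are illustrative assumptions, not figures from the paper; the only point is that self-attention cost grows with the square of the sequence length, so history screenshots dominate the budget.

```python
# Hypothetical token counts, chosen only for illustration.
TOKENS_PER_SCREENSHOT = 1000
TOKENS_PER_ACTION = 20
PROMPT_TOKENS = 100  # instruction plus output-format description


def context_tokens(history_steps: int, keep_screenshots: bool) -> int:
    """Total input tokens for the current step plus `history_steps` past steps."""
    current = PROMPT_TOKENS + TOKENS_PER_SCREENSHOT
    per_step = TOKENS_PER_ACTION
    if keep_screenshots:
        per_step += TOKENS_PER_SCREENSHOT
    return current + history_steps * per_step


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention (quadratic in length)."""
    return num_tokens * num_tokens


full = context_tokens(4, keep_screenshots=True)
actions_only = context_tokens(4, keep_screenshots=False)
ratio = attention_pairs(full) / attention_pairs(actions_only)
print(f"full history: {full} tokens, actions only: {actions_only} tokens")
print(f"attention cost ratio: {ratio:.1f}x")
```

Under these assumed counts, four steps of screenshot history multiply the attention cost many times over relative to keeping actions alone, which is the tension the two options above trade off.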

2. Methodology: HiconAgent & HCPO

The authors propose HiconAgent, a GUI agent trained with History Context-aware Policy Optimization (HCPO). HCPO is a reinforcement fine-tuning framework that improves both the sampling and policy update phases of training through two complementary components:

A. Dynamic Context Sampling (DCS)

Goal: Address the variability of history dependence across different decision steps. Fixed-length history is often suboptimal.
Mechanism: During the rollout phase, instead of using a fixed history length, DCS samples multiple history variants for each trajectory.
Distribution Strategy: It employs an exponential-biased distribution ($\mathrm{ExpBias}$) rather than a uniform one.
- Early Training: The distribution is nearly uniform, encouraging random exploration of short and long histories.
- Late Training: The distribution shifts to favor larger history lengths ( $\tau$ ), ensuring the model learns to utilize full context when beneficial.
Outcome: This prevents training collapse (degeneration) and forces the policy to adaptively learn which history lengths yield the best rewards for specific tasks.
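The sampling schedule described above can be sketched as follows. The summary does not give the exact form of the exponential-biased distribution, so this sketch assumes $p(\tau) \propto \exp(\beta \tau)$ with $\beta$ growing linearly from 0 over training; `exp_bias_probs`, `sample_history_variants`, `progress`, and `max_beta` are hypothetical names introduced for illustration.

```python
import math
import random


def exp_bias_probs(max_len: int, progress: float, max_beta: float = 2.0) -> list[float]:
    """Probabilities over history lengths 0..max_len.

    Assumed form: p(tau) proportional to exp(beta * tau), where
    beta = max_beta * progress and progress runs from 0 (start) to 1 (end).
    beta = 0 gives a uniform distribution; larger beta favors longer history.
    """
    beta = max_beta * progress
    weights = [math.exp(beta * tau) for tau in range(max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_history_variants(history, max_len, progress, num_variants, rng):
    """Draw several truncated copies of one trajectory's history."""
    probs = exp_bias_probs(max_len, progress)
    lengths = rng.choices(range(max_len + 1), weights=probs, k=num_variants)
    # Keep only the most recent `tau` steps (tau = 0 keeps nothing).
    return [history[max(0, len(history) - tau):] for tau in lengths]


rng = random.Random(0)
history = ["step1", "step2", "step3"]
early = exp_bias_probs(max_len=2, progress=0.0)
late = exp_bias_probs(max_len=2, progress=1.0)
variants = sample_history_variants(history, 2, 0.5, num_variants=4, rng=rng)
```

Early in training every length is equally likely; late in training the longest history carries most of the probability mass, matching the early/late behavior described above.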

B. Anchor-guided History Compression (AHC)

Goal: Reduce computational redundancy during the policy update phase while preserving decision signals.
Key Insight (Layer-wise Analysis): Through empirical analysis, the authors discovered that history actions ($A_{his}$) serve as critical "anchors" for information flow.
- In shallow layers ( $k < 12$ ), dropping action tokens causes massive performance degradation, while dropping visual tokens ( $V_{his}$ ) is tolerable.
- Action tokens aggregate visual semantics and pass them to deeper layers.
Mechanism: AHC uses a dual-branch architecture:
1. Uncompressed Branch: Processes the full history (actions + observations) to generate high-quality responses and advantages.
2. Compressed Branch: After an early fusion depth ( $k$ ), it drops all history visual tokens ( $V_{his}$ ) but retains history action tokens ( $A_{his}$ ) as anchors.
Alignment Loss: The compressed branch is trained to match the output distribution of the uncompressed branch using a history-enhanced alignment loss (KL divergence). This ensures the compressed model retains the decision-making capability of the full-history model despite the reduced input.
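A toy sketch of the two ingredients described above: dropping history visual tokens while keeping action anchors, and a KL term that pulls the compressed branch toward the uncompressed one. This is not the authors' implementation. The token labels, the helper names `keep_mask` and `kl_alignment`, and the KL direction (uncompressed as reference) are assumptions; in a real model the mask would be applied to hidden states after layer $k$ inside the transformer.

```python
import numpy as np


def keep_mask(token_types: list[str]) -> np.ndarray:
    """True for tokens the compressed branch keeps: everything except V_his."""
    return np.array([t != "V_his" for t in token_types])


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def kl_alignment(full_logits: np.ndarray, compressed_logits: np.ndarray) -> float:
    """Mean KL(full || compressed) over output positions (assumed direction)."""
    log_p = log_softmax(full_logits)
    log_q = log_softmax(compressed_logits)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))


# Six input tokens with hidden size 4; labels are hypothetical.
token_types = ["V_his", "A_his", "V_his", "A_his", "V_cur", "text"]
hidden = np.arange(24, dtype=float).reshape(6, 4)
compressed_hidden = hidden[keep_mask(token_types)]  # V_his rows removed

rng = np.random.default_rng(0)
full_logits = rng.normal(size=(3, 10))
compressed_logits = full_logits + 0.5 * rng.normal(size=(3, 10))
loss = kl_alignment(full_logits, compressed_logits)
```

The loss is zero when both branches produce identical output distributions and grows as the compressed branch drifts away, which is the sense in which the compressed branch is "aligned" to the full-history one.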

C. Reward Design

The framework utilizes a rule-based RL approach with three reward components:

Format Reward: Ensures structured output (e.g., <thought>...</thought><action>...</action>).
Action Type Reward: Binary reward for correct action classification (e.g., CLICK, TYPE).
Action Value Reward: F1 score for text, exact match for discrete values, and Euclidean distance-based continuous reward for coordinates.
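The three reward components could be sketched roughly as below. The summary names the ingredients but not the exact formulas, so the regular expression, the whitespace tokenization for F1, and the exponential decay with `scale` in `coordinate_reward` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Assumed output template: a thought block followed by an action block.
FORMAT_RE = re.compile(r"^<thought>.*</thought>\s*<action>.*</action>$", re.DOTALL)


def format_reward(output: str) -> float:
    """1.0 if the output follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_RE.match(output.strip()) else 0.0


def action_type_reward(pred: str, gold: str) -> float:
    """Binary reward for predicting the correct action type (e.g. CLICK, TYPE)."""
    return 1.0 if pred.upper() == gold.upper() else 0.0


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between predicted and reference text."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def coordinate_reward(pred, gold, scale: float = 0.1) -> float:
    """Continuous reward that decays with Euclidean distance.

    Coordinates are assumed normalized to [0, 1]; the decay form is hypothetical.
    """
    return math.exp(-math.dist(pred, gold) / scale)


sample = "<thought>tap the login button</thought><action>CLICK(0.5, 0.2)</action>"
r_format = format_reward(sample)
r_type = action_type_reward("click", "CLICK")
r_text = token_f1("new york city", "new york")
r_coord = coordinate_reward((0.5, 0.2), (0.5, 0.2))
```

How the three components are weighted and summed is not stated in this summary, so the sketch leaves them as separate values.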

3. Key Contributions

Empirical Analysis of History Usage: The paper provides a comprehensive study revealing that different tasks and steps prefer different history lengths and that action tokens act as critical anchors for visual information flow in MLLMs.
HCPO Framework: Introduction of a novel RL framework combining Dynamic Context Sampling (DCS) and Anchor-guided History Compression (AHC). This allows agents to learn adaptive history usage while significantly reducing redundancy.
Efficiency-Performance Breakthrough: Demonstrates that a smaller model (3B parameters) can outperform larger models (7B parameters) by optimizing how history is used, rather than just increasing model size.

4. Experimental Results

The model was evaluated on three mainstream GUI navigation benchmarks: AndroidControl-High, AITW, and GUI-Odyssey.

Performance vs. Larger Models:
- HiconAgent-3B outperforms GUI-R1-7B (a 7B parameter model) on the GUI-Odyssey benchmark by +8.46% in grounding accuracy and +11.32% in step success rate.
- It achieves comparable or superior results on AndroidControl and AITW.
Efficiency Gains:
- Computational Speedup: Up to 2.47× faster inference.
- FLOPs Reduction: Up to 60% reduction in floating-point operations.
Generalization:
- Trained on only 3K unfiltered samples, HiconAgent-3B achieves the highest average step success rate (51.47%) across all benchmarks, outperforming models trained on millions of filtered samples (e.g., OS-Atlas-7B, infiGUI-3B).
Ablation Studies:
- Removing either DCS or AHC significantly reduces performance.
- The exponential-biased sampling distribution proves superior to uniform sampling, preventing the degradation of short-history response quality.
- The dual-branch alignment is crucial; training only the compressed branch without the uncompressed teacher leads to poor performance.

5. Significance

This work shifts the emphasis in training GUI agents. Instead of relying on model scaling or massive datasets, HiconAgent shows that intelligent context management can be a key driver of high performance.

Practicality: It offers a path toward lightweight, high-performance GUI agents that can run efficiently on edge devices or in resource-constrained environments.
Theoretical Insight: The discovery that "actions serve as anchors for visual information" provides a new direction for compressing multimodal sequences in LLMs, suggesting that preserving the sequence of decisions is more critical than preserving every visual frame once initial fusion has occurred.
Scalability: The method achieves state-of-the-art results with minimal data curation, highlighting the effectiveness of the proposed optimization strategy over brute-force data scaling.

