Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression

The Big Problem: The "Overloaded Backpack"

Imagine you are teaching a robot to navigate a complex video game or a computer interface. To make smart decisions, the robot needs to remember everything it has seen and done so far. In AI terms, this memory is called the KV Cache.

Think of the KV Cache as a backpack the robot carries.

Short tasks: If the robot just needs to click a "Submit" button, the backpack is light. Easy!
Long tasks: If the robot has to solve a 50-step puzzle (like booking a flight, then a hotel, then a car), the backpack gets heavy. It fills up with every single screenshot and action from the last hour.

Eventually, the backpack becomes so heavy (consuming too much computer memory) and so full of junk (redundant data) that the robot moves in slow motion. It can't think fast enough to be useful in real-time.

The Old Solutions: Why They Failed

Researchers tried to fix this by making the backpack lighter, but they used the wrong tools:

The "Recent Memory" Trick (SnapKV): This method assumes the robot only needs to remember the last few things it saw.
- The Flaw: Imagine you are looking for a specific red button on a screen. If you only look at the last 3 seconds, you might miss the button because it appeared 10 seconds ago. The robot forgets the critical clue because it was too focused on the "now."
The "Layered" Trick (PyramidKV): This method assumes that some parts of the brain (layers) need more memory than others, like a pyramid.
- The Flaw: Computer screens (GUIs) are different from movies or photos. They are made of distinct, separate blocks (buttons, icons, text), and the robot's attention turns out to be about equally sparse in every layer, so there is no pyramid to exploit. Forcing a "pyramid" budget onto the layers starved some of them and accidentally deleted the very buttons the robot needed to click.

The New Solution: ST-Lite (The Smart Organizer)

The authors created ST-Lite, a new way to organize the robot's backpack. It doesn't require retraining the robot (no extra homework); it just changes how it packs.

ST-Lite uses two clever strategies to keep the backpack light but useful:

1. CSS: The "Spotlight on the Stage" (Component-centric Spatial Saliency)

The Analogy: Imagine a stage with a dark background and a few actors holding bright props.

Old Way: The robot tries to remember every inch of the dark background (the wallpaper, the empty space).
ST-Lite (CSS): This module acts like a smart spotlight. It looks at the screen and says, "Hey, the background is boring and uniform. But look at this button! It has a sharp edge and a different color. That's important!"
Result: It throws away the "boring background" pixels and keeps the "actors" (buttons, icons, text). It preserves the structure of the interface so the robot knows exactly where to click.

2. TSG: The "Time Travel Filter" (Trajectory-aware Semantic Gating)

The Analogy: Imagine you are watching a movie where the camera stays on a static wall for 10 minutes, then cuts to a new scene.

Old Way: The robot remembers every single frame of the static wall. It's a waste of space because nothing changed.
ST-Lite (TSG): This module acts like a video editor. It compares the current frame with the past. If the screen hasn't changed much (e.g., the robot is just waiting), it says, "We already know this. Delete the old copy." It only keeps the new information that actually changed the story.
Result: It stops the robot from being confused by "stale" information. It keeps the history fresh and relevant.

The Magic Result: "Less is More"

The most surprising finding in the paper is that ST-Lite actually makes the robot smarter in some cases.

The "Noise" Problem: When a robot remembers too much (the full backpack), it gets confused by irrelevant details. It's like trying to find a needle in a haystack while someone keeps piling on more hay.
The ST-Lite Effect: By aggressively cutting out the junk (the background and the repetitive frames), ST-Lite removes the "noise."
The Outcome: With only 10-20% of the original memory, the robot generates its answers up to 2.45 times faster and often makes decisions as good as, or better than, the ones it made with the full memory. It's like cleaning a cluttered desk so you can actually find your tools.

Summary

ST-Lite is a smart packing system for AI robots.

It ignores the boring background (CSS).
It deletes repetitive history (TSG).
It keeps the critical buttons and new changes.

This lets powerful AI agents run on ordinary consumer hardware instead of large, expensive GPU servers, making them faster and more practical for real-world use.

1. Problem Statement

Autonomous Graphical User Interface (GUI) agents powered by Large Vision-Language Models (VLMs) face a critical bottleneck during long-horizon interactions: the Key-Value (KV) cache memory footprint and the inference latency both keep growing as the interaction history accumulates.

Context: GUI tasks involve high-resolution screenshots and extended interaction trajectories. As the sequence length increases, the KV cache grows linearly, saturating GPU memory and causing severe latency, which prevents real-time deployment on consumer hardware.
Limitations of Existing Methods:
- Window-based Greedy Methods (e.g., SnapKV): Rely on local observation windows. In long-horizon GUI tasks, they fall into local optima traps, failing to capture global spatio-trajectory dependencies and discarding critical historical UI elements that are far from the current view.
- Hierarchical Allocation Methods (e.g., PyramidKV, VL-Cache): Assume that attention sparsity varies across transformer layers (pyramidal distribution). However, the authors demonstrate that GUI attention patterns exhibit uniform high-sparsity across all layers due to the discrete, structured nature of UI elements. Because of this mismatch, layer-wise budget allocation distorts the retained layout and loses critical UI components.
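To make the linear growth concrete, the following is a back-of-the-envelope sketch of the standard KV cache size formula for a decoder-only transformer. The configuration values (`n_layers`, `n_kv_heads`, `head_dim`, 16-bit storage) are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache for one sequence, in bytes.

    Each layer stores one key vector and one value vector per token and per
    KV head, hence the leading factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical 7B-class configuration (illustrative values, not from the paper).
config = dict(n_layers=28, n_kv_heads=4, head_dim=128, bytes_per_value=2)

for tokens in (4_096, 32_768):
    gib = kv_cache_bytes(tokens, **config) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.2f} GiB")
```

Under these assumed values every token costs 56 KiB of cache, so the footprint scales in direct proportion to the number of screenshot and action tokens kept in the history.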

2. Methodology: ST-Lite Framework

The authors propose ST-Lite (Spatio-Trajectory Lite), a training-free KV cache compression framework designed specifically for the unique characteristics of GUI data streams. It operates by explicitly mining local spatial distinctiveness and trajectory-aware semantic evolution.

The framework consists of two core components:

A. Component-centric Spatial Saliency (CSS)

Goal: Preserve the structural integrity of interactive UI elements (buttons, icons, text) while filtering out uniform background noise.
Mechanism:
- Unlike natural images with smooth textures, GUIs have discrete functional elements.
- CSS utilizes a Moore Neighborhood (3x3 grid) to evaluate the local manifold structure of visual tokens.
- It calculates a Local Uniformity Score based on the average cosine similarity between a central token and its 8 neighbors.
- Spatial Saliency Score ( $\Phi_{space}$ ): Defined as $1 - \text{Uniformity}$. High scores indicate tokens at semantic boundaries (edges of buttons/icons), while low scores indicate uniform backgrounds.
- Result: CSS prioritizes tokens with high local distinctiveness, ensuring the "skeleton" of the GUI is retained even under extreme compression.
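The CSS computation described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names are hypothetical, the visual tokens are assumed to be arranged on a 2D grid of embedding vectors, and border tokens are assumed to average over only the neighbors that exist (the summary does not say how borders are handled).

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def spatial_saliency(grid):
    """Return saliency[r][c] = 1 - mean cosine similarity to Moore neighbors.

    grid[r][c] is the embedding vector of the visual token at row r, column c.
    """
    rows, cols = len(grid), len(grid[0])
    saliency = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            sims = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # skip the central token itself
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols:
                        sims.append(cosine(grid[r][c], grid[nr][nc]))
            # Assumption: a token with no neighbors counts as fully uniform.
            uniformity = sum(sims) / len(sims) if sims else 1.0
            saliency[r][c] = 1.0 - uniformity
    return saliency


# Toy 3x3 "screen": uniform background with one distinct token in the middle.
background, widget = [1.0, 0.0], [0.0, 1.0]
screen = [[background] * 3, [background, widget, background], [background] * 3]
scores = spatial_saliency(screen)
```

In the toy example the central token differs from all eight neighbors and receives the maximum score, while a fully uniform grid would score zero everywhere.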

B. Trajectory-aware Semantic Gating (TSG)

Goal: Eliminate historical redundancy in long interaction sequences.
Mechanism:
- GUI workflows often contain visually repetitive states (e.g., static backgrounds remaining unchanged while the user interacts with a specific element).
- TSG compares historical hidden states ( $H_{his}$ ) with the current frame ( $H_{cur}$ ).
- It calculates a Redundancy Score ( $\rho$ ) for each historical token as its maximum cosine similarity with the current frame.
- Dynamic Thresholding: A dynamic threshold ( $\tau_{red}$ ) is set based on the target cache budget. Tokens with redundancy scores above this threshold are evicted.
- Result: TSG acts as a semantic filter, retaining only unique historical states required for reasoning and pruning "stale" KV pairs that cause context poisoning.
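The gating step above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the rule used to derive $\tau_{red}$ from the budget (take the budget-th smallest redundancy score) is an assumption, since the summary states only that the threshold depends on the target cache budget.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_gate(h_his, h_cur, budget):
    """Score historical tokens by redundancy and gate them against a budget.

    h_his: hidden-state vectors of historical tokens.
    h_cur: hidden-state vectors of the current frame's tokens.
    budget: number of historical tokens we would like to keep.
    Returns (rho, tau, keep_mask).
    """
    # Redundancy of a historical token = its best match in the current frame.
    rho = [max(cosine(h, c) for c in h_cur) for h in h_his]
    if budget <= 0:
        return rho, float("-inf"), [False] * len(rho)
    if budget >= len(rho):
        return rho, float("inf"), [True] * len(rho)
    # Assumed thresholding rule: the budget-th smallest redundancy score.
    tau = sorted(rho)[budget - 1]
    # Tokens whose redundancy exceeds the threshold are evicted.
    keep_mask = [r <= tau for r in rho]
    return rho, tau, keep_mask


history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
current = [[1.0, 0.0]]
rho, tau, keep = semantic_gate(history, current, budget=2)
```

Here the first historical token duplicates the current frame and is evicted, while the two tokens that carry different content survive. With tied scores this rule can keep slightly more than `budget` tokens; a final top-$B$ selection would resolve that.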

C. Integrated Scoring Policy

The final retention score for a token combines the Base Attention Prior (from the observation window), the Spatial Saliency (CSS), and the Semantic Gate (TSG). The system selects the top- $B$ tokens to construct the compressed cache.
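The summary does not give the exact formula that combines the three signals, so the sketch below shows only one plausible combination: an additive bonus `alpha * saliency` on top of the attention prior, with gated-out tokens excluded before the top-$B$ selection. The function name, the weight `alpha`, and the additive form are all assumptions made for illustration.

```python
import heapq


def select_top_b(attn_prior, saliency, gate_keep, budget, alpha=1.0):
    """Pick the indices of the `budget` highest-scoring tokens.

    attn_prior: per-token attention mass from the observation window.
    saliency:   per-token spatial saliency score (CSS).
    gate_keep:  per-token boolean from the semantic gate (TSG).
    alpha:      hypothetical weight of the saliency term.
    """
    scores = [a + alpha * s for a, s in zip(attn_prior, saliency)]
    # Tokens rejected by the gate are never candidates for retention.
    candidates = [i for i, kept in enumerate(gate_keep) if kept]
    top = heapq.nlargest(budget, candidates, key=lambda i: scores[i])
    # Return indices in original order so the cache keeps its token ordering.
    return sorted(top)


kept = select_top_b(
    attn_prior=[0.5, 0.1, 0.3, 0.9],
    saliency=[0.0, 0.8, 0.1, 0.2],
    gate_keep=[True, True, True, False],
    budget=2,
)
```

In this toy case the last token has the highest attention prior but is dropped because the gate marks it as redundant, while the second token is retained mainly on the strength of its saliency.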

3. Key Contributions

Systematic Diagnostic Analysis: The authors rigorously analyzed existing compression methods on GUI benchmarks, identifying a fundamental misalignment: GUIs exhibit uniform high-sparsity across all transformer layers, contradicting the hierarchical assumptions of methods like PyramidKV. They also showed empirically that window-based methods fail to capture global dependencies in long trajectories.
ST-Lite Framework: Introduced a novel, training-free compression strategy that aligns with GUI structural properties. It combines CSS for spatial structural preservation and TSG for historical redundancy filtering.
Empirical Validation: Demonstrated that ST-Lite achieves a superior trade-off between efficiency and performance, enabling high-performance agents to run on memory-constrained hardware without auxiliary training.

4. Experimental Results

The method was evaluated on diverse benchmarks including ScreenSpot Pro, AITW (Android in the Wild), and AgentNetBench.

Performance under Extreme Budgets:
- With only 10-20% of the KV cache budget, ST-Lite achieves 2.45× decoding acceleration.
- It maintains comparable or superior performance to full-cache baselines. For example, on AITW with a 20% budget, ST-Lite achieved a 20.7% success rate, outperforming the full-cache baseline (18.7%).
"Less-is-More" Phenomenon:
- In long-horizon tasks, aggressive compression via TSG actually improves performance over full-cache baselines. This is attributed to the removal of "Context Poisoning" (semantic noise from irrelevant historical frames) which distracts the model.
Ablation Studies:
- CSS was critical for single-frame precision tasks (ScreenSpot Pro), preserving element grounding.
- TSG was critical for multi-step reasoning (AITW, AgentNetBench), preventing performance decay as history length increased.
- The full ST-Lite framework outperformed all baselines (SnapKV, PyramidKV, VL-Cache) across different model architectures (UI-TARS-1.5-7B and OpenCUA-7B).
Efficiency:
- The prefill phase runs at essentially the same speed as the full-cache baseline (~1.0×), so the compression step itself adds negligible overhead.
- The decoding phase sees massive gains (up to 2.45× speedup) due to reduced memory bandwidth pressure.
- End-to-end speedup reaches 1.4× even with the fixed cost of the vision encoder.
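The gap between the 2.45× decoding speedup and the 1.4× end-to-end speedup is what Amdahl's law predicts when only part of the runtime is accelerated. The sketch below is a rough consistency check, assuming that all non-decoding time (vision encoder, prefill) is left unchanged; the function names are hypothetical and the time breakdown it infers is not a figure reported in the paper.

```python
def end_to_end_speedup(decode_fraction, decode_speedup):
    """Amdahl's law: overall speedup when only decoding is accelerated.

    decode_fraction: share of baseline wall-clock time spent decoding (0..1).
    decode_speedup:  speedup factor applied to the decoding phase alone.
    """
    remaining = (1.0 - decode_fraction) + decode_fraction / decode_speedup
    return 1.0 / remaining


def implied_decode_fraction(overall_speedup, decode_speedup):
    """Invert Amdahl's law to recover the decoding share of baseline time."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / decode_speedup)


fraction = implied_decode_fraction(overall_speedup=1.4, decode_speedup=2.45)
print(f"implied decoding share of baseline time: {fraction:.1%}")
```

Under that assumption, the two reported figures are consistent with decoding taking roughly 48% of the baseline runtime, with the remainder being fixed costs that cache compression does not reduce.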

5. Significance

Scalability for Real-World Deployment: ST-Lite provides a scalable solution for deploying autonomous GUI agents on consumer-grade hardware, overcoming the memory and latency barriers that currently limit long-horizon automation.
Paradigm Shift: The paper shifts the compression paradigm from passive retention (keeping everything or keeping based on simple heuristics) to active, semantics-driven selection. Its results indicate that for GUI agents, "less is more" when the "less" is carefully curated to remove redundancy while preserving structurally and temporally critical tokens.
Generalizability: The approach is model-agnostic and training-free, making it immediately applicable to existing VLM-based GUI agents without the need for retraining or fine-tuning.