Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression

The paper proposes ST-Lite, a training-free KV cache compression framework that leverages the uniform high-sparsity of GUI attention patterns through a dual-branch scoring policy of spatial saliency and trajectory-aware semantic gating, achieving significant decoding acceleration with minimal performance loss in long-horizon GUI agents.

Bowen Zhou, Zhou Xu, Wanli Li, Jingyu Xiao, Haoqian Wang

Published 2026-03-03

The Big Problem: The "Overloaded Backpack"

Imagine you are teaching a robot to navigate a complex video game or a computer interface. To make smart decisions, the robot needs to remember everything it has seen and done so far. In AI terms, this memory is called the KV Cache.

Think of the KV Cache as a backpack the robot carries.

  • Short tasks: If the robot just needs to click a "Submit" button, the backpack is light. Easy!
  • Long tasks: If the robot has to solve a 50-step puzzle (like booking a flight, then a hotel, then a car), the backpack gets heavy. It fills up with every single screenshot and action from the last hour.

Eventually, the backpack becomes so heavy (consuming too much computer memory) and so full of junk (redundant data) that the robot moves in slow motion. It can't think fast enough to be useful in real-time.
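To put rough numbers on the backpack, here is a minimal back-of-the-envelope sketch. The model shape (28 layers, 4 key/value heads, head size 128) and the 1,500 tokens per step are assumptions chosen for illustration; they are not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Approximate KV cache size: keys and values for every layer, head, and token."""
    # The factor of 2 covers both the key tensor and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical model shape and per-step token count, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128
TOKENS_PER_STEP = 1500  # assumed: one screenshot plus its action text

for steps in (1, 10, 50):
    size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, steps * TOKENS_PER_STEP)
    print(f"{steps:>2} steps: {size / 2**20:.1f} MiB")
```

The point of the sketch is only that the cache grows linearly with the number of steps: a 50-step task carries 50 times the memory of a single step unless something is thrown away.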

The Old Solutions: Why They Failed

Researchers tried to fix this by making the backpack lighter, but they used the wrong tools:

  1. The "Recent Memory" Trick (SnapKV): This method lets the robot's most recent observations decide which older memories are worth keeping.
    • The Flaw: Imagine you are looking for a specific red button on a screen. If only the last 3 seconds get a vote, the button can be thrown out because it appeared 10 seconds ago and nothing recent points to it. The robot forgets the critical clue because it was too focused on the "now."
  2. The "Layered" Trick (PyramidKV): This method assumes that some parts of the brain (layers) need more memory than others, like a pyramid.
    • The Flaw: Computer screens (GUIs) are different from movies or photos. They are made of distinct, separate blocks (buttons, icons, text), and the robot's attention to them is uniformly sparse across every layer rather than shaped like a pyramid. The "pyramid" approach gives some layers a much smaller memory budget than others, so it can accidentally delete the very buttons the robot needs to click.
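The difference between a pyramid-shaped and a uniform memory budget can be sketched in a few lines. This is a simplified illustration, not PyramidKV's actual allocation formula: the linear decay, the `ratio` parameter, and both function names are assumptions made for this example.

```python
def uniform_budgets(num_layers, total_budget):
    """Give every layer the same share of the total token budget."""
    return [total_budget / num_layers] * num_layers


def pyramid_budgets(num_layers, total_budget, ratio=4.0):
    """Give the first layer `ratio` times the last layer's budget, shrinking linearly."""
    if num_layers == 1:
        return [float(total_budget)]
    # Weights fall in equal steps from `ratio` (first layer) down to 1 (last layer).
    weights = [ratio - (ratio - 1.0) * i / (num_layers - 1) for i in range(num_layers)]
    scale = total_budget / sum(weights)
    return [w * scale for w in weights]


uniform = uniform_budgets(4, 400)
pyramid = pyramid_budgets(4, 400)
for layer, (u, p) in enumerate(zip(uniform, pyramid)):
    print(f"layer {layer}: uniform={u:.0f} tokens, pyramid={p:.0f} tokens")
```

Both schemes spend the same total, but the pyramid starves the later layers. If those layers attend to the same few buttons as every other layer, the starved layers lose them first.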

The New Solution: ST-Lite (The Smart Organizer)

The authors created ST-Lite, a new way to organize the robot's backpack. It doesn't require retraining the robot (no extra homework); it just changes how it packs.

ST-Lite uses two clever strategies to keep the backpack light but useful:

1. CSS: The "Spotlight on the Stage" (Component-centric Spatial Saliency)

The Analogy: Imagine a stage with a dark background and a few actors holding bright props.

  • Old Way: The robot tries to remember every inch of the dark background (the wallpaper, the empty space).
  • ST-Lite (CSS): This module acts like a smart spotlight. It looks at the screen and says, "Hey, the background is boring and uniform. But look at this button! It has a sharp edge and a different color. That's important!"
  • Result: It throws away the "boring background" pixels and keeps the "actors" (buttons, icons, text). It preserves the structure of the interface so the robot knows exactly where to click.
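The spotlight idea can be sketched as a local-contrast score over screen patches. This is a toy stand-in, not the paper's CSS formula: the brightness grid, the 4-neighbor comparison, and the function name `spatial_saliency` are all assumptions made for illustration.

```python
def spatial_saliency(grid):
    """Score each patch by its mean absolute difference from its 4 neighbors."""
    rows, cols = len(grid), len(grid[0])
    scores = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            diffs = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    diffs.append(abs(grid[r][c] - grid[nr][nc]))
            # A patch with no neighbors (1x1 grid) has nothing to contrast with.
            scores[r][c] = sum(diffs) / len(diffs) if diffs else 0.0
    return scores


# Hypothetical 4x4 screen: dim uniform background (0.1) with a bright "button" (0.9).
screen = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
for row in spatial_saliency(screen):
    print(" ".join(f"{s:.2f}" for s in row))
```

Patches deep in the uniform background score zero, while the button and the patches along its edge score high, which is the behavior the stage analogy describes.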

2. TSG: The "Time Travel Filter" (Trajectory-aware Semantic Gating)

The Analogy: Imagine you are watching a movie where the camera stays on a static wall for 10 minutes, then cuts to a new scene.

  • Old Way: The robot remembers every single frame of the static wall. It's a waste of space because nothing changed.
  • ST-Lite (TSG): This module acts like a video editor. It compares the current frame with the past. If the screen hasn't changed much (e.g., the robot is just waiting), it says, "We already know this. Delete the old copy." It only keeps the new information that actually changed the story.
  • Result: It stops the robot from being confused by "stale" information. It keeps the history fresh and relevant.
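The video-editor idea can be sketched as a similarity check between the current frame and each past frame. This is a minimal sketch under stated assumptions: representing a frame as a small feature vector, using cosine similarity, the 0.95 threshold, and the name `gate_history` are all hypothetical choices, not details from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def gate_history(history, current, threshold=0.95):
    """Return indices of past frames that differ enough from the current frame to keep."""
    return [i for i, frame in enumerate(history) if cosine(frame, current) < threshold]


# Hypothetical frame features: frames 0 and 2 look almost identical to the current screen.
history = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.99, 0.05, 0.0]]
current = [1.0, 0.0, 0.0]
print("past frames kept:", gate_history(history, current))
```

Past frames that the current screen already covers are dropped as stale copies, and only the frame that shows something different stays in the backpack.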

The Magic Result: "Less is More"

The most surprising finding in the paper is that ST-Lite actually makes the robot smarter in some cases.

  • The "Noise" Problem: When a robot remembers too much (the full backpack), it gets confused by irrelevant details. It's like trying to find a needle in a haystack that keeps getting bigger.
  • The ST-Lite Effect: By aggressively cutting out the junk (the background and the repetitive frames), ST-Lite removes the "noise."
  • The Outcome: With only 10-20% of the original memory, the robot runs 2.45 times faster and often makes better decisions than when it had the full memory. It's like cleaning a cluttered desk so you can actually find your tools.
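Putting the two strategies together under a fixed memory budget might look like the sketch below. How ST-Lite actually combines its two branches is not described here, so multiplying the spatial score by a 0-or-1 temporal gate is an assumption, as are the function name `select_tokens` and the example scores.

```python
def select_tokens(spatial_scores, temporal_gates, budget=0.2):
    """Keep the top `budget` fraction of tokens ranked by gated spatial score."""
    n = len(spatial_scores)
    keep = max(1, int(budget * n))  # always keep at least one token
    # A gate of 0 marks a token as stale, which zeroes out its score.
    combined = [s * g for s, g in zip(spatial_scores, temporal_gates)]
    ranked = sorted(range(n), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:keep])  # return kept indices in their original order


# Hypothetical scores for 10 cached tokens; token 2 is salient but stale (gate 0).
spatial = [0.9, 0.1, 0.8, 0.05, 0.7, 0.1, 0.02, 0.6, 0.1, 0.03]
gates = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
print("tokens kept at a 20% budget:", select_tokens(spatial, gates, budget=0.2))
```

With a 20% budget only two of the ten tokens survive, and the stale token loses its place even though it looked important on the spatial score alone.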

Summary

ST-Lite is a smart packing system for AI robots.

  • It ignores the boring background (CSS).
  • It deletes repetitive history (TSG).
  • It keeps the critical buttons and new changes.

This could let powerful AI agents run on more modest hardware (closer to your laptop than to a rack of expensive servers), making them faster and better suited to real-world use.