Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

The paper introduces GUIPruner, a training-free framework that employs Temporal-Adaptive Resolution and Stratified Structure-aware Pruning to eliminate spatio-temporal redundancy in high-resolution GUI agents, achieving significant speedups while maintaining over 94% of the original performance.

Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao

Published 2026-02-27

Imagine you are trying to teach a robot to use your smartphone or navigate a website. You show the robot a series of screenshots (like a comic strip of what happened before) and the current screen, asking it to figure out what to click next.

The problem is that screenshots are huge. If you show the robot 10 or 20 high-resolution images, it gets overwhelmed. It's like trying to read a 500-page novel to find one specific sentence: the robot wastes time reading pages that don't matter, slows down, and sometimes makes up things that aren't there (hallucinations).

This paper introduces GUIPruner, a smart "editor" that helps the robot focus only on what matters, without needing to be retrained. It solves two main problems using two clever tricks:

1. The "Fading Memory" Trick (For Old Screens)

The Problem:
When you remember a task, you remember the very last thing you did in perfect detail. But you remember what you did 10 minutes ago only as a fuzzy outline.
Existing robots, however, treat every old screenshot with the same high detail as the current one. It's like trying to remember the exact color of your shirt from three years ago while you are trying to solve a math problem right now. It's a waste of brainpower.

The Solution: Temporal-Adaptive Resolution (TAR)
GUIPruner acts like a human with a "fading memory."

  • The Recent Past: It keeps the most recent screenshots in High Definition (crystal clear).
  • The Distant Past: As the screenshots get older, it slowly shrinks them down, turning them into blurry thumbnails.
  • The Analogy: Imagine looking at a long trail of footprints. You look closely at the footprints right next to you to see where you are stepping. But for the footprints from an hour ago, you just glance at the general path. You don't need to count the pebbles in the dirt from an hour ago to know where to step next. This saves a massive amount of energy.
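The fading-memory idea above boils down to a simple schedule: the newest screenshots keep full resolution, and each older one is shrunk by a decay factor down to some floor. A minimal sketch of such a schedule (the resolution, decay rate, and cutoff values here are illustrative assumptions, not the paper's actual settings):

```python
def tar_resolutions(num_frames, full_res=(1344, 756), keep_recent=2,
                    decay=0.5, min_scale=0.25):
    """Assign a target resolution to each screenshot in the history.

    Frames are ordered oldest-first; the last frame is the current screen.
    The `keep_recent` newest frames stay at full resolution; each step
    further into the past halves the scale, floored at `min_scale`.
    """
    resolutions = []
    for idx in range(num_frames):
        age = num_frames - 1 - idx  # age 0 = current screen
        if age < keep_recent:
            scale = 1.0  # recent past: crystal clear
        else:
            # distant past: exponentially fading, but never below the floor
            scale = max(min_scale, decay ** (age - keep_recent + 1))
        w, h = full_res
        resolutions.append((round(w * scale), round(h * scale)))
    return resolutions
```

Calling `tar_resolutions(5)` keeps the two newest frames at 1344x756 and shrinks the oldest down to a quarter-size thumbnail, which is where the token savings come from: a half-scale image has only a quarter of the visual tokens.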

2. The "Blueprint" Trick (For the Current Screen)

The Problem:
A typical app screen is mostly empty space (background) with a few buttons and text boxes (foreground).
If you just randomly delete the "boring" background parts to save space, you might accidentally delete the grid lines that tell the robot where things are.

  • The Danger: If you delete the grid, the robot might think a button is in the top-left corner when it's actually in the bottom-right. This is called a "spatial hallucination"—the robot sees a button that isn't there or clicks the wrong spot.

The Solution: Stratified Structure-aware Pruning (SSP)
GUIPruner acts like a careful architect who knows how to renovate a house without knocking down the load-bearing walls. It keeps three specific things:

  1. The Stars (Foreground): It keeps the interactive buttons and input boxes in high detail. These are the "actors" on stage.
  2. The Context (Important Background): It keeps a few key background clues that help the robot understand the scene (like a menu bar or a logo).
  3. The Blueprint (The Grid): This is the magic part. Even if it deletes most of the empty space, it leaves behind a skeleton grid (like a wireframe).
  • The Analogy: Imagine you are describing a city to a blind person. You don't need to describe every single brick in every building. But you must keep the street grid and the major intersections. GUIPruner keeps the "streets" (the grid) so the robot never gets lost, even if it deletes 70% of the "buildings" (the empty pixels).
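The three-layer keep rule above can be sketched as a token-selection mask over the current screen's patch grid: keep the highest-saliency patches (foreground), a smaller slice of the next-most-salient ones (context), and a coarse lattice of anchor patches so spatial layout survives. The scoring source, ratios, and stride here are illustrative assumptions rather than the paper's actual procedure:

```python
def ssp_keep_mask(scores, grid_w, grid_h, fg_ratio=0.2,
                  ctx_ratio=0.1, anchor_stride=4):
    """Decide which visual tokens of the current screen to keep.

    `scores` is a flat, row-major list of per-token saliency scores
    (e.g. attention received from the text prompt). Returns the set of
    token indices to retain.
    """
    n = grid_w * grid_h
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_fg = int(n * fg_ratio)
    keep = set(order[:n_fg])                      # 1. foreground: the "stars"
    background = order[n_fg:]
    keep.update(background[:int(n * ctx_ratio)])  # 2. context: key background clues
    # 3. blueprint: a sparse lattice of anchors so the robot never
    #    loses track of where things sit on the screen
    for r in range(0, grid_h, anchor_stride):
        for c in range(0, grid_w, anchor_stride):
            keep.add(r * grid_w + c)
    return keep
```

Even with aggressive ratios, the anchor lattice guarantees that every region of the screen keeps at least one positional reference point, which is what guards against the "spatial hallucination" failure described above.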

The Result: A Super-Efficient Robot

By combining these two tricks, the paper shows that the robot becomes incredibly fast and smart:

  • Speed: It runs 3.3 times faster because it isn't wasting time processing old screenshots at full resolution or empty white space.
  • Smarts: It actually makes fewer mistakes than before because it isn't confused by too much noise.
  • No Training Needed: You don't have to teach the robot a new way of thinking. You just plug this "editor" in front of it, and it works immediately.

In short: GUIPruner teaches the robot to remember the recent past clearly, the distant past vaguely, and to always keep the map (the grid) intact so it never gets lost. It's the difference between a robot that is drowning in data and one that is a master navigator.
