Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

The paper introduces GUIPruner, a training-free framework that employs Temporal-Adaptive Resolution and Stratified Structure-aware Pruning to eliminate spatio-temporal redundancy in high-resolution GUI agents, achieving significant speedups while maintaining over 94% of the original performance.

Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao

Published 2026-02-27

Imagine you are trying to teach a robot to use your smartphone or navigate a website. You show the robot a series of screenshots (like a comic strip of what happened before) and the current screen, asking it to figure out what to click next.

The problem is that screenshots are huge. If you show the robot 10 or 20 high-resolution images, it gets overwhelmed. It's like trying to read a 500-page novel to find one specific sentence: the robot wastes time reading pages that don't matter, slows down, and sometimes makes up things that aren't there (hallucinations).

This paper introduces GUIPruner, a smart "editor" that helps the robot focus only on what matters, without needing to be retrained. It solves two main problems using two clever tricks:

1. The "Fading Memory" Trick (For Old Screens)

The Problem:
When you remember a task, you remember the very last thing you did in perfect detail. But you remember what you did 10 minutes ago only as a fuzzy outline.
Existing robots, however, treat every old screenshot with the same high detail as the current one. It's like trying to remember the exact color of your shirt from three years ago while you are trying to solve a math problem right now. It's a waste of brainpower.

The Solution: Temporal-Adaptive Resolution (TAR)
GUIPruner acts like a human with a "fading memory."

  • The Recent Past: It keeps the most recent screenshots in High Definition (crystal clear).
  • The Distant Past: As the screenshots get older, it slowly shrinks them down, turning them into blurry thumbnails.
  • The Analogy: Imagine looking at a long trail of footprints. You look closely at the footprints right next to you to see where you are stepping. But for the footprints from an hour ago, you just glance at the general path. You don't need to count the pebbles in the dirt from an hour ago to know where to step next. This saves a massive amount of energy.
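The fading-memory idea above boils down to a simple schedule: the newest screenshots keep full resolution, and each older one is shrunk by a decay factor down to some floor. A minimal sketch of such a schedule (the resolution, decay rate, and cutoff values here are illustrative assumptions, not the paper's actual settings):

```python
def tar_resolutions(num_frames, full_res=(1344, 756), keep_recent=2,
                    decay=0.5, min_scale=0.25):
    """Assign a target resolution to each screenshot in the history.

    Frames are ordered oldest-first; the last frame is the current screen.
    The `keep_recent` newest frames stay at full resolution; each step
    further into the past halves the scale, floored at `min_scale`.
    """
    resolutions = []
    for idx in range(num_frames):
        age = num_frames - 1 - idx  # age 0 = current screen
        if age < keep_recent:
            scale = 1.0  # recent past: crystal clear
        else:
            # distant past: exponentially fading, but never below the floor
            scale = max(min_scale, decay ** (age - keep_recent + 1))
        w, h = full_res
        resolutions.append((round(w * scale), round(h * scale)))
    return resolutions
```

Calling `tar_resolutions(5)` keeps the two newest frames at 1344x756 and shrinks the oldest down to a quarter-size thumbnail, which is where the token savings come from: a half-scale image has only a quarter of the visual tokens.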

2. The "Blueprint" Trick (For the Current Screen)

The Problem:
A typical app screen is mostly empty space (background) with a few buttons and text boxes (foreground).
If you just randomly delete the "boring" background parts to save space, you might accidentally delete the grid lines that tell the robot where things are.

  • The Danger: If you delete the grid, the robot might think a button is in the top-left corner when it's actually in the bottom-right. This is called a "spatial hallucination"—the robot sees a button that isn't there or clicks the wrong spot.

The Solution: Stratified Structure-aware Pruning (SSP)
GUIPruner acts like a careful architect who knows how to renovate a house without knocking down the load-bearing walls. It keeps three specific things:

  1. The Stars (Foreground): It keeps the interactive buttons and input boxes in high detail. These are the "actors" on stage.
  2. The Context (Important Background): It keeps a few key background clues that help the robot understand the scene (like a menu bar or a logo).
  3. The Blueprint (The Grid): This is the magic part. Even if it deletes most of the empty space, it leaves behind a skeleton grid (like a wireframe).
  • The Analogy: Imagine you are describing a city to a blind person. You don't need to describe every single brick in every building. But you must keep the street grid and the major intersections. GUIPruner keeps the "streets" (the grid) so the robot never gets lost, even if it deletes 70% of the "buildings" (the empty pixels).
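The three-layer keep rule above can be sketched as a token-selection mask over the current screen's patch grid: keep the highest-saliency patches (foreground), a smaller slice of the next-most-salient ones (context), and a coarse lattice of anchor patches so spatial layout survives. The scoring source, ratios, and stride here are illustrative assumptions rather than the paper's actual procedure:

```python
def ssp_keep_mask(scores, grid_w, grid_h, fg_ratio=0.2,
                  ctx_ratio=0.1, anchor_stride=4):
    """Decide which visual tokens of the current screen to keep.

    `scores` is a flat, row-major list of per-token saliency scores
    (e.g. attention received from the text prompt). Returns the set of
    token indices to retain.
    """
    n = grid_w * grid_h
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_fg = int(n * fg_ratio)
    keep = set(order[:n_fg])                      # 1. foreground: the "stars"
    background = order[n_fg:]
    keep.update(background[:int(n * ctx_ratio)])  # 2. context: key background clues
    # 3. blueprint: a sparse lattice of anchors so the robot never
    #    loses track of where things sit on the screen
    for r in range(0, grid_h, anchor_stride):
        for c in range(0, grid_w, anchor_stride):
            keep.add(r * grid_w + c)
    return keep
```

Even with aggressive ratios, the anchor lattice guarantees that every region of the screen keeps at least one positional reference point, which is what guards against the "spatial hallucination" failure described above.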

The Result: A Super-Efficient Robot

By combining these two tricks, the paper shows that the robot becomes incredibly fast and smart:

  • Speed: It runs 3.3 times faster because it isn't wasting time processing old screenshots at full resolution or empty white space.
  • Smarts: It actually makes fewer mistakes than before because it isn't confused by too much noise.
  • No Training Needed: You don't have to teach the robot a new way of thinking. You just plug this "editor" in front of it, and it works immediately.

In short: GUIPruner teaches the robot to remember the recent past clearly, the distant past vaguely, and to always keep the map (the grid) intact so it never gets lost. It's the difference between a robot that is drowning in data and one that is a master navigator.
