Imagine you have a very smart robot assistant (a Vision Large Language Model, or VLLM) that can look at a photo and answer questions about it. To do this, the robot breaks the photo down into hundreds of tiny puzzle pieces called "tokens."
The problem? The robot tries to look at every single piece of the puzzle, even the ones that are just blank sky or blurry grass. This makes the robot slow, tired, and expensive to run.
To fix this, scientists invented "Token Pruning." This is like telling the robot: "Hey, ignore the boring pieces and just focus on the important ones!"
But here is the twist this paper discovered: Sometimes, trying to be too smart actually makes the robot dumber.
The Big Discovery: The "Information Horizon"
The researchers found that in the deeper parts of the robot's brain (the later layers of its processing), all the puzzle pieces start to look the same.
Think of it like listening to a song with a long fade-out.
- Early in the song (Shallow Layers): You hear the drums, the guitar, and the singer's voice. Each instrument is distinct and important. If you mute the drums, the song sounds different.
- Late in the song (Deep Layers): The music has faded into a long, uniform hum. Every note sounds exactly the same. If you mute one note, it doesn't matter because they all sound like background noise.
The paper calls this point the "Information Horizon."
- Before the Horizon: The robot needs specific pieces of the image to understand the scene. Smart pruning works great here.
- After the Horizon: The token-level visual information has effectively "vanished," absorbed into the model's overall understanding of the scene. Every remaining token is redundant, so it doesn't matter which pieces you throw away.
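The "all pieces start to look the same" idea can be made concrete by measuring how similar the visual tokens are to each other at each layer. Here is a toy sketch (not the paper's actual method; the function names, the 0.95 threshold, and the cosine-similarity criterion are all illustrative assumptions):

```python
import numpy as np

def mean_pairwise_cosine(tokens):
    """Average cosine similarity between all pairs of token vectors.
    Values near 1.0 mean the tokens have collapsed into near-duplicates."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(tokens)
    return (sims.sum() - n) / (n * (n - 1))  # drop the self-similarity diagonal

def find_horizon(layer_tokens, threshold=0.95):
    """Index of the first layer whose tokens are so alike that the choice
    of which ones to prune no longer matters; -1 if no layer crosses it."""
    for i, tokens in enumerate(layer_tokens):
        if mean_pairwise_cosine(tokens) >= threshold:
            return i
    return -1

# Toy demo: two "shallow" layers with varied tokens, one "deep" layer
# where every token points in nearly the same direction.
rng = np.random.default_rng(0)
distinct = rng.normal(size=(8, 16))
collapsed = np.ones((8, 16)) + rng.normal(scale=0.01, size=(8, 16))
print(find_horizon([distinct, distinct, collapsed]))  # → 2
```

In the toy run, the random Gaussian tokens stay dissimilar, so the horizon is only reached at the deliberately collapsed "deep" layer.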
The "Random Pruning" Surprise
The paper's most surprising finding is this: Once you pass the Information Horizon, picking tokens to remove at random works just as well as using complex, fancy algorithms.
Imagine you are packing a suitcase for a trip.
- At the start: You carefully pick your clothes, shoes, and toiletries. You use a smart system to decide what's essential.
- At the end: You are just stuffing in filler. It doesn't matter whether you throw in a sock or a tissue; nothing left in the pile is essential anyway.
The researchers found that existing "smart" pruning methods were burning compute trying to find the "best" pieces to cut in the deep layers. But since every remaining piece was equally useless there, those methods did no better than closing your eyes and deleting pieces at random.
Why Does This Matter?
The paper shows that the "Horizon" isn't the same for everyone. It depends on two things:
- How hard the task is:
- If you ask, "Is this a baseball field?" (Easy task), the robot figures it out quickly. The Horizon is reached early.
- If you ask, "Read this tiny text on a beer label" (Hard task/OCR), the robot needs to look deeper. The Horizon is pushed further back.
- How smart the robot is:
- A super-smart robot (like Qwen-2.5-VL) can find useful clues in deeper layers than a weaker robot (like LLaVA-1.5).
The Solution: The "Hybrid" Approach
Instead of trying to be perfect, the authors suggest a simple, hybrid strategy:
- Early Layers: Use the smart, fancy algorithms to keep the most important visual pieces.
- Deep Layers: Once you hit the "Information Horizon," just randomly delete half the remaining pieces.
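The two-phase strategy above can be sketched in a few lines of Python (a toy illustration, not the paper's implementation; the function name, the attention-style importance scores, and the threshold layer are all assumptions):

```python
import numpy as np

def hybrid_prune(tokens, scores, layer, horizon, keep_ratio=0.5, rng=None):
    """Keep a fraction of the visual tokens at one layer.
    Before the horizon: keep the highest-scoring tokens ("smart" selection,
    e.g. by attention score). At or past it: keep a random subset instead."""
    rng = rng or np.random.default_rng()
    k = max(1, int(len(tokens) * keep_ratio))
    if layer < horizon:
        keep = np.argsort(scores)[-k:]                         # top-k by importance
    else:
        keep = rng.choice(len(tokens), size=k, replace=False)  # coin flips
    return tokens[np.sort(keep)]                               # keep original order

# Ten toy tokens whose "importance" equals their value.
toks = np.arange(10, dtype=float).reshape(10, 1)
scores = np.arange(10, dtype=float)
early = hybrid_prune(toks, scores, layer=2, horizon=8)  # smart: top half survives
late = hybrid_prune(toks, scores, layer=9, horizon=8)   # random: any half survives
print(early.ravel())  # → [5. 6. 7. 8. 9.]
```

The design choice mirrors the paper's argument: the expensive scoring step only pays for itself before the horizon, so past it the sketch swaps to the cheapest possible selector.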
The Result?
By mixing "smart selection" with "random deletion," they made the robot faster and cheaper without losing accuracy. In fact, on some tasks this simple mix performed better than the complex methods alone, because it stopped the robot from overthinking the useless parts.
The Takeaway
You don't need a supercomputer to decide what to throw away when everything is already useless. Sometimes, the most efficient way to speed up an AI is to let it take a random guess at the end of the line.