Imagine you have a very smart, but slightly slow, assistant (a Multimodal Large Language Model, or MLLM) who is trying to solve a puzzle. You give this assistant a giant photo (the visual input) and a question (the text input).
The problem is that the photo is made up of thousands of tiny puzzle pieces (called tokens). To understand the photo, the assistant has to look at every single piece, one by one, and compare it to every other piece. This is like trying to read a book by comparing every letter to every other letter on the page—it takes forever and uses up a massive amount of energy.
Current methods try to speed this up by just throwing away some puzzle pieces early on. But the authors of HiDrop realized that these methods are throwing away the wrong pieces at the wrong times. They are like a chef who throws away the fresh vegetables before they've even been chopped, or keeps stirring a pot long after the soup is done.
Here is how HiDrop fixes this, using three simple ideas:
1. The "Late Arrival" Strategy (Late Injection)
The Problem: Imagine you are in a meeting. The first few minutes are just people sitting down, checking their phones, and getting coffee. The actual work doesn't start until everyone is settled.
The Old Way: Current models try to process the "visual puzzle pieces" from the very first second of the meeting, even though no one is listening yet. It's a waste of time.
The HiDrop Fix: HiDrop says, "Let the assistant ignore the photo completely until the meeting actually starts." It waits until the very moment the "work" begins (the middle layers of the model) before bringing the photo in. This saves a huge amount of energy because the assistant isn't wasting time looking at the photo while it's just "getting coffee."
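The idea above can be sketched in a few lines of Python. Everything here—the function name, the toy "layers," and the injection point—is an illustrative assumption, not HiDrop's actual implementation: the point is just that visual tokens skip the early layers entirely and join the sequence partway through.

```python
# Hypothetical sketch of "late injection": visual tokens are not processed by
# the early transformer layers at all; they join the sequence only at a chosen
# middle layer. The injection layer here is a made-up hyperparameter.

def forward(text_tokens, visual_tokens, layers, inject_at):
    """Run text-only through the early layers, then inject visual tokens."""
    hidden = list(text_tokens)  # early layers see only the text
    for i, layer in enumerate(layers):
        if i == inject_at:
            # The photo arrives only once the "meeting" has actually started.
            hidden = list(visual_tokens) + hidden
        hidden = [layer(h) for h in hidden]
    return hidden

# Toy demo: 32 "layers" that each add 1; inject 100 visual tokens at layer 16.
layers = [lambda h: h + 1 for _ in range(32)]
out = forward([0] * 8, [0] * 100, layers, inject_at=16)
# Visual tokens pass through only layers 16-31, so they accumulate 16, not 32.
print(out[0], out[-1])  # → 16 32 (a visual token vs. a text token)
```

Because the visual tokens skip half the layers, roughly half of their compute simply never happens—that is where the savings come from.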
2. The "Smart Pyramid" (Concave Pyramid Pruning)
The Problem: Once the meeting starts, the team needs to look at the photo. But looking at every single puzzle piece is still too slow. Old methods say, "Let's throw away 10% of the pieces every 5 minutes." This is too rigid. Sometimes you need to throw away a lot quickly; sometimes you need to be careful.
The HiDrop Fix: HiDrop uses a "Smart Pyramid" approach.
- Early in the meeting: The team realizes, "Wow, 90% of these puzzle pieces are just blue sky or empty floor. They aren't important!" They quickly toss those away.
- Later in the meeting: As they get to the interesting parts (the faces, the objects), they slow down and only toss away the truly useless pieces.
- The Result: They keep the "good" pieces and dump the "bad" ones much faster and more intelligently than before, like a funnel that gets narrower exactly where it needs to.
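The pruning schedule described above can be sketched as follows. The exact curve HiDrop uses may differ; this hypothetical `concave_keep_ratios` function only illustrates the shape—the fraction of tokens kept drops steeply in the first pruning stages and flattens out later, and at each stage the lowest-importance tokens are the ones tossed.

```python
# Minimal sketch of a non-uniform pruning schedule: aggressive early, gentle
# late. Function names and the sqrt-shaped curve are illustrative assumptions.

def concave_keep_ratios(num_stages, final_keep=0.1):
    """Keep-ratio per stage: steep drop at first, gentle drop later."""
    ratios = []
    for s in range(1, num_stages + 1):
        t = s / num_stages
        # sqrt(t) rises fast near 0, so the keep-ratio falls fast early on
        ratios.append(1.0 - (1.0 - final_keep) * t ** 0.5)
    return ratios

def prune(scored_tokens, keep_ratio):
    """Keep the highest-scoring fraction of tokens (score = importance)."""
    k = max(1, int(len(scored_tokens) * keep_ratio))
    return sorted(scored_tokens, key=lambda ts: -ts[1])[:k]

ratios = concave_keep_ratios(4)
print([round(r, 2) for r in ratios])  # big first drop, small later drops
print(prune([("sky", 0.1), ("face", 0.9), ("floor", 0.2)], 0.34))
```

Contrast this with the rigid "drop 10% every 5 minutes" baseline: a fixed-rate schedule wastes careful attention on obvious background early, then prunes too bluntly once only informative tokens remain.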
3. The "Stop Sign" (Early Exit)
The Problem: Imagine the team has figured out the puzzle. They know the answer. But they keep staring at the photo for another hour just because the meeting schedule says so.
The Old Way: The model keeps processing the photo until the very end, even when the answer is already obvious.
The HiDrop Fix: HiDrop has a "Stop Sign." Once the team has combined the photo and the question to form a clear idea (usually in the middle of the process), HiDrop says, "Great job! You don't need to look at the photo anymore." It throws the rest of the photo away and lets the assistant finish the job using only their memory of the photo. This is like leaving a party early once you've said your goodbyes.
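This "stop sign" can be sketched as a complement to late injection: past a chosen exit layer, all remaining visual tokens are simply dropped and only the text-side states continue. The exit layer here is a made-up hyperparameter, and the code is a sketch of the idea, not the paper's implementation.

```python
# Hedged sketch of an "early exit" for visual tokens: once the answer has
# "fused" into the text states, the remaining visual tokens are discarded.

def forward_with_exit(visual_tokens, text_tokens, layers, exit_layer):
    """Process visual + text tokens, dropping visual tokens at exit_layer."""
    hidden = list(visual_tokens) + list(text_tokens)  # visual first
    n_visual = len(visual_tokens)
    for i, layer in enumerate(layers):
        if i == exit_layer and n_visual > 0:
            hidden = hidden[n_visual:]  # leave the party: drop the photo
            n_visual = 0
        hidden = [layer(h) for h in hidden]
    return hidden

# Toy demo: 32 "layers" that each add 1; drop 100 visual tokens at layer 16.
layers = [lambda h: h + 1 for _ in range(32)]
out = forward_with_exit([0] * 100, [0] * 8, layers, exit_layer=16)
print(len(out), out[0])  # → 8 32: only text tokens reach the final layer
```

After the exit, the later layers run on 8 tokens instead of 108—the visual information survives only through what it already contributed to the text states, the "memory of the photo."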
The Secret Sauce: "Persistent Name Tags"
When you start throwing pieces away, it gets confusing. "Which piece was number 5? Is it still there?"
HiDrop gives every puzzle piece a permanent name tag (Positional Encoding) that never changes, even if the piece is moved or hidden. This ensures the assistant never gets lost or confused about where things are, even as the pile of pieces shrinks.
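The "name tag" idea reduces to one rule: when tokens are pruned, the survivors keep their original position ids instead of being renumbered. A minimal sketch, with illustrative token names:

```python
# Sketch of "persistent name tags": each token carries its ORIGINAL position
# id, and pruning never renumbers the survivors. Details are illustrative.

tokens = [("sky", 0), ("face", 1), ("floor", 2), ("dog", 3)]

def prune_keep_positions(tokens, keep_indices):
    """Drop tokens but preserve each survivor's original position id."""
    return [tokens[i] for i in sorted(keep_indices)]

survivors = prune_keep_positions(tokens, {1, 3})
print(survivors)  # → [('face', 1), ('dog', 3)] — ids 1 and 3 persist
```

If the survivors were instead renumbered 0, 1, the model's positional encoding would tell it "face" and "dog" are adjacent at the start of the image—exactly the spatial confusion the persistent ids avoid.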
The Result?
By using these three tricks, HiDrop is like turning a slow, heavy truck into a sleek sports car.
- Speed: It trains the model 1.7 times faster.
- Efficiency: It throws away up to 90% of the visual data (the puzzle pieces) with negligible loss in accuracy.
- Smarts: It understands when to look at the picture and when to stop, rather than just blindly processing everything.
In short, HiDrop teaches the AI to be lazy in the right places (ignoring the photo when it's not needed) and efficient in the right places (quickly filtering out the noise), making it faster, cheaper, and just as smart as before.