Stateful Token Reduction for Long-Video Hybrid VLMs

This paper proposes a stateful, progressive token-reduction framework with a unified language-aware scoring mechanism for hybrid video VLMs. By accounting for the layerwise instability of token importance in architectures that combine attention and state-space blocks, it achieves significant prefilling speedups while maintaining near-baseline accuracy.

Jindong Jiang, Amala Sanjay Deshmukh, Kateryna Chumachenko, Karan Sapra, Zhiding Yu, Guilin Liu, Andrew Tao, Pavlo Molchanov, Jan Kautz, Wonmin Byeon

Published 2026-03-03

Imagine you are trying to watch a 2-hour movie to answer a single question: "What color was the protagonist's hat in the third scene?"

Currently, AI models (Vision-Language Models) try to watch the entire movie frame-by-frame, analyzing every single pixel of every single second. It's like hiring a team of 10,000 detectives to read every word of a 1,000-page book just to find one specific sentence. It's slow, expensive, and most of the detectives are just reading the same boring parts over and over again.

This paper introduces a smarter way to handle these "long movies" using a new type of AI architecture called a Hybrid Model (mixing standard Transformers with a new technology called Mamba). Here is the simple breakdown of their solution:

1. The Problem: The "Too Many Clues" Dilemma

When an AI watches a long video, it turns the video into thousands of tiny "tokens" (like digital puzzle pieces).

  • The Old Way: Most AI models try to keep all the pieces. If the video is long, the computer chokes.
  • The "Pruning" Attempt: Previous methods tried to throw away the "boring" pieces early on. But they made a mistake: they threw away pieces too quickly, before the AI had a chance to realize that a "boring" piece in the first minute might be crucial for understanding the plot in the last hour.
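To see why keeping all the pieces is hopeless, here is a back-of-the-envelope token count, using hypothetical but typical values (1 sampled frame per second and 196 patch tokens per frame; the paper's exact settings may differ):

```python
# Rough token count for a 2-hour video, under assumed sampling settings:
fps_sampled = 1               # frames sampled per second (assumption)
tokens_per_frame = 196        # e.g. a 14x14 patch grid (assumption)
video_seconds = 2 * 60 * 60   # a 2-hour movie

total_tokens = fps_sampled * video_seconds * tokens_per_frame
print(total_tokens)  # 1411200 visual tokens before any pruning
```

Over a million tokens per question is why models either choke or resort to aggressive early pruning.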

2. The Discovery: The "Unreliable Librarian"

The researchers noticed something interesting about how these AI models think:

  • Layer 1 (The Librarian): When the AI first sees the video, it's not sure which clues are important. It's like a librarian who hasn't read the book yet; if you ask them, "Which pages matter?" they might guess wrong.
  • Layer 10 (The Expert): As the AI processes the video deeper, it starts to understand the context. Now it knows exactly which clues matter.
  • The Hybrid Twist: The new "Hybrid" models have a special feature: a short-term memory (the Mamba part). Even if you throw away a piece of the puzzle, this memory keeps a "summary" of it. It's like taking a photo of a page and then throwing the page away; the photo is still in your pocket.
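The "photo in your pocket" idea can be made concrete with a toy linear recurrence. This is a stand-in for a Mamba/state-space block, not the paper's exact update rule: the recurrent state keeps a decayed summary of every token it has scanned, even tokens that are pruned afterwards.

```python
import numpy as np

# Toy linear recurrence: state = decay * state + token.
# The decay value is an illustrative assumption.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))   # 5 visual tokens, hidden dim 4
decay = 0.9                        # per-step state decay (assumption)

state = np.zeros(4)
for t in tokens:                   # scan over ALL tokens once
    state = decay * state + t

# Prune token 2 from the explicit sequence: it is gone from the list...
kept = np.delete(tokens, 2, axis=0)
print(kept.shape)                  # (4, 4): one fewer explicit token

# ...but the state still carries its decayed contribution, like the
# photo of the page you threw away:
state_without_t2 = sum(decay ** (4 - i) * tokens[i]
                       for i in range(5) if i != 2)
assert np.allclose(state, state_without_t2 + decay ** 2 * tokens[2])
```

This is why pruning is safer in hybrid models: deleting a token from the sequence does not erase its trace from the state.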

3. The Solution: The "Slow-Release" Strategy

Instead of throwing away 75% of the video tokens immediately (which causes the AI to lose track of the story and answer poorly), the authors propose a Progressive Reduction strategy. Think of it like a sieve that gets tighter as you go down:

  • At the Top (Early Layers): Keep almost all the tokens. Let the AI's "memory" soak up the information. Don't throw anything away yet because the AI isn't sure what's important.
  • In the Middle: Start gently removing the obvious "boring" stuff.
  • At the Bottom (Late Layers): Now that the AI has a full understanding of the story, it can confidently say, "Okay, we definitely don't need 75% of these tokens anymore." It keeps only the most critical clues.
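The tightening sieve can be sketched as a keep-ratio schedule over layers. The shape below (full retention early, a linear taper down to 25% late) is a hypothetical illustration, not the paper's exact curve:

```python
# Sketch of a progressive reduction schedule (illustrative, not the
# paper's exact schedule): keep everything in early layers, then taper
# down to 25% of visual tokens by the final layer.
def keep_ratio(layer: int, num_layers: int,
               start: float = 1.0, end: float = 0.25,
               warmup_frac: float = 0.25) -> float:
    """Fraction of visual tokens kept at a given layer."""
    warmup = int(num_layers * warmup_frac)
    if layer < warmup:             # early layers: prune nothing
        return start
    # Linear taper from `start` to `end` over the remaining layers.
    frac = (layer - warmup) / max(1, num_layers - 1 - warmup)
    return start + frac * (end - start)

for layer in [0, 8, 16, 24, 31]:   # assuming a 32-layer model
    print(f"layer {layer:2d}: keep {keep_ratio(layer, 32):.0%}")
```

Contrast this with one-shot pruning, which would jump straight to 25% at the first layer, before the model knows what matters.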

4. The "Secret Sauce": A Universal Scorecard

To make this work, they needed a way to decide which tokens to keep.

  • For the attention (Transformer) parts, they used a familiar method: "What does the text question care about?"
  • For the new "Memory" parts (Mamba), they invented a translator. They figured out how to ask the memory blocks, "What is important?" in a language they understand. This allows them to prune tokens inside every part of the AI, not just the standard parts.
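The idea behind the scorecard can be sketched with a single dot-product proxy: score each visual token by how well it aligns with the question's embedding, then keep the top scorers. This is a simplification; the paper derives block-specific scores for both attention and Mamba layers, while this sketch uses one shared score:

```python
import numpy as np

# Language-aware token scoring, simplified to a dot-product relevance
# proxy (illustrative; all shapes and values are assumptions).
rng = np.random.default_rng(1)
visual_tokens = rng.normal(size=(100, 64))  # 100 tokens, dim 64
text_query = rng.normal(size=64)            # pooled question embedding

scores = visual_tokens @ text_query         # relevance of each token
keep = 25                                   # e.g. a 75% reduction
kept_idx = np.argsort(scores)[-keep:]       # indices of top-25 tokens
pruned = visual_tokens[np.sort(kept_idx)]   # keep temporal order
print(pruned.shape)  # (25, 64)
```

Every kept token scores at least as high as every dropped one, so the question, not a fixed heuristic, decides what survives.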

The Result: Speed Without Losing the Plot

By using this "slow-release" strategy on the Hybrid model:

  • Speed: The AI processes long videos 4 times faster. It's like going from a slow train to a high-speed bullet train.
  • Accuracy: It doesn't lose the plot. In fact, because the AI had time to "think" before it started deleting, it actually answers questions better than before, especially on very long videos.
  • Efficiency: It uses much less computer power and energy.

The Analogy in a Nutshell

Imagine you are packing for a trip.

  • Old Method: You throw away 75% of your clothes in the first 5 minutes of packing. You might throw away your only warm coat because you thought it was "boring" at the time.
  • This Paper's Method: You lay out all your clothes. You pack the essentials first. As you get closer to the suitcase closing, you realize, "Oh, I don't need these 5 pairs of socks." You remove them at the end, after you've made sure you have everything you need.

In short: This paper teaches AI how to watch long videos faster by waiting until it understands the story before it starts deleting the "boring" parts, resulting in a super-fast, super-smart video watcher.