Stateful Token Reduction for Long-Video Hybrid VLMs

This paper proposes a stateful, progressive token-reduction framework with a unified language-aware scoring mechanism for hybrid video VLMs. By accounting for the layerwise instability of token importance in architectures that combine attention and state-space blocks, it achieves significant prefilling speedups while maintaining near-baseline accuracy.

Jindong Jiang, Amala Sanjay Deshmukh, Kateryna Chumachenko, Karan Sapra, Zhiding Yu, Guilin Liu, Andrew Tao, Pavlo Molchanov, Jan Kautz, Wonmin Byeon

Published 2026-03-03

Imagine you are trying to watch a 2-hour movie to answer a single question: "What color was the protagonist's hat in the third scene?"

Currently, AI models (Vision-Language Models) try to watch the entire movie frame-by-frame, analyzing every single pixel of every single second. It's like hiring a team of 10,000 detectives to read every word of a 1,000-page book just to find one specific sentence. It's slow, expensive, and most of the detectives are just reading the same boring parts over and over again.

This paper introduces a smarter way to handle these "long movies" using a new type of AI architecture called a Hybrid Model (mixing standard Transformers with a new technology called Mamba). Here is the simple breakdown of their solution:

1. The Problem: The "Too Many Clues" Dilemma

When an AI watches a long video, it turns the video into thousands of tiny "tokens" (like digital puzzle pieces).

  • The Old Way: Most AI models try to keep all the pieces. If the video is long, the computer chokes.
  • The "Pruning" Attempt: Previous methods tried to throw away the "boring" pieces early on. But they made a mistake: they threw away pieces too quickly, before the AI had a chance to realize that a "boring" piece in the first minute might be crucial for understanding the plot in the last hour.
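To see why keeping all the pieces is hopeless, here is a back-of-the-envelope token count, using hypothetical but typical values (1 sampled frame per second and 196 patch tokens per frame; the paper's exact settings may differ):

```python
# Rough token count for a 2-hour video, under assumed sampling settings:
fps_sampled = 1               # frames sampled per second (assumption)
tokens_per_frame = 196        # e.g. a 14x14 patch grid (assumption)
video_seconds = 2 * 60 * 60   # a 2-hour movie

total_tokens = fps_sampled * video_seconds * tokens_per_frame
print(total_tokens)  # 1411200 visual tokens before any pruning
```

Over a million tokens per question is why models either choke or resort to aggressive early pruning.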

2. The Discovery: The "Unreliable Librarian"

The researchers noticed something interesting about how these AI models think:

  • Layer 1 (The Librarian): When the AI first sees the video, it's not sure which clues are important. It's like a librarian who hasn't read the book yet; if you ask them, "Which pages matter?" they might guess wrong.
  • Layer 10 (The Expert): As the AI processes the video deeper, it starts to understand the context. Now it knows exactly which clues matter.
  • The Hybrid Twist: The new "Hybrid" models have a special feature: a short-term memory (the Mamba part). Even if you throw away a piece of the puzzle, this memory keeps a "summary" of it. It's like taking a photo of a page and then throwing the page away; the photo is still in your pocket.
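The "photo in your pocket" idea can be made concrete with a toy linear recurrence. This is a stand-in for a Mamba/state-space block, not the paper's exact update rule: the recurrent state keeps a decayed summary of every token it has scanned, even tokens that are pruned afterwards.

```python
import numpy as np

# Toy linear recurrence: state = decay * state + token.
# The decay value is an illustrative assumption.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))   # 5 visual tokens, hidden dim 4
decay = 0.9                        # per-step state decay (assumption)

state = np.zeros(4)
for t in tokens:                   # scan over ALL tokens once
    state = decay * state + t

# Prune token 2 from the explicit sequence: it is gone from the list...
kept = np.delete(tokens, 2, axis=0)
print(kept.shape)                  # (4, 4): one fewer explicit token

# ...but the state still carries its decayed contribution, like the
# photo of the page you threw away:
state_without_t2 = sum(decay ** (4 - i) * tokens[i]
                       for i in range(5) if i != 2)
assert np.allclose(state, state_without_t2 + decay ** 2 * tokens[2])
```

This is why pruning is safer in hybrid models: deleting a token from the sequence does not erase its trace from the state.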

3. The Solution: The "Slow-Release" Strategy

Instead of throwing away 75% of the video tokens immediately (which causes the AI to lose track of the story and answer poorly), the authors propose a Progressive Reduction strategy. Think of it like a sieve that gets tighter as you go down:

  • At the Top (Early Layers): Keep almost all the tokens. Let the AI's "memory" soak up the information. Don't throw anything away yet because the AI isn't sure what's important.
  • In the Middle: Start gently removing the obvious "boring" stuff.
  • At the Bottom (Late Layers): Now that the AI has a full understanding of the story, it can confidently say, "Okay, we definitely don't need 75% of these tokens anymore." It keeps only the most critical clues.
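The tightening sieve can be sketched as a keep-ratio schedule over layers. The shape below (full retention early, a linear taper down to 25% late) is a hypothetical illustration, not the paper's exact curve:

```python
# Sketch of a progressive reduction schedule (illustrative, not the
# paper's exact schedule): keep everything in early layers, then taper
# down to 25% of visual tokens by the final layer.
def keep_ratio(layer: int, num_layers: int,
               start: float = 1.0, end: float = 0.25,
               warmup_frac: float = 0.25) -> float:
    """Fraction of visual tokens kept at a given layer."""
    warmup = int(num_layers * warmup_frac)
    if layer < warmup:             # early layers: prune nothing
        return start
    # Linear taper from `start` to `end` over the remaining layers.
    frac = (layer - warmup) / max(1, num_layers - 1 - warmup)
    return start + frac * (end - start)

for layer in [0, 8, 16, 24, 31]:   # assuming a 32-layer model
    print(f"layer {layer:2d}: keep {keep_ratio(layer, 32):.0%}")
```

Contrast this with one-shot pruning, which would jump straight to 25% at the first layer, before the model knows what matters.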

4. The "Secret Sauce": A Universal Scorecard

To make this work, they needed a way to decide which tokens to keep.

  • For the attention (Transformer) parts, they used a familiar method: "What does the text question care about?"
  • For the new "Memory" parts (Mamba), they invented a translator. They figured out how to ask the memory blocks, "What is important?" in a language they understand. This allows them to prune tokens inside every part of the AI, not just the standard parts.
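The idea behind the scorecard can be sketched with a single dot-product proxy: score each visual token by how well it aligns with the question's embedding, then keep the top scorers. This is a simplification; the paper derives block-specific scores for both attention and Mamba layers, while this sketch uses one shared score:

```python
import numpy as np

# Language-aware token scoring, simplified to a dot-product relevance
# proxy (illustrative; all shapes and values are assumptions).
rng = np.random.default_rng(1)
visual_tokens = rng.normal(size=(100, 64))  # 100 tokens, dim 64
text_query = rng.normal(size=64)            # pooled question embedding

scores = visual_tokens @ text_query         # relevance of each token
keep = 25                                   # e.g. a 75% reduction
kept_idx = np.argsort(scores)[-keep:]       # indices of top-25 tokens
pruned = visual_tokens[np.sort(kept_idx)]   # keep temporal order
print(pruned.shape)  # (25, 64)
```

Every kept token scores at least as high as every dropped one, so the question, not a fixed heuristic, decides what survives.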

The Result: Speed Without Losing the Plot

By using this "slow-release" strategy on the Hybrid model:

  • Speed: The AI processes long videos 4 times faster. It's like going from a slow train to a high-speed bullet train.
  • Accuracy: It doesn't lose the plot. In fact, because the AI had time to "think" before it started deleting, it actually answers questions better than before, especially on very long videos.
  • Efficiency: It uses much less computer power and energy.

The Analogy in a Nutshell

Imagine you are packing for a trip.

  • Old Method: You throw away 75% of your clothes in the first 5 minutes of packing. You might throw away your only warm coat because you thought it was "boring" at the time.
  • This Paper's Method: You lay out all your clothes. You pack the essentials first. As you get closer to the suitcase closing, you realize, "Oh, I don't need these 5 pairs of socks." You remove them at the end, after you've made sure you have everything you need.

In short: This paper teaches AI how to watch long videos faster by waiting until it understands the story before it starts deleting the "boring" parts, resulting in a super-fast, super-smart video watcher.