Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

This paper introduces CaCoVID, a reinforcement learning-based token compression framework for video large language models. Instead of relying on attention scores, it selects tokens by explicitly maximizing their contribution to correct predictions, significantly reducing computational overhead while maintaining performance.

Yinchao Ma, Qiang Zhou, Zhibin Wang, Xianing Chen, Hanqing Yang, Jun Song, Bo Zheng

Published 2026-03-03

Imagine you are trying to explain a 2-hour movie to a friend who only has 5 minutes to listen. You can't possibly tell them every single detail, every background extra, and every second of silence. You have to be a smart editor, picking out only the most crucial scenes and lines of dialogue that actually answer your friend's question.

This is exactly the problem the paper CaCoVID solves, but for Artificial Intelligence (AI) watching videos.

The Problem: The AI is Drowning in Data

Modern AI models (called Video Large Language Models) are incredibly smart. They can watch videos and answer questions like, "What is the man wearing?" or "Why did the girl laugh?"

However, to understand a video, the AI breaks it down into thousands of tiny pieces called tokens (think of these as individual puzzle pieces or tiny snapshots).

  • The Issue: A 1-minute video might generate 6,000+ of these tokens. Processing all of them is like trying to read a whole encyclopedia just to find one specific fact. It's slow, expensive, and wastes a lot of computer power.
  • The Old Way: Previous methods tried to cut down the tokens by looking at "Attention Scores." Imagine the AI giving a "popularity vote" to every token. The old rule was: "Keep the tokens with the most votes."
  • The Flaw: The paper points out that popularity doesn't equal importance. Just because a token gets a lot of attention doesn't mean it helps answer the question.
    • Analogy: Imagine a detective asking, "Who stole the cookie?" The AI might give a huge "vote" to a token showing a red wall in the background because it's bright and obvious. But the red wall has nothing to do with the cookie! The AI is distracted by the "loud" noise and misses the quiet, critical clue (the cookie jar).
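In code, that "popularity vote" is just a top-k cut over attention scores. A minimal sketch of the old approach (the scores and token indices are made up for illustration):

```python
def attention_topk_prune(attn_scores, keep_ratio=0.25):
    """Keep the tokens with the highest attention scores (the 'popularity vote').

    attn_scores: list with one aggregated attention score per video token.
    Returns the indices of the kept tokens, in original order.
    """
    n_keep = max(1, int(len(attn_scores) * keep_ratio))
    # Rank tokens by score and keep the top-n "most popular" ones.
    ranked = sorted(range(len(attn_scores)), key=lambda i: attn_scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# Toy example: a bright "red wall" token can outscore the real clue.
scores = [0.9, 0.1, 0.2, 0.05]   # token 0 = red wall, token 3 = cookie jar
print(attention_topk_prune(scores, keep_ratio=0.5))  # → [0, 2]; the clue at index 3 is dropped
```

Notice that the question never enters the function: the same tokens survive whether you ask about the cookie or the wall. That is exactly the flaw CaCoVID targets.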

The Solution: CaCoVID (The Smart Editor)

The authors created a new system called CaCoVID. Instead of just looking at which tokens are "loud," they teach the AI to ask: "Which tokens actually help me get the right answer?"

Here is how they did it, using a creative analogy:

1. The Reinforcement Learning "Game"

Instead of hard-coding rules, they taught a small "Policy Network" (a mini-AI coach) how to edit the video.

  • The Game: The coach is given a video and a question. It has to pick a small group of tokens to keep.
  • The Reward: If the main AI (using only those picked tokens) gets the answer right, the coach gets a "Gold Star" (Reward). If it gets it wrong, the coach gets a "Red Card."
  • The Result: The coach learns through trial and error to ignore the "red walls" and focus on the "cookie jar." It shifts from passively keeping popular tokens to actively hunting for the right ones.
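The "game" above can be sketched as a single REINFORCE-style update: the coach keeps a per-token score, samples a keep/drop mask, and nudges the scores based on whether the main model answered correctly. This is a loose stdlib-only illustration, not the paper's actual policy network or reward design:

```python
import math
import random

def reinforce_step(logits, answer_correct, lr=0.1):
    """One REINFORCE update for a per-token keep/drop policy.

    logits: list of keep-scores, one per token (the coach's current opinions).
    answer_correct: did the main model answer correctly using only the kept tokens?
    Returns (mask, updated_logits). The +1/-1 reward is an illustrative assumption.
    """
    probs = [1 / (1 + math.exp(-z)) for z in logits]         # sigmoid: keep probability
    mask = [1 if random.random() < p else 0 for p in probs]  # sample keep (1) / drop (0)
    reward = 1.0 if answer_correct else -1.0                 # gold star vs. red card
    # Gradient of log Bernoulli(mask; p) w.r.t. the logit is (mask - p);
    # move logits in the direction that makes rewarded choices more likely.
    new_logits = [z + lr * reward * (m - p) for z, p, m in zip(logits, probs, mask)]
    return mask, new_logits

random.seed(0)
logits = [0.0, 0.0, 0.0]   # start undecided about 3 toy tokens
mask, logits = reinforce_step(logits, answer_correct=True)
# Tokens that were kept when the answer was right get a higher keep-logit next time.
```

Over many videos and questions, this trial-and-error loop is what shifts the coach from "keep what's popular" to "keep what earns the gold star."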

2. The "Combinatorial" Challenge (The Needle in a Haystack)

Here is the tricky part. If you have 1,000 tokens and need to pick 100, there are more possible combinations than there are atoms in the universe. Trying every combination is impossible.
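That claim is easy to verify: the number of ways to choose 100 tokens out of 1,000 is a binomial coefficient, and it dwarfs the roughly 10^80 atoms in the observable universe.

```python
import math

# 1,000 choose 100: the number of distinct 100-token subsets.
n_subsets = math.comb(1000, 100)
print(n_subsets > 10**80)  # → True: far more subsets than atoms in the observable universe
```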

  • The Naive Way: Randomly guessing combinations is like trying to find a specific needle in a haystack by throwing darts blindly. You'll almost never hit it.
  • The CaCoVID Trick (OCSS): They invented a clever sampling strategy called Online Combinatorial Space Sampling.
    • Analogy: Imagine the haystack is sorted into 100 small bales. The coach first looks at the bales and guesses which ones might have the needle. Then, it only searches inside those specific bales.
    • This drastically shrinks the search space. It stops the AI from wasting time on useless combinations and helps it learn much faster.
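The "sorted bales" idea can be sketched as two-stage sampling: rank coarse groups of tokens first, then only sample combinations inside the promising groups. This is a loose illustration of the analogy, not the paper's exact OCSS algorithm; the index-based grouping and mean-score ranking are assumptions made for the sketch:

```python
import random

def two_stage_sample(token_scores, num_groups=10, top_groups=3, per_group=5):
    """Shrink the combinatorial search: pick promising groups first, then sample inside them.

    token_scores: one keep-score per token (here, the coach's guesses).
    Returns a sorted list of sampled token indices.
    """
    n = len(token_scores)
    size = n // num_groups
    groups = [list(range(i * size, (i + 1) * size)) for i in range(num_groups)]
    # Stage 1: rank the groups ("bales") by mean keep-score.
    ranked = sorted(groups, key=lambda g: sum(token_scores[i] for i in g) / len(g), reverse=True)
    # Stage 2: only sample token combinations inside the most promising groups.
    picked = []
    for g in ranked[:top_groups]:
        picked.extend(random.sample(g, min(per_group, len(g))))
    return sorted(picked)

random.seed(0)
scores = [random.random() for _ in range(100)]   # stand-in keep-scores for 100 tokens
subset = two_stage_sample(scores)
print(len(subset))  # → 15 tokens, instead of exploring all C(100, 15) combinations blindly
```

The design choice is the point: by committing to a few promising groups before sampling, the number of combinations the learner ever sees collapses from astronomical to manageable.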

Why This Matters

The paper shows that CaCoVID is a game-changer for three reasons:

  1. It's Smarter: It keeps the tokens that actually matter for the specific question, not just the "loud" ones.
  2. It's Faster: By cutting out 75% to 90% of the video data, the AI runs much faster and uses less electricity.
  3. It's Flexible: It works with different types of AI models without needing to retrain the whole massive system. It's like adding a smart filter to a camera lens rather than replacing the whole camera.

The Bottom Line

Think of CaCoVID as a super-efficient film editor for AI.

  • Old AI: "I'll show you every frame of the movie, even the boring parts, just in case you need them." (Slow, expensive, confusing).
  • CaCoVID AI: "You asked about the chase scene? Here are the 50 most important frames of the chase. I ignored the scenery and the background music because they don't help you understand the chase." (Fast, cheap, and accurate).

This allows AI to understand long videos in real-time, making it practical for real-world applications like video assistants, security monitoring, and automated content analysis.