Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

This paper proposes AOT, a training-free method that optimizes token reduction in Video Large Language Models. It establishes local and global token anchors and aggregates informative contexts via optimal transport, efficiently eliminating redundancy while preserving spatiotemporal fidelity.

Jinlong Li, Liyuan Jiang, Haonan Zhang, Nicu Sebe

Published 2026-03-03

Imagine you are trying to describe a two-hour movie to a friend, but you only have enough time to say a few sentences. If you try to describe every single frame, every background detail, and every movement, you'll run out of time before you even get to the plot.

This is exactly the problem Video Large Language Models (VLLMs) face today. These are large AI models that can "watch" videos and answer questions about them. However, a video is made of thousands of frames, and each frame is broken down into thousands of tiny pieces of data called "tokens." Trying to process all of them is like trying to drink from a firehose—it's slow, expensive, and wastes a lot of energy.

Current methods try to fix this by simply throwing away the "boring" parts of the video or gluing similar parts together. But this is like throwing away the background scenery in a movie because it looks "static," only to realize later that the background held a crucial clue to the mystery.

This paper introduces a new method called AOT (Anchors via Optimal Transport). Here is how it works, using some everyday analogies:

1. The Problem: The "Firehose" of Information

Think of a video as a massive library of books (tokens). Most of these books are just copies of the same story with slightly different fonts.

  • Old Methods: They pick a few random books, throw the rest in the trash, or glue 100 copies of the same book into one thick, messy volume. The result? You lose the subtle details, and the AI gets confused.

2. The Solution: The "Smart Curator" (AOT)

Instead of just deleting or gluing, AOT acts like a super-smart museum curator. It doesn't just pick the "best" paintings; it figures out how to take the essence of the paintings it's removing and paint that essence onto the ones it keeps.

Here is the step-by-step process:

Step A: Picking the "Anchors" (The VIPs)

First, the AI looks at a single frame of the video. It needs to decide which pieces of information are the most important to keep.

  • The Local Anchor: It looks at small neighborhoods in the image (like a grid) to make sure it keeps a bit of everything (the sky, the person, the table).
  • The Global Anchor: It looks at the whole picture to see what the "main character" or the most important object is.
  • The Result: It selects a small group of "VIP tokens" (Anchors) to stay. These are the representatives of the scene.
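For the curious, here is a tiny NumPy sketch of what anchor picking could look like. The grid size, the scoring rules (token norm for local saliency, similarity to the frame's mean token for global importance), and the function name are all illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def select_anchors(tokens, grid=4, n_global=8):
    """Pick anchor ("VIP") tokens from one frame.

    tokens: (H*W, D) array of patch embeddings (square frame assumed).
    grid: split the frame into grid x grid neighborhoods (local anchors).
    n_global: extra anchors scored against the whole frame (global anchors).
    """
    n = tokens.shape[0]
    side = int(np.sqrt(n))          # patches per row
    cell = side // grid             # patches per neighborhood side

    # Local anchors: the most salient token (largest norm) in each cell,
    # so every neighborhood (sky, person, table) keeps a representative.
    norms = np.linalg.norm(tokens, axis=1)
    local_idx = []
    for gy in range(grid):
        for gx in range(grid):
            rows = np.arange(gy * cell, (gy + 1) * cell)
            cols = np.arange(gx * cell, (gx + 1) * cell)
            cell_idx = (rows[:, None] * side + cols[None, :]).ravel()
            local_idx.append(cell_idx[np.argmax(norms[cell_idx])])

    # Global anchors: tokens most similar to the frame-level mean token,
    # a rough proxy for "the main subject of the whole picture."
    frame_mean = tokens.mean(axis=0)
    scores = tokens @ frame_mean
    global_idx = np.argsort(scores)[-n_global:]

    return np.unique(np.concatenate([local_idx, global_idx]))
```

A 24x24-patch frame with `grid=4` keeps at most 16 local plus 8 global anchors—a tiny fraction of the 576 original tokens.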

Step B: The "Moving Company" (Optimal Transport)

This is the magic part. The AI has to deal with all the tokens it didn't pick (the ones it was going to delete).

  • Old Way: Just delete them. Poof. Gone.
  • AOT Way: It uses a mathematical tool called Optimal Transport. Imagine the unselected tokens are moving trucks full of valuable cargo (information), and the selected "Anchor" tokens are warehouses.
  • The AI calculates the most efficient route to move the cargo from the trucks to the warehouses. It doesn't just dump the cargo; it carefully blends the information from the "trash" tokens into the "VIP" tokens.
  • The Metaphor: If you have a bucket of water (the video) and you want to pour it into a smaller cup (the compressed video), you don't just throw away half the water. You use a special funnel (Optimal Transport) to ensure every drop of flavor from the discarded water is perfectly transferred into the cup.
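The "moving company" step can be sketched with the standard entropic (Sinkhorn) approximation of optimal transport. This is a generic illustration of the technique, assuming uniform mass on both sides and a simple 50/50 blend at the end—the paper's actual cost function and blending weights may differ:

```python
import numpy as np

def merge_via_ot(anchors, others, n_iter=50, eps=0.05):
    """Blend discarded tokens into anchors via entropic optimal transport.

    anchors: (K, D) kept "warehouse" tokens.
    others:  (M, D) "moving truck" tokens slated for removal.
    Returns the K anchors enriched with cargo from `others`.
    """
    # Cost = squared distance from each discarded token to each anchor,
    # normalized so the Gibbs kernel stays well-conditioned.
    cost = ((others[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    kernel = np.exp(-(cost / cost.max()) / eps)

    # Sinkhorn iterations with uniform mass on trucks and warehouses.
    a = np.full(others.shape[0], 1.0 / others.shape[0])
    b = np.full(anchors.shape[0], 1.0 / anchors.shape[0])
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]     # (M, K) transport plan

    # Each anchor absorbs discarded tokens in proportion to the plan.
    weights = plan / plan.sum(axis=0, keepdims=True)
    absorbed = weights.T @ others               # (K, D)
    return 0.5 * (anchors + absorbed)           # blend ratio is an assumption
```

The transport plan is exactly the "efficient route": every discarded token's information ends up distributed across the anchors it is closest to, instead of being thrown away.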

Step C: The "Time Traveler" (Inter-Frame)

Videos move! The next step is handling the time between frames.

  • Imagine a video of a person walking. Frame 1 and Frame 2 look almost identical.
  • AOT treats the first frame of a short clip as the "Captain." It asks the Captain: "Did anything change in the next frame?"
  • If the next frame is just the Captain taking a step, AOT says, "Okay, that's just a small update," and blends that update into the Captain's data.
  • If the next frame shows the Captain suddenly turning into a superhero, AOT says, "Whoa! That's a big change!" and keeps that new frame as a separate, special token to preserve the action.
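The "Captain" logic above boils down to a similarity test per frame. Here is a minimal sketch, assuming mean token-wise cosine similarity as the change detector and an arbitrary 0.9 threshold (both are illustrative choices, not the paper's):

```python
import numpy as np

def merge_frames(frames, threshold=0.9):
    """Fold near-duplicate frames into the clip's first ("captain") frame.

    frames: list of (K, D) token arrays, one per frame of a short clip.
    threshold: cosine similarity above which a frame counts as a
               "small update" to the captain (assumed value).
    """
    captain = frames[0].astype(float).copy()
    kept = [captain]
    for frame in frames[1:]:
        # Mean cosine similarity between matching tokens of the two frames.
        num = (captain * frame).sum(-1)
        den = np.linalg.norm(captain, axis=-1) * np.linalg.norm(frame, axis=-1)
        sim = float((num / den).mean())
        if sim >= threshold:
            # Small change (the captain took a step): blend it in.
            captain += 0.5 * (frame - captain)
        else:
            # Big change (the captain turned into a superhero): keep it.
            captain = frame.astype(float).copy()
            kept.append(captain)
    return kept
```

A clip of ten nearly identical frames collapses into a single updated captain, while a scene cut starts a fresh one—that is how the flow of time survives the compression.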

Why is this a Big Deal?

  • Efficiency: It reduces the amount of data the AI has to process by 90%. It's like shrinking a 4K movie down to a tiny file without losing the plot.
  • Speed: Because there is less data, the AI answers questions much faster and uses less electricity.
  • Quality: Unlike other methods that make the video "blurry" or "forgetful," this method keeps the temporal fidelity (the flow of time) and visual fidelity (the details) intact.

The Bottom Line

This paper teaches the AI how to be a better editor. Instead of just cutting scenes out of a movie, it teaches the AI how to summarize the cut scenes and weave their most important details into the scenes that remain.

The result? A video AI that is fast, cheap to run, and still remembers exactly what happened in the video, even when it's only looking at a tiny fraction of the original data.