Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

This paper proposes AOT, a training-free method that optimizes token reduction in Video Large Language Models. It establishes local and global token anchors and aggregates informative contexts via optimal transport, efficiently eliminating redundancy while preserving spatiotemporal fidelity.

Jinlong Li, Liyuan Jiang, Haonan Zhang, Nicu Sebe

Published 2026-03-03

Imagine you are trying to describe a two-hour movie to a friend, but you only have enough time to say a few sentences. If you try to describe every single frame, every background detail, and every movement, you'll run out of time before you even get to the plot.

This is exactly the problem Video Large Language Models (VLLMs) face today. These are large AI models that can "watch" videos and answer questions about them. However, a video is made of thousands of frames, and each frame is broken down into thousands of tiny pieces of data called "tokens." Trying to process all of them is like trying to drink from a firehose—it's slow, expensive, and wastes a lot of energy.

Current methods try to fix this by simply throwing away the "boring" parts of the video or gluing similar parts together. But this is like throwing away the background scenery in a movie because it looks "static," only to realize later that the background held a crucial clue to the mystery.

This paper introduces a new method called AOT (Anchors via Optimal Transport). Here is how it works, using some everyday analogies:

1. The Problem: The "Firehose" of Information

Think of a video as a massive library of books (tokens). Most of these books are just copies of the same story with slightly different fonts.

  • Old Methods: They pick a few random books, throw the rest in the trash, or glue 100 copies of the same book into one thick, messy volume. The result? You lose the subtle details, and the AI gets confused.

2. The Solution: The "Smart Curator" (AOT)

Instead of just deleting or gluing, AOT acts like a super-smart museum curator. It doesn't just pick the "best" paintings; it figures out how to take the essence of the paintings it's removing and paint that essence onto the ones it keeps.

Here is the step-by-step process:

Step A: Picking the "Anchors" (The VIPs)

First, the AI looks at a single frame of the video. It needs to decide which pieces of information are the most important to keep.

  • The Local Anchor: It looks at small neighborhoods in the image (like a grid) to make sure it keeps a bit of everything (the sky, the person, the table).
  • The Global Anchor: It looks at the whole picture to see what the "main character" or the most important object is.
  • The Result: It selects a small group of "VIP tokens" (Anchors) to stay. These are the representatives of the scene.
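For the curious, here is a tiny NumPy sketch of what anchor picking could look like. The grid size, the scoring rules (token norm for local saliency, similarity to the frame's mean token for global importance), and the function name are all illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def select_anchors(tokens, grid=4, n_global=8):
    """Pick anchor ("VIP") tokens from one frame.

    tokens: (H*W, D) array of patch embeddings (square frame assumed).
    grid: split the frame into grid x grid neighborhoods (local anchors).
    n_global: extra anchors scored against the whole frame (global anchors).
    """
    n = tokens.shape[0]
    side = int(np.sqrt(n))          # patches per row
    cell = side // grid             # patches per neighborhood side

    # Local anchors: the most salient token (largest norm) in each cell,
    # so every neighborhood (sky, person, table) keeps a representative.
    norms = np.linalg.norm(tokens, axis=1)
    local_idx = []
    for gy in range(grid):
        for gx in range(grid):
            rows = np.arange(gy * cell, (gy + 1) * cell)
            cols = np.arange(gx * cell, (gx + 1) * cell)
            cell_idx = (rows[:, None] * side + cols[None, :]).ravel()
            local_idx.append(cell_idx[np.argmax(norms[cell_idx])])

    # Global anchors: tokens most similar to the frame-level mean token,
    # a rough proxy for "the main subject of the whole picture."
    frame_mean = tokens.mean(axis=0)
    scores = tokens @ frame_mean
    global_idx = np.argsort(scores)[-n_global:]

    return np.unique(np.concatenate([local_idx, global_idx]))
```

A 24x24-patch frame with `grid=4` keeps at most 16 local plus 8 global anchors—a tiny fraction of the 576 original tokens.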

Step B: The "Moving Company" (Optimal Transport)

This is the magic part. The AI has to deal with all the tokens it didn't pick (the ones it was going to delete).

  • Old Way: Just delete them. Poof. Gone.
  • AOT Way: It uses a mathematical tool called Optimal Transport. Imagine the unselected tokens are moving trucks full of valuable cargo (information), and the selected "Anchor" tokens are warehouses.
  • The AI calculates the most efficient route to move the cargo from the trucks to the warehouses. It doesn't just dump the cargo; it carefully blends the information from the "trash" tokens into the "VIP" tokens.
  • The Metaphor: If you have a bucket of water (the video) and you want to pour it into a smaller cup (the compressed video), you don't just throw away half the water. You use a special funnel (Optimal Transport) to ensure every drop of flavor from the discarded water is perfectly transferred into the cup.
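The "moving company" step can be sketched with the standard entropic (Sinkhorn) approximation of optimal transport. This is a generic illustration of the technique, assuming uniform mass on both sides and a simple 50/50 blend at the end—the paper's actual cost function and blending weights may differ:

```python
import numpy as np

def merge_via_ot(anchors, others, n_iter=50, eps=0.05):
    """Blend discarded tokens into anchors via entropic optimal transport.

    anchors: (K, D) kept "warehouse" tokens.
    others:  (M, D) "moving truck" tokens slated for removal.
    Returns the K anchors enriched with cargo from `others`.
    """
    # Cost = squared distance from each discarded token to each anchor,
    # normalized so the Gibbs kernel stays well-conditioned.
    cost = ((others[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    kernel = np.exp(-(cost / cost.max()) / eps)

    # Sinkhorn iterations with uniform mass on trucks and warehouses.
    a = np.full(others.shape[0], 1.0 / others.shape[0])
    b = np.full(anchors.shape[0], 1.0 / anchors.shape[0])
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]     # (M, K) transport plan

    # Each anchor absorbs discarded tokens in proportion to the plan.
    weights = plan / plan.sum(axis=0, keepdims=True)
    absorbed = weights.T @ others               # (K, D)
    return 0.5 * (anchors + absorbed)           # blend ratio is an assumption
```

The transport plan is exactly the "efficient route": every discarded token's information ends up distributed across the anchors it is closest to, instead of being thrown away.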

Step C: The "Time Traveler" (Inter-Frame)

Videos move! The next step is handling the time between frames.

  • Imagine a video of a person walking. Frame 1 and Frame 2 look almost identical.
  • AOT treats the first frame of a short clip as the "Captain." It asks the Captain: "Did anything change in the next frame?"
  • If the next frame is just the Captain taking a step, AOT says, "Okay, that's just a small update," and blends that update into the Captain's data.
  • If the next frame shows the Captain suddenly turning into a superhero, AOT says, "Whoa! That's a big change!" and keeps that new frame as a separate, special token to preserve the action.
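The "Captain" logic above boils down to a similarity test per frame. Here is a minimal sketch, assuming mean token-wise cosine similarity as the change detector and an arbitrary 0.9 threshold (both are illustrative choices, not the paper's):

```python
import numpy as np

def merge_frames(frames, threshold=0.9):
    """Fold near-duplicate frames into the clip's first ("captain") frame.

    frames: list of (K, D) token arrays, one per frame of a short clip.
    threshold: cosine similarity above which a frame counts as a
               "small update" to the captain (assumed value).
    """
    captain = frames[0].astype(float).copy()
    kept = [captain]
    for frame in frames[1:]:
        # Mean cosine similarity between matching tokens of the two frames.
        num = (captain * frame).sum(-1)
        den = np.linalg.norm(captain, axis=-1) * np.linalg.norm(frame, axis=-1)
        sim = float((num / den).mean())
        if sim >= threshold:
            # Small change (the captain took a step): blend it in.
            captain += 0.5 * (frame - captain)
        else:
            # Big change (the captain turned into a superhero): keep it.
            captain = frame.astype(float).copy()
            kept.append(captain)
    return kept
```

A clip of ten nearly identical frames collapses into a single updated captain, while a scene cut starts a fresh one—that is how the flow of time survives the compression.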

Why is this a Big Deal?

  • Efficiency: It reduces the amount of data the AI has to process by 90%. It's like shrinking a 4K movie down to a tiny file without losing the plot.
  • Speed: Because there is less data, the AI answers questions much faster and uses less electricity.
  • Quality: Unlike other methods that make the video "blurry" or "forgetful," this method keeps the temporal fidelity (the flow of time) and visual fidelity (the details) intact.

The Bottom Line

This paper teaches the AI how to be a better editor. Instead of just cutting scenes out of a movie, it teaches the AI how to summarize the cut scenes and weave their most important details into the scenes that remain.

The result? A video AI that is fast, cheap to run, and still remembers exactly what happened in the video, even when it's only looking at a tiny fraction of the original data.