InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

InfoTok introduces an information-theoretic framework and a transformer-based adaptive compressor that dynamically allocates video tokens based on informational richness, achieving state-of-the-art compression rates while preserving performance.

Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu

Published 2026-03-24

Imagine you are trying to send a long, complex movie to a friend over a very slow internet connection.

The Old Way (Fixed Tokenizers):
Currently, most video AI systems treat every part of the movie the same. Imagine you have a rule: "No matter what happens in the movie, you must send 100 postcards to describe every single second."

  • The Problem: If the movie shows a dog sleeping peacefully for an hour, roughly 90 of every 100 postcards are wasted describing a dog that isn't moving. If the movie then cuts to a chaotic cat fight, the same 100 postcards per second are no longer enough, so you have to skip important details of the most exciting part.
  • The Result: You either waste bandwidth on boring parts or lose crucial details on exciting parts. It's inefficient.

The New Way (InfoTok):
The paper introduces InfoTok, a smart system inspired by a famous math theory from the 1940s (Shannon's Information Theory). Instead of sending a fixed number of postcards, InfoTok acts like a smart editor who decides how much detail to send based on how "busy" the scene is.

Here is how it works, using simple analogies:

1. The "Smart Router" (The Editor)

Think of the video as a stream of water.

  • Boring scenes (like a sleeping dog or a static wall) are like a calm, slow-moving stream. They don't need much water to describe them.
  • Exciting scenes (like a car crash or a dance battle) are like a raging waterfall. They need a lot of water to capture the chaos.

InfoTok has a Smart Router that looks at the video and asks: "How much information is actually happening right now?"

  • If the scene is boring, the router says, "Send only 20 postcards."
  • If the scene is chaotic, the router says, "Send 80 postcards!"

This ensures you never waste space on boring parts, and you always have enough space for the exciting parts.
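
The router idea above can be sketched in a few lines of Python. This is only an illustration, assuming each scene already has a made-up "information score"; the function name `allocate_tokens`, the scores, and the proportional-split rule are hypothetical and are not taken from the paper, which uses a learned model rather than a fixed formula.

```python
def allocate_tokens(scores, total_budget, min_tokens=0):
    """Split total_budget tokens across scenes in proportion to their scores.

    Hypothetical helper for illustration only, not the paper's router.
    """
    n = len(scores)
    if total_budget < n * min_tokens:
        raise ValueError("budget too small for the per-scene minimum")

    # Tokens left to distribute once every scene has its minimum.
    spare = total_budget - n * min_tokens
    total_score = sum(scores)
    if total_score == 0:
        shares = [spare / n] * n  # nothing stands out, so split evenly
    else:
        shares = [spare * s / total_score for s in scores]

    tokens = [min_tokens + int(share) for share in shares]

    # Hand out tokens lost to rounding down, largest fractional part first.
    leftover = total_budget - sum(tokens)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        tokens[i] += 1
    return tokens


# Assumed scores: the sleeping dog is 4x less "busy" than the cat fight.
print(allocate_tokens([1.0, 4.0], total_budget=100))  # [20, 80]
```

The total stays fixed at 100 postcards, but the split follows the content instead of being the same for every scene.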

2. The "Adaptive Compressor" (The Packing Expert)

Once the router decides how many postcards to send, the Adaptive Compressor gets to work.

  • Imagine you have a suitcase full of items (the video data).
  • The compressor looks at every item and asks, "Is this item important?"
  • It keeps the most "information-rich" items (the moving cat, the changing light) and throws away the redundant ones (the static background, the sleeping dog's fur that didn't move).
  • It then packs only the essential items into the number of postcards the router allowed.
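
As a rough sketch of the packing step, the snippet below keeps only the highest-scoring items up to the allowed budget. It is a simple top-k selection used as a stand-in: the paper's compressor is transformer-based, and the function name `keep_most_informative`, the item labels, and the scores here are all assumptions made for illustration.

```python
def keep_most_informative(items, scores, budget):
    """Return the `budget` highest-scoring items, kept in their original order.

    Hypothetical top-k stand-in, not the paper's learned compressor.
    """
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")

    # Rank item positions from most to least informative.
    ranked = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    # Keep the top `budget` positions, restored to their original order.
    kept = sorted(ranked[:budget])
    return [items[i] for i in kept]


suitcase = ["static background", "moving cat", "sleeping dog's fur", "changing light"]
importance = [0.1, 0.9, 0.05, 0.7]  # made-up scores for illustration

print(keep_most_informative(suitcase, importance, budget=2))
# ['moving cat', 'changing light']
```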

3. The "ELBO" (The Crystal Ball)

How does the computer know what is "important" without watching the whole video first?
The paper uses a mathematical trick called ELBO (Evidence Lower Bound). Think of this as a Crystal Ball that predicts how hard it will be to guess the next frame of the video.

  • If the Crystal Ball says, "It's very easy to guess what happens next (because the dog is sleeping)," the system knows it doesn't need many tokens.
  • If the Crystal Ball says, "It's impossible to guess what happens next (because the cat just jumped)," the system knows it needs many tokens to describe the surprise.
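
For readers who want the formula behind the Crystal Ball, the textbook form of the ELBO from variational inference is shown below. The notation is generic and assumed here, not copied from the paper: x is a video clip, z is its compressed code, q is the encoder, and p is the model that reconstructs the clip.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

In Shannon's theory, the ideal description length of x is -log p(x), so the negative of the right-hand side is an upper bound on how much information the clip carries. A clip that is easy to predict has a small bound and needs few tokens; a surprising clip has a large bound and needs many. How exactly InfoTok turns this bound into a token count is described in the paper itself.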

Why is this a big deal?

The researchers tested this on real videos and found:

  1. It saves space: They could cut the number of "postcards" (tokens) by 20% with no measurable drop in performance.
  2. It's faster: Old methods tried to guess the right amount of space by trial and error (sending a few, checking, sending more, checking again). InfoTok decides up front, reaching a 2.3x compression rate while still outperforming those earlier adaptive methods.
  3. It's smarter: It doesn't just compress; it understands complexity. A video of a still painting gets compressed heavily; a video of a soccer game gets compressed lightly to keep all the action.

The Bottom Line

InfoTok is like upgrading from a rigid, one-size-fits-all shipping box to a smart, shape-shifting suitcase. It automatically expands when you have a lot of stuff to pack and shrinks when you have little, ensuring you never run out of space for the important stuff and never waste space on the boring stuff. This makes AI video processing faster, cheaper, and capable of handling much longer movies.
