InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

Imagine you are trying to send a long, complex movie to a friend over a very slow internet connection.

The Old Way (Fixed Tokenizers):
Currently, most video AI systems treat every part of the movie the same. Imagine you have a rule: "No matter what happens in the movie, you must send 100 postcards to describe every single second."

The Problem: If the movie shows a dog sleeping peacefully for an hour, roughly 90 of every 100 postcards describe a dog that isn't moving. If the movie suddenly cuts to a chaotic cat fight, the same 100 postcards are too few for the most exciting part, so you have to skip important details.
The Result: You either waste bandwidth on boring parts or lose crucial details on exciting parts. It's inefficient.

The New Way (InfoTok):
The paper introduces InfoTok, a smart system inspired by a famous math theory from the 1940s (Shannon's Information Theory). Instead of sending a fixed number of postcards, InfoTok acts like a smart editor who decides how much detail to send based on how "busy" the scene is.

Here is how it works, using simple analogies:

1. The "Smart Router" (The Editor)

Think of the video as a stream of water.

Boring scenes (like a sleeping dog or a static wall) are like a calm, slow-moving stream. They don't need much water to describe them.
Exciting scenes (like a car crash or a dance battle) are like a raging waterfall. They need a lot of water to capture the chaos.

InfoTok has a Smart Router that looks at the video and asks: "How much information is actually happening right now?"

If the scene is boring, the router says, "Send only 20 postcards."
If the scene is chaotic, the router says, "Send 80 postcards!"

This ensures you never waste space on boring parts, and you always have enough space for the exciting parts.

2. The "Adaptive Compressor" (The Packing Expert)

Once the router decides how many postcards to send, the Adaptive Compressor gets to work.

Imagine you have a suitcase full of items (the video data).
The compressor looks at every item and asks, "Is this item important?"
It keeps the most "information-rich" items (the moving cat, the changing light) and throws away the redundant ones (the static background, the sleeping dog's fur that didn't move).
It then packs only the essential items into the number of postcards the router allowed.

3. The "ELBO" (The Crystal Ball)

How does the computer know what is "important" without watching the whole video first?
The paper uses a mathematical trick called ELBO (Evidence Lower Bound). Think of this as a Crystal Ball that predicts how hard it will be to guess the next frame of the video.

If the Crystal Ball says, "It's very easy to guess what happens next (because the dog is sleeping)," the system knows it doesn't need many tokens.
If the Crystal Ball says, "It's impossible to guess what happens next (because the cat just jumped)," the system knows it needs many tokens to describe the surprise.

Why is this a big deal?

The researchers tested this on real videos and found:

It saves space: They could cut the number of "postcards" (tokens) by 20% without losing any picture quality.
It's faster: Earlier adaptive methods found the right number of tokens by trial and error (sending a few, checking, sending more, checking again). InfoTok estimates it with a single extra pass. It also reaches similar quality with about 2.3 times fewer tokens than the previous adaptive method it was compared against.
It's smarter: It doesn't just compress; it understands complexity. A video of a still painting gets compressed heavily; a video of a soccer game gets compressed lightly to keep all the action.

The Bottom Line

InfoTok is like upgrading from a rigid, one-size-fits-all shipping box to a smart, shape-shifting suitcase. It automatically expands when you have a lot of stuff to pack and shrinks when you have little, ensuring you never run out of space for the important stuff and never waste space on the boring stuff. This makes AI video processing faster, cheaper, and capable of handling much longer movies.

1. Problem Statement

Current discrete video tokenizers face a fundamental bottleneck: they utilize fixed-rate compression, meaning they map every video frame to a predetermined number of tokens regardless of the video's actual content complexity.

Inefficiency: Simple, static videos (e.g., a sleeping dog) are under-compressed and carry redundant tokens, while complex, dynamic videos (e.g., a fight scene) are over-compressed and suffer information loss because they receive too few tokens.
Sub-optimality of Existing Adaptive Methods: Recent attempts at adaptive tokenization (e.g., ElasticTok) rely on heuristic, data-agnostic training (e.g., uniform random masking) and trial-and-error inference (binary search to find the optimal length). The authors prove theoretically that these methods are biased, leading to expected token lengths significantly larger than the theoretical optimum for a given reconstruction quality.
Goal: Develop a principled, theoretically optimal framework for adaptive video tokenization that dynamically allocates tokens based on the informational richness of the video content, minimizing redundancy without sacrificing reconstruction quality.

2. Methodology: INFOTOK

The authors propose INFOTOK, a framework grounded in Shannon's Information Theory and the Source Coding Theorem. The core insight is that the optimal token length for a video $x$ should be proportional to its negative log-likelihood ($-\log p(x)$), which represents its information content.
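
To make the source-coding intuition concrete, here is a minimal sketch in Python. The four video "types" and their probabilities are hypothetical values chosen for illustration, not data from the paper. It compares the ideal per-outcome code length $-\log_2 p(x)$ with the length a fixed-length code would have to spend on every outcome.

```python
import math

# Hypothetical source: four kinds of clips with assumed probabilities.
probs = {
    "static_wall": 0.5,
    "sleeping_dog": 0.25,
    "street_scene": 0.125,
    "cat_fight": 0.125,
}

def information_bits(p: float) -> float:
    """Shannon information content -log2(p) of an outcome with probability p."""
    return -math.log2(p)

# Ideal adaptive code length for each outcome, in bits.
adaptive_len = {name: information_bits(p) for name, p in probs.items()}

# The expected adaptive length equals the entropy of the source.
expected_adaptive = sum(p * adaptive_len[name] for name, p in probs.items())

# A fixed-length code spends ceil(log2(K)) bits on each of the K outcomes.
fixed_len = math.ceil(math.log2(len(probs)))

print(adaptive_len)
print(f"expected adaptive length: {expected_adaptive} bits, fixed length: {fixed_len} bits")
```

In this toy source the rare, surprising clip gets 3 bits and the common, predictable one gets 1 bit, and the average (1.75 bits) is below the 2 bits that a fixed-length code needs. The same reasoning motivates giving more tokens to videos with lower likelihood.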

The framework consists of three main components:

A. Theoretical Foundation

The paper rigorously proves that:

Fixed-length tokenizers are suboptimal because they ignore the varying information density of different videos.
Existing data-agnostic adaptive routers (using uniform distributions) are biased. They fail to incentivize the model to reduce token length for low-information data, resulting in an expected token length that can be arbitrarily larger than the theoretical optimum ($H_C(D)$).

B. The Router (ELBO-Based Length Selection)

To determine the optimal token count $N_x$ for a specific video without intractable log-likelihood calculations, INFOTOK uses the Evidence Lower Bound (ELBO) as a surrogate.

Mechanism: The router estimates the information complexity of the input video $x$ using the ELBO derived from a pre-trained fixed-length encoder-decoder.
Formula: The token length is determined by $r_\beta(N_x \mid x) = \delta\left(\beta \cdot \frac{\text{ELBO}(x)}{E[\text{ELBO}(x)]}\right)$, where $\beta$ is a compression factor controlling the average token budget.
Advantage: This allows the system to dynamically assign shorter sequences to low-complexity frames and longer sequences to high-complexity frames, aligning with information-theoretic optimality.
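
The routing rule above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the function name `route_token_lengths`, the `max_tokens` and `min_tokens` parameters, the rounding, and the clipping are all hypothetical. It also assumes the ELBO values are log-likelihood lower bounds (negative numbers), so that the ratio to the mean is positive, and that the ratio is interpreted as a fraction of a full fixed-length token budget.

```python
import numpy as np

def route_token_lengths(elbo, beta, max_tokens, min_tokens=1):
    """Map per-video ELBO values to integer token lengths.

    Assumes `elbo` holds negative log-likelihood lower bounds, so that
    elbo / mean(elbo) is positive and grows with information content.
    """
    elbo = np.asarray(elbo, dtype=np.float64)
    ratio = elbo / elbo.mean()        # relative complexity, averages to 1
    raw = beta * ratio * max_tokens   # beta scales the average budget
    return np.clip(np.rint(raw), min_tokens, max_tokens).astype(int)

# Four hypothetical videos, from easy to predict to hard to predict.
elbo_values = [-100.0, -200.0, -300.0, -400.0]
lengths = route_token_lengths(elbo_values, beta=0.5, max_tokens=1000)
print(lengths, lengths.mean())
```

With these inputs the harder videos receive proportionally more tokens, and the average length stays at `beta * max_tokens` as long as no value is clipped.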

C. The Adaptive Compressor

Once $N_x$ is determined, the system must compress the fixed-length latent embeddings $h$ into a sequence of length $N_x$.

Likelihood-Based Pruning: Instead of simple truncation or random masking, the compressor computes a binary mask based on per-token log-likelihoods (approximated via ELBO).
Process: It preserves the top $N_x$ tokens with the highest information content (that is, the lowest log-likelihood, since information content is $-\log p$) and discards the rest.
Architecture: The compressor and decompressor are implemented as Transformer layers (specifically 8-layer ViT with block-causal attention) that learn to redistribute information from masked tokens to the remaining active tokens.
Overhead: The mask is stored as part of the token sequence, adding a negligible overhead (~5% in token length).
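
The mask-selection step can be illustrated as follows. This sketch covers only the choice of which tokens to keep; it does not model the Transformer compressor and decompressor that redistribute information. It takes information content to be the negative per-token log-likelihood, consistent with the $-\log p(x)$ definition in this section. The function name and the example scores are hypothetical.

```python
import numpy as np

def topk_information_mask(token_log_lik, n_keep):
    """Boolean mask keeping the n_keep tokens with the highest information
    content, taken here as the negative per-token log-likelihood."""
    info = -np.asarray(token_log_lik, dtype=np.float64)
    n_keep = int(np.clip(n_keep, 0, info.size))
    mask = np.zeros(info.size, dtype=bool)
    if n_keep > 0:
        # Stable sort on -info so ties are broken by original position.
        keep_idx = np.argsort(-info, kind="stable")[:n_keep]
        mask[keep_idx] = True
    return mask

# Hypothetical per-token log-likelihoods for a six-token latent sequence.
log_lik = np.array([-0.1, -2.3, -0.2, -4.0, -1.5, -0.05])
mask = topk_information_mask(log_lik, n_keep=3)
kept = np.flatnonzero(mask)
print(mask, kept)
```

The three least predictable tokens (positions 1, 3, and 4) survive, while the nearly certain ones are dropped. Keeping the mask as a boolean array mirrors the idea that the mask itself has to be stored alongside the kept tokens.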

D. INFOTOK-Flex

To handle varying compression requirements without retraining, the authors introduce INFOTOK-Flex. This variant is trained with a mixture of different compression factors ($\beta$), allowing a single model to adapt to different target bitrates (BPP) at inference time.

3. Key Contributions

Theoretical Proof: A rigorous proof demonstrating that existing fixed-rate and data-agnostic adaptive tokenizers are inherently biased and inefficient compared to information-theoretic optimality.
INFOTOK Framework: A novel architecture that integrates an ELBO-based router for dynamic token length selection and a Transformer-based adaptive compressor for likelihood-aware token pruning.
Efficiency: The method eliminates the need for the expensive binary search (trial-and-error) used in prior adaptive methods, requiring only one additional decoder pass to compute the ELBO.
State-of-the-Art Performance: Empirical validation showing significant token savings while maintaining or improving reconstruction quality.

4. Experimental Results

The authors evaluated INFOTOK on TokenBench and DAVIS datasets, comparing it against fixed-length tokenizers (Cosmos, OmniTokenizer) and adaptive baselines (ElasticTok).

Token Efficiency:
- INFOTOK saves ~20% tokens compared to state-of-the-art fixed-length tokenizers (Cosmos) with no loss in reconstruction quality (PSNR, SSIM, LPIPS, FVD).
- At matched reconstruction quality, INFOTOK achieves a 2.3× better compression rate than ElasticTok (similar quality with 2.3× fewer tokens).
Reconstruction Quality:
- At BPP$_{16}$ = 0.81, INFOTOK achieves a PSNR of 30.08 (vs. 28.26 for ElasticTok) and an FVD of 49 (vs. 141 for ElasticTok) on TokenBench.
- INFOTOK matches the performance of the fixed-length Cosmos tokenizer (PSNR 30.01) but uses only 0.81 BPP (vs. 1.00 BPP for Cosmos).
Inference Efficiency:
- ElasticTok requires a binary search over token blocks, leading to 12 Network Forward Evaluations (NFEs) per video.
- INFOTOK requires only 2 NFEs (1 for ELBO estimation, 1 for decoding), resulting in a ~11× reduction in inference latency (1.23s vs. 13.45s per video).
Ablation Studies:
- The ELBO-based router performs nearly identically to an "Optimal" oracle that exhaustively searches for the best token length, validating the theoretical approach.
- Likelihood-based token pruning significantly outperforms random masking (R2L) or spatially dispersed masking (Jump).
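
The headline ratios in the token-efficiency and inference-efficiency results can be checked directly from the reported figures, and a small sketch shows why a search-based method needs many forward passes. The `binary_search_evals` helper and its quality predicate are hypothetical, and the candidate count of 4096 is an assumption chosen for illustration; the summary above does not state ElasticTok's actual search range.

```python
# Figures reported in the results above.
bpp_infotok, bpp_cosmos = 0.81, 1.00
latency_infotok_s, latency_elastictok_s = 1.23, 13.45
nfe_infotok, nfe_elastictok = 2, 12

token_saving = 1.0 - bpp_infotok / bpp_cosmos   # fraction of bits saved
latency_ratio = latency_elastictok_s / latency_infotok_s
nfe_ratio = nfe_elastictok / nfe_infotok

def binary_search_evals(n_candidates, is_good_enough):
    """Find the smallest length in 1..n_candidates that passes the check,
    and count how many checks (forward evaluations) were needed.

    Assumes quality is monotone in length and that n_candidates passes.
    """
    lo, hi, evals = 1, n_candidates, 0
    while lo < hi:
        mid = (lo + hi) // 2
        evals += 1
        if is_good_enough(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo, evals

# Hypothetical setting: 4096 candidate lengths, 1000 tokens are sufficient.
best_len, n_evals = binary_search_evals(4096, lambda n: n >= 1000)

print(f"bit saving vs. Cosmos: {token_saving:.0%}")
print(f"latency ratio: {latency_ratio:.1f}x, NFE ratio: {nfe_ratio:.0f}x")
print(f"binary search: length {best_len} after {n_evals} evaluations")
```

The BPP figures give a saving of about 19%, in line with the "~20%" claim, and the latency figures give a ratio of about 10.9, in line with "~11×". The search cost grows with the logarithm of the number of candidate lengths, whereas a router that predicts the length directly avoids the loop altogether.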

5. Significance and Future Impact

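Scalability for Long Videos: By reducing token counts by roughly 20% relative to fixed-length tokenizers without quality loss (and by more relative to prior adaptive methods), INFOTOK makes processing long video sequences more feasible for Transformer-based models, which are otherwise limited by the quadratic complexity of attention.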
Unified Multi-modal Models: The framework offers a principled way to unify vision and language tasks by providing a more efficient, information-dense representation that aligns better with Large Language Models (LLMs).
Generalizability: While focused on video, the information-theoretic principles apply to any data modality with variable information density, such as audio or 3D point clouds.
Paradigm Shift: The paper moves the field from heuristic, trial-and-error adaptive tokenization to a principled, theory-driven approach based on information theory.
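
The scalability point above can be quantified with a back-of-the-envelope calculation. The sketch assumes that cost is dominated by full self-attention, which grows with the square of the sequence length, and it ignores terms that grow linearly (such as the feed-forward layers). Both function names are hypothetical.

```python
def relative_attention_cost(token_fraction: float) -> float:
    """Relative cost of full self-attention when the sequence length is
    scaled by token_fraction, assuming cost grows with length squared."""
    if token_fraction <= 0:
        raise ValueError("token_fraction must be positive")
    return token_fraction ** 2

def video_length_gain(token_fraction: float) -> float:
    """How much longer a video fits in the same token budget when each
    video needs only token_fraction of the original tokens."""
    if token_fraction <= 0:
        raise ValueError("token_fraction must be positive")
    return 1.0 / token_fraction

# A 20% token reduction keeps 80% of the tokens.
cost = relative_attention_cost(0.8)
gain = video_length_gain(0.8)
print(f"attention cost: {cost:.2f}x, video length at equal budget: {gain:.2f}x")
```

Under these assumptions, a 20% token reduction cuts the attention cost to roughly 64% of the original, or lets a model with a fixed token budget cover a video about 1.25 times as long.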

In conclusion, INFOTOK shows that adaptive tokenization driven by information theory can achieve near-optimal compression efficiency while matching or exceeding the reconstruction quality of current state-of-the-art tokenizers.