OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

The paper introduces OneVision-Encoder, a multimodal architecture that aligns with video codec principles by focusing computation on sparse, high-entropy regions rather than uniform pixel grids, thereby achieving superior efficiency and accuracy across image, video, and document understanding benchmarks.

Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng

Published 2026-02-27

Imagine you are trying to watch a 2-hour movie, but you only have enough energy to read 10 pages of the script.

The Old Way (Current AI):
Most current AI models try to read every single word of the script, from the opening credits to the final fade-out. They read the descriptions of the trees, the clouds, the walls, and the silence just as intently as the part where the hero punches the villain. They waste their limited energy on the boring, static parts, leaving them exhausted and confused when the exciting action finally happens. They treat every moment of the video as equally important.

The New Way (OneVision-Encoder):
The researchers behind OneVision-Encoder realized that human brains (and video compression technology) don't work that way. We don't remember the static background; we remember the changes.

They built an AI that works like a smart movie editor or a surveillance camera with a brain. Here is how it works, using simple analogies:

1. The "Codec" Secret (The Magic Trick)

Think of how your phone compresses a video to save space. It doesn't save every single frame perfectly.

  • I-Frames (The Snapshot): It saves one full, clear picture of the whole scene (like a photo).
  • P-Frames (The Updates): For the frames that follow, it doesn't save the whole picture again. It only saves a tiny note saying, "The guy in the red shirt moved 2 inches to the left."

Current AI ignores these "tiny notes" and tries to re-read the whole picture every time. OneVision-Encoder is the first to say: "Hey, let's just read the tiny notes! That's where the actual story is happening."
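To make the magic trick concrete, here is a minimal Python sketch of the codec idea: keep one full keyframe, then record only the patches that changed. The `codec_view` function, the 16-pixel patch size, and the threshold are illustrative choices, not the paper's actual code.

```python
# A minimal sketch of the codec idea: one full I-frame, then sparse
# P-frame "notes" listing only the patches that changed.
# Illustrative names and thresholds, not the paper's implementation.
import numpy as np

def codec_view(frames, patch=16, threshold=8.0):
    """frames: (T, H, W) grayscale clip, H and W divisible by `patch`.

    Returns the first frame (the I-frame) plus, for every later frame,
    the (row, col) grid indices of patches whose mean absolute change
    exceeds `threshold` -- the "tiny notes".
    """
    keyframe = frames[0]
    updates = []
    prev = frames[0].astype(np.float32)
    for frame in frames[1:]:
        cur = frame.astype(np.float32)
        diff = np.abs(cur - prev)
        gh, gw = diff.shape[0] // patch, diff.shape[1] // patch
        # Average the change inside each patch of the grid.
        patch_diff = diff.reshape(gh, patch, gw, patch).mean(axis=(1, 3))
        updates.append(np.argwhere(patch_diff > threshold))
        prev = cur
    return keyframe, updates
```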

2. The "Motion Detective" (Spotting the Action)

Instead of looking at the whole screen, this AI acts like a motion detective.

  • If a tree is swaying in the wind, the AI ignores it (it's predictable).
  • If a bird suddenly flies across the screen, the AI zooms in on that bird.
  • If a car crashes, the AI focuses entirely on the crash.

It only pays attention to the 3% to 25% of the video that is actually changing or surprising. It ignores the boring, static background. This is like reading a book but only highlighting the sentences where the plot twists, skipping all the descriptions of the furniture.
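In code, the detective can be as simple as keeping the top-k most-changed patches. This is a toy sketch that assumes a per-patch motion score (like `patch_diff` above) is already computed; the 10% `keep_ratio` is just a stand-in for the paper's 3%-25% range.

```python
import numpy as np

def select_salient_patches(patch_diff, keep_ratio=0.10):
    """patch_diff: flat array of per-patch motion scores for one frame.

    Keeps only the most-changed fraction of patches (a stand-in for
    the 3%-25% the paper reports attending to) and returns their indices.
    """
    k = max(1, int(keep_ratio * patch_diff.size))
    # argpartition finds the k largest scores without a full sort.
    return np.argpartition(patch_diff, -k)[-k:]
```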

3. The "Smart Token" Budget

Imagine you have a bucket of 2,000 LEGO bricks (these are the "tokens" the AI uses to understand the video).

  • Old AI: Uses 1,000 bricks to build a static wall (the background) and 1,000 bricks to build a tiny, blurry action scene. The result is a messy, inefficient model.
  • OneVision-Encoder: Uses 200 bricks to build the wall (just enough to know where you are) and dumps the remaining 1,800 bricks into building a detailed, high-speed action scene.

Because it focuses its "LEGO bricks" only on the important parts, it understands the video better and faster, even though it uses fewer resources.
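One way such a budget could be split, as a rough sketch: give every frame a small baseline, then hand the spare bricks to the frames with the most action. The `allocate_tokens` helper, the per-frame floor, and the 2,000-token total are illustrative assumptions, not the paper's actual allocation rule.

```python
import numpy as np

def allocate_tokens(motion_energy, total_budget=2000, floor=50):
    """motion_energy: (T,) how much each of T frames changed.

    Every frame keeps a small `floor` of tokens (enough to know where
    you are); the spare budget goes to the frames with the most action.
    Assumes floor * T <= total_budget.
    """
    T = motion_energy.shape[0]
    base = np.full(T, floor)
    spare = total_budget - floor * T
    weights = motion_energy / (motion_energy.sum() + 1e-8)
    extra = np.floor(spare * weights).astype(int)
    return base + extra
```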

4. The "Cluster" Teacher (Learning by Grouping)

How does the AI know what is important? It doesn't just guess.
Imagine a teacher with a giant whiteboard containing 1 million categories (like "jumping," "cooking," "falling," "smiling").
Instead of just saying "This is a dog," the AI learns to group similar actions together. It learns that "a dog jumping" and "a cat jumping" belong in the same "Jumping" cluster, while "a dog sleeping" belongs in a different cluster. This helps it understand the meaning of the movement, not just the pixels.
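A toy version of what such a cluster-discrimination objective can look like, assuming L2-normalized features and a large bank of cluster centers. The loss below is plain softmax cross-entropy over centers, which may differ from the paper's exact formulation.

```python
import numpy as np

def cluster_discrimination_loss(feats, centers, labels, tau=0.07):
    """feats:   (B, D) L2-normalized clip embeddings.
    centers: (K, D) L2-normalized cluster centers (K could be ~1M).
    labels:  (B,)   index of each clip's assigned cluster.

    Pulls each clip toward its own center and away from the others,
    so "a dog jumping" and "a cat jumping" land near the same center.
    """
    logits = feats @ centers.T / tau                 # (B, K) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```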

The Result: Why Does This Matter?

The paper shows that this new AI is a superstar:

  • It's Smarter: It beats other top models (like Qwen3 and SigLIP) on video understanding tests, even though it was trained on much less data.
  • It's Efficient: It gets better results while looking at far fewer pixels.
  • It's Universal: It works great on single images, short clips, and long movies.

In a nutshell:
OneVision-Encoder stops trying to memorize the entire movie frame-by-frame. Instead, it learns to watch the movie like a human: ignoring the boring background and focusing entirely on the action, the movement, and the surprises. By aligning its brain with how video technology (codecs) actually works, it has created a much more efficient and intelligent way for machines to "see."
