OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

The paper introduces OneVision-Encoder, a multimodal architecture that aligns with video codec principles by focusing computation on sparse, high-entropy regions rather than uniform pixel grids, thereby achieving superior efficiency and accuracy across image, video, and document understanding benchmarks.

Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng

Published 2026-02-27

Imagine you are trying to watch a 2-hour movie, but you only have enough energy to read 10 pages of the script.

The Old Way (Current AI):
Most current AI models try to read every single word of the script, from the opening credits to the final fade-out. They read the descriptions of the trees, the clouds, the walls, and the silence just as intently as the part where the hero punches the villain. They waste their limited energy on the boring, static parts, leaving them exhausted and confused when the exciting action finally happens. They treat every moment of the video as equally important.

The New Way (OneVision-Encoder):
The researchers behind OneVision-Encoder realized that human brains (and video compression technology) don't work that way. We don't remember the static background; we remember the changes.

They built an AI that works like a smart movie editor or a surveillance camera with a brain. Here is how it works, using simple analogies:

1. The "Codec" Secret (The Magic Trick)

Think of how your phone compresses a video to save space. It doesn't save every single frame perfectly.

  • I-Frames (The Snapshot): It saves one full, clear picture of the whole scene (like a photo).
  • P-Frames (The Updates): For the frames that follow, it doesn't save the whole picture again. It only saves a tiny note saying, "The guy in the red shirt moved 2 inches to the left."

Current AI ignores these "tiny notes" and tries to re-read the whole picture every time. OneVision-Encoder is the first to say: "Hey, let's just read the tiny notes! That's where the actual story is happening."
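To make the magic trick concrete, here is a minimal Python sketch of the codec idea: keep one full keyframe, then record only the patches that changed. The `codec_view` function, the 16-pixel patch size, and the threshold are illustrative choices, not the paper's actual code.

```python
# A minimal sketch of the codec idea: one full I-frame, then sparse
# P-frame "notes" listing only the patches that changed.
# Illustrative names and thresholds, not the paper's implementation.
import numpy as np

def codec_view(frames, patch=16, threshold=8.0):
    """frames: (T, H, W) grayscale clip, H and W divisible by `patch`.

    Returns the first frame (the I-frame) plus, for every later frame,
    the (row, col) grid indices of patches whose mean absolute change
    exceeds `threshold` -- the "tiny notes".
    """
    keyframe = frames[0]
    updates = []
    prev = frames[0].astype(np.float32)
    for frame in frames[1:]:
        cur = frame.astype(np.float32)
        diff = np.abs(cur - prev)
        gh, gw = diff.shape[0] // patch, diff.shape[1] // patch
        # Average the change inside each patch of the grid.
        patch_diff = diff.reshape(gh, patch, gw, patch).mean(axis=(1, 3))
        updates.append(np.argwhere(patch_diff > threshold))
        prev = cur
    return keyframe, updates
```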

2. The "Motion Detective" (Spotting the Action)

Instead of looking at the whole screen, this AI acts like a motion detective.

  • If a tree is swaying in the wind, the AI ignores it (it's predictable).
  • If a bird suddenly flies across the screen, the AI zooms in on that bird.
  • If a car crashes, the AI focuses entirely on the crash.

It only pays attention to the 3% to 25% of the video that is actually changing or surprising. It ignores the boring, static background. This is like reading a book but only highlighting the sentences where the plot twists, skipping all the descriptions of the furniture.
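In code, the detective can be as simple as keeping the top-k most-changed patches. This is a toy sketch that assumes a per-patch motion score (like `patch_diff` above) is already computed; the 10% `keep_ratio` is just a stand-in for the paper's 3%-25% range.

```python
import numpy as np

def select_salient_patches(patch_diff, keep_ratio=0.10):
    """patch_diff: flat array of per-patch motion scores for one frame.

    Keeps only the most-changed fraction of patches (a stand-in for
    the 3%-25% the paper reports attending to) and returns their indices.
    """
    k = max(1, int(keep_ratio * patch_diff.size))
    # argpartition finds the k largest scores without a full sort.
    return np.argpartition(patch_diff, -k)[-k:]
```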

3. The "Smart Token" Budget

Imagine you have a bucket of 2,000 LEGO bricks (these are the "tokens" the AI uses to understand the video).

  • Old AI: Uses 1,000 bricks to build a static wall (the background) and 1,000 bricks to build a tiny, blurry action scene. The result is a messy, inefficient model.
  • OneVision-Encoder: Uses 200 bricks to build the wall (just enough to know where you are) and dumps the remaining 1,800 bricks into building a detailed, high-speed action scene.

Because it focuses its "LEGO bricks" only on the important parts, it understands the video better and faster, even though it uses fewer resources.
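One way such a budget could be split, as a rough sketch: give every frame a small baseline, then hand the spare bricks to the frames with the most action. The `allocate_tokens` helper, the per-frame floor, and the 2,000-token total are illustrative assumptions, not the paper's actual allocation rule.

```python
import numpy as np

def allocate_tokens(motion_energy, total_budget=2000, floor=50):
    """motion_energy: (T,) how much each of T frames changed.

    Every frame keeps a small `floor` of tokens (enough to know where
    you are); the spare budget goes to the frames with the most action.
    Assumes floor * T <= total_budget.
    """
    T = motion_energy.shape[0]
    base = np.full(T, floor)
    spare = total_budget - floor * T
    weights = motion_energy / (motion_energy.sum() + 1e-8)
    extra = np.floor(spare * weights).astype(int)
    return base + extra
```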

4. The "Cluster" Teacher (Learning by Grouping)

How does the AI know what is important? It doesn't just guess.
Imagine a teacher with a giant whiteboard containing 1 million categories (like "jumping," "cooking," "falling," "smiling").
Instead of just saying "This is a dog," the AI learns to group similar actions together. It learns that "a dog jumping" and "a cat jumping" belong in the same "Jumping" cluster, while "a dog sleeping" belongs in a different cluster. This helps it understand the meaning of the movement, not just the pixels.
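A toy version of what such a cluster-discrimination objective can look like, assuming L2-normalized features and a large bank of cluster centers. The loss below is plain softmax cross-entropy over centers, which may differ from the paper's exact formulation.

```python
import numpy as np

def cluster_discrimination_loss(feats, centers, labels, tau=0.07):
    """feats:   (B, D) L2-normalized clip embeddings.
    centers: (K, D) L2-normalized cluster centers (K could be ~1M).
    labels:  (B,)   index of each clip's assigned cluster.

    Pulls each clip toward its own center and away from the others,
    so "a dog jumping" and "a cat jumping" land near the same center.
    """
    logits = feats @ centers.T / tau                 # (B, K) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```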

The Result: Why Does This Matter?

The paper shows that this new AI is a superstar:

  • It's Smarter: It beats other top models (like Qwen3 and SigLIP) on video understanding tests, even though it was trained on much less data.
  • It's Efficient: It gets better results while looking at far fewer pixels.
  • It's Universal: It works great on single images, short clips, and long movies.

In a nutshell:
OneVision-Encoder stops trying to memorize the entire movie frame-by-frame. Instead, it learns to watch the movie like a human: ignoring the boring background and focusing entirely on the action, the movement, and the surprises. By aligning its brain with how video technology (codecs) actually works, it has created a much more efficient and intelligent way for machines to "see."
