Generative Neural Video Compression via Video Diffusion Prior

GNVC-VD is the first DiT-based generative video compression framework to unify spatio-temporal latent compression with sequence-level refinement. It uses a video diffusion prior to suppress perceptual flickering and preserve quality under extreme bitrate constraints.

Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, Siwei Ma

Published 2026-02-24

Imagine you are trying to send a high-definition movie to a friend, but your internet connection is so slow that you can only send a tiny, tiny fraction of the data.

In the past, if you tried to squeeze a movie into such a small space, the result was a blurry, muddy mess. It looked like a painting made by someone with shaky hands. This is what traditional video compression does: it throws away details to save space, leaving you with a smooth, fuzzy image.

To fix this, scientists started using Generative AI (like the tech behind Sora or DALL-E). Think of this like hiring a talented artist to "guess" the missing details. Instead of just sending a blurry photo, the AI looks at the blurry photo and says, "I know what a dog's fur looks like; I'll draw the fur back in."

The Problem with Previous Attempts:
The problem with earlier AI video compressors was that they were amnesiac artists. They looked at one frame (one single picture) and tried to guess the details for that specific moment. But when they moved to the next frame, they forgot what the previous frame looked like.

  • Result: The video would look sharp for a split second, then the texture would jump, flicker, or change shape wildly from frame to frame. It's like watching a movie where the actor's face morphs into a different person every time they blink. This is called temporal flickering.

The Solution: GNVC-VD (The "Memory-Keeping" Director)

The paper introduces a new system called GNVC-VD. Think of this system not as a painter looking at one canvas at a time, but as a film director who watches the whole movie scene at once.

Here is how it works, using simple analogies:

1. The "Video-Native" Brain

Previous AI tools were trained on images (static photos). They didn't understand that video is a sequence of moving parts.
GNVC-VD uses a Video Diffusion Model. Imagine a student who has watched millions of hours of movies. They don't just know what a "face" looks like; they know how a face moves, how hair flows in the wind, and how light changes over time. This is the "Video-Native Prior." It understands the flow of time, not just the snapshot.

2. The "Correction" Strategy (Flow-Matching)

Usually, when AI generates an image, it starts with pure static noise (like TV snow) and slowly cleans it up until an image appears.

  • The Old Way: Start with TV snow → Clean up → Get a blurry frame → Repeat for the next frame (forgetting the first).
  • The GNVC-VD Way:
    1. First, it sends a very compressed, blurry version of the video (the "skeleton").
    2. Instead of starting from TV snow, the AI takes that blurry skeleton and says, "I know what this scene should look like based on the whole movie."
    3. It calculates a correction term. Think of it like a GPS. The blurry video is the "current location," and the AI knows the "destination" (the perfect video). It draws a smooth path to get there, adding the missing details (like skin texture or grass) in a way that matches the movement of the previous and next frames.
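The GPS idea above can be sketched in a few lines of toy code. This is an illustration of the general flow-matching recipe, not the paper's implementation: the function names, the straight-line velocity, and the tiny 4-frame "video" are all stand-ins. The key point it demonstrates is that refinement starts from the blurry skeleton (not from pure noise) and follows a smooth path toward the clean video.

```python
import numpy as np

# Toy flow-matching refinement sketch (illustrative, not the paper's code).
# x_blurry stands in for the decoded, heavily compressed "skeleton";
# x_clean stands in for the sharp video the diffusion prior predicts.
# Flow matching learns a velocity field v(x, t) pointing from the current
# point toward the clean data; here the straight-line (rectified) velocity
# plays the role of the learned network.

def velocity(x, x_clean):
    # Ideal straight-line velocity: points from x toward the clean sample.
    return x_clean - x

def refine(x_blurry, x_clean, steps=10):
    # Euler integration of dx/dt = v(x, t): start at the blurry skeleton
    # and move along the correction path instead of starting from noise.
    x = x_blurry.copy()
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + dt * velocity(x, x_clean)
    return x

# A 4-frame "video" with 2 values per frame, purely for shape intuition.
x_clean = np.arange(8, dtype=float).reshape(4, 2)
x_blurry = x_clean + np.random.default_rng(0).normal(0, 0.5, x_clean.shape)
x_refined = refine(x_blurry, x_clean)
```

Because every frame rides the same velocity field toward the same target sequence, the added details stay consistent from frame to frame, which is exactly what kills the flickering.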

3. The "Adapter" (The Translator)

The AI model was trained on perfect, high-quality videos. But the video it's receiving is a compressed, messy version.
GNVC-VD uses a special Adapter (like a translator). It takes the messy, compressed data and translates it into a language the AI understands. This ensures the AI doesn't get confused and hallucinate weird things (like turning a car into a cat). It tells the AI: "Fix the texture, but keep the car looking like a car and moving in the same direction."
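The translator idea can be sketched as a small learned module bolted onto a frozen generator. Everything below is an assumption for illustration (the layer sizes, the two-layer MLP, and the additive injection are not the paper's architecture): it only shows the general pattern of projecting the degraded, compressed latent into conditioning features that steer the pretrained model.

```python
import numpy as np

# Minimal "adapter" sketch (illustrative; names and sizes are assumptions).
# The adapter translates the compressed, degraded latent into the feature
# space the pretrained diffusion backbone expects, so the prior is guided
# by the actual content instead of hallucinating freely.

rng = np.random.default_rng(0)

class Adapter:
    def __init__(self, in_dim, cond_dim):
        # A tiny two-layer MLP standing in for the learned adapter.
        self.w1 = rng.normal(0, 0.1, (in_dim, cond_dim))
        self.w2 = rng.normal(0, 0.1, (cond_dim, cond_dim))

    def __call__(self, z_compressed):
        h = np.maximum(z_compressed @ self.w1, 0.0)  # ReLU projection
        return h @ self.w2                           # conditioning features

# The diffusion backbone stays frozen; only the adapter's output is
# injected (here: simple additive conditioning on a feature map).
adapter = Adapter(in_dim=16, cond_dim=32)
z = rng.normal(size=(4, 16))         # 4 frames of compressed latents
cond = adapter(z)                    # (4, 32) conditioning signal
features = rng.normal(size=(4, 32))  # frozen backbone features (stand-in)
guided = features + cond             # adapter steers the generation
```

Training only this small module, rather than the whole backbone, is what keeps the "car stays a car" constraint: the frozen prior supplies realism, while the adapter pins it to the transmitted content.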

The Result: A Stable, Sharp Movie

Because this system looks at the entire sequence of frames together (not just one by one), it ensures that:

  • Textures stay sharp: The AI fills in the missing details.
  • Motion stays smooth: The AI remembers that if a hand was moving left in frame 1, it must continue moving left in frame 2. No more flickering or morphing faces.

In a Nutshell

  • Traditional Compression: Sends a blurry photo. (Low quality, stable).
  • Old Generative Compression: Sends a sharp photo that flickers and changes shape wildly. (High quality, unstable).
  • GNVC-VD (This Paper): Sends a blurry photo, and uses a "movie-smart" AI to fill in the missing details in a way that is both sharp and perfectly smooth, even when the internet connection is terrible.

It's the difference between hiring a painter who only sees one second of a movie and hiring a director who sees the whole scene, ensuring the story flows perfectly from start to finish.
