Generative Neural Video Compression via Video Diffusion Prior

GNVC-VD is the first DiT-based generative video compression framework to unify spatio-temporal latent compression with sequence-level refinement. It uses a video diffusion prior to suppress perceptual flickering and preserve quality under extreme bitrate constraints.

Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, Siwei Ma

Published 2026-02-24

Imagine you are trying to send a high-definition movie to a friend, but your internet connection is so slow that you can only send a tiny, tiny fraction of the data.

In the past, if you tried to squeeze a movie into such a small space, the result was a blurry, muddy mess. It looked like a painting made by someone with shaky hands. This is what traditional video compression does: it throws away details to save space, leaving you with a smooth, fuzzy image.

To fix this, scientists started using Generative AI (like the tech behind Sora or DALL-E). Think of this like hiring a talented artist to "guess" the missing details. Instead of just sending a blurry photo, the AI looks at the blurry photo and says, "I know what a dog's fur looks like; I'll draw the fur back in."

The Problem with Previous Attempts:
The problem with earlier AI video compressors was that they were amnesiac artists. They looked at one frame (one single picture) and tried to guess the details for that specific moment. But when they moved to the next frame, they forgot what the previous frame looked like.

  • Result: The video would look sharp for a split second, then the texture would jump, flicker, or change shape wildly from frame to frame. It's like watching a movie where the actor's face morphs into a different person every time they blink. This is called temporal flickering.

The Solution: GNVC-VD (The "Memory-Keeping" Director)

The paper introduces a new system called GNVC-VD. Think of this system not as a painter looking at one canvas at a time, but as a film director who watches the whole movie scene at once.

Here is how it works, using simple analogies:

1. The "Video-Native" Brain

Previous AI tools were trained on images (static photos). They didn't understand that video is a sequence of moving parts.
GNVC-VD uses a Video Diffusion Model. Imagine a student who has watched millions of hours of movies. They don't just know what a "face" looks like; they know how a face moves, how hair flows in the wind, and how light changes over time. This is the "Video-Native Prior." It understands the flow of time, not just the snapshot.

2. The "Correction" Strategy (Flow-Matching)

Usually, when AI generates an image, it starts with pure static noise (like TV snow) and slowly cleans it up until an image appears.

  • The Old Way: Start with TV snow → Clean up → Get a blurry frame → Repeat for the next frame (forgetting the first).
  • The GNVC-VD Way:
    1. First, it sends a very compressed, blurry version of the video (the "skeleton").
    2. Instead of starting from TV snow, the AI takes that blurry skeleton and says, "I know what this scene should look like based on the whole movie."
    3. It calculates a correction term. Think of it like a GPS. The blurry video is the "current location," and the AI knows the "destination" (the perfect video). It draws a smooth path to get there, adding the missing details (like skin texture or grass) in a way that matches the movement of the previous and next frames.
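The GPS idea above can be sketched in a few lines of toy code. This is an illustration of the general flow-matching recipe, not the paper's implementation: the function names, the straight-line velocity, and the tiny 4-frame "video" are all stand-ins. The key point it demonstrates is that refinement starts from the blurry skeleton (not from pure noise) and follows a smooth path toward the clean video.

```python
import numpy as np

# Toy flow-matching refinement sketch (illustrative, not the paper's code).
# x_blurry stands in for the decoded, heavily compressed "skeleton";
# x_clean stands in for the sharp video the diffusion prior predicts.
# Flow matching learns a velocity field v(x, t) pointing from the current
# point toward the clean data; here the straight-line (rectified) velocity
# plays the role of the learned network.

def velocity(x, x_clean):
    # Ideal straight-line velocity: points from x toward the clean sample.
    return x_clean - x

def refine(x_blurry, x_clean, steps=10):
    # Euler integration of dx/dt = v(x, t): start at the blurry skeleton
    # and move along the correction path instead of starting from noise.
    x = x_blurry.copy()
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + dt * velocity(x, x_clean)
    return x

# A 4-frame "video" with 2 values per frame, purely for shape intuition.
x_clean = np.arange(8, dtype=float).reshape(4, 2)
x_blurry = x_clean + np.random.default_rng(0).normal(0, 0.5, x_clean.shape)
x_refined = refine(x_blurry, x_clean)
```

Because every frame rides the same velocity field toward the same target sequence, the added details stay consistent from frame to frame, which is exactly what kills the flickering.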

3. The "Adapter" (The Translator)

The AI model was trained on perfect, high-quality videos. But the video it's receiving is a compressed, messy version.
GNVC-VD uses a special Adapter (like a translator). It takes the messy, compressed data and translates it into a language the AI understands. This ensures the AI doesn't get confused and hallucinate weird things (like turning a car into a cat). It tells the AI: "Fix the texture, but keep the car looking like a car and moving in the same direction."
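The translator idea can be sketched as a small learned module bolted onto a frozen generator. Everything below is an assumption for illustration (the layer sizes, the two-layer MLP, and the additive injection are not the paper's architecture): it only shows the general pattern of projecting the degraded, compressed latent into conditioning features that steer the pretrained model.

```python
import numpy as np

# Minimal "adapter" sketch (illustrative; names and sizes are assumptions).
# The adapter translates the compressed, degraded latent into the feature
# space the pretrained diffusion backbone expects, so the prior is guided
# by the actual content instead of hallucinating freely.

rng = np.random.default_rng(0)

class Adapter:
    def __init__(self, in_dim, cond_dim):
        # A tiny two-layer MLP standing in for the learned adapter.
        self.w1 = rng.normal(0, 0.1, (in_dim, cond_dim))
        self.w2 = rng.normal(0, 0.1, (cond_dim, cond_dim))

    def __call__(self, z_compressed):
        h = np.maximum(z_compressed @ self.w1, 0.0)  # ReLU projection
        return h @ self.w2                           # conditioning features

# The diffusion backbone stays frozen; only the adapter's output is
# injected (here: simple additive conditioning on a feature map).
adapter = Adapter(in_dim=16, cond_dim=32)
z = rng.normal(size=(4, 16))         # 4 frames of compressed latents
cond = adapter(z)                    # (4, 32) conditioning signal
features = rng.normal(size=(4, 32))  # frozen backbone features (stand-in)
guided = features + cond             # adapter steers the generation
```

Training only this small module, rather than the whole backbone, is what keeps the "car stays a car" constraint: the frozen prior supplies realism, while the adapter pins it to the transmitted content.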

The Result: A Stable, Sharp Movie

Because this system looks at the entire sequence of frames together (not just one by one), it ensures that:

  • Textures stay sharp: The AI fills in the missing details.
  • Motion stays smooth: The AI remembers that if a hand was moving left in frame 1, it must continue moving left in frame 2. No more flickering or morphing faces.

In a Nutshell

  • Traditional Compression: Sends a blurry photo. (Low quality, stable).
  • Old Generative Compression: Sends a sharp photo that flickers and changes shape wildly. (High quality, unstable).
  • GNVC-VD (This Paper): Sends a blurry photo, and uses a "movie-smart" AI to fill in the missing details in a way that is both sharp and perfectly smooth, even when the internet connection is terrible.

It's the difference between hiring a painter who only sees one second of a movie and hiring a director who sees the whole scene, ensuring the story flows perfectly from start to finish.
