Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

The Big Problem: The "Memory Overflow" in AI Video

Imagine you are trying to tell a very long, complex story to a friend (the AI). To keep the story consistent, your friend needs to remember every detail you've said so far.

The Old Way (Bidirectional Models): Imagine your friend tries to hold the entire story in mind at once, from beginning to end, before speaking a single word. This is great for quality, but if the story gets too long (like a 10-minute movie), your friend's brain (the computer's GPU memory) simply runs out of room. They can't hold that much information at once.
The New Way (Auto-Regressive Models): To fix this, we told the friend to tell the story one sentence at a time, keeping notes on what they have already said. This is much better for memory. However, there's a catch: as the story gets longer, the "memory note" they have to keep grows bigger and bigger. Eventually, even this method runs out of space.

The Bottleneck: The AI needs to keep a massive "scratchpad" (called the KV-Cache) to remember the video it has already generated. For long videos, this scratchpad takes up so much memory (often 30+ GB) that it won't fit on standard computers, and it forces the AI to forget old details, causing the video to look weird or drift off-topic.
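To make the growth concrete, here is a rough back-of-the-envelope sketch of how a KV-cache scales for a standard multi-head attention model. The function name and every dimension in `config` are made-up placeholders for illustration; none of them come from the paper or from any specific video model.

```python
def kv_cache_bytes(num_layers, num_tokens, num_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size for standard multi-head attention.

    Each layer stores one key vector and one value vector (hence the factor 2)
    per cached token and per head. bytes_per_value=2 corresponds to 16-bit storage.
    """
    return 2 * num_layers * num_tokens * num_heads * head_dim * bytes_per_value


# Hypothetical model dimensions, chosen only to show the trend.
config = dict(num_layers=32, num_heads=32, head_dim=128)

for tokens in (10_000, 20_000, 40_000):
    gib = kv_cache_bytes(num_tokens=tokens, **config) / 2**30
    print(f"{tokens:>6} cached tokens -> {gib:.1f} GiB")
```

The point is only the shape of the curve: the cache grows linearly with the number of cached tokens, so every extra second of video history costs a fixed additional amount of memory.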

The Solution: Quant VideoGen (QVG)

The researchers created a new trick called Quant VideoGen (QVG). Think of it as a super-smart compression algorithm for the AI's memory.

Instead of storing every single detail of the video's history in high-definition (which takes up huge space), QVG stores it in a highly compressed, efficient way without losing the "soul" of the video.

Here is how they did it, using two main tricks:

Trick 1: Semantic-Aware Smoothing (The "Group Hug" Strategy)

The Problem: In a video, the numbers the AI uses to remember things are messy and chaotic. Some numbers are huge, some are tiny. It's like trying to pack a suitcase where you have a giant beach ball, a tiny pebble, and a heavy brick all mixed together. You can't compress them easily because they are so different.

The Solution:

Grouping: QVG looks at the video's memory and says, "Hey, these 100 pixels are all part of a blue sky, and these 100 pixels are part of a green tree." It groups similar things together (like sorting socks by color).
Subtracting the Average: For the "blue sky" group, it calculates the average color and writes that down once. Then, it only stores the tiny differences (residuals) between the actual pixels and that average.
- Analogy: Instead of writing "Sky is #0000FF, Sky is #0000FF, Sky is #0000FF..." 1,000 times, you just write "Sky is #0000FF" once, and then write "0 difference" for the rest.
The Result: The numbers left over are very small and uniform. This makes them incredibly easy to shrink (quantize) down to just 2 bits (a tiny amount of data) without losing meaning.

Trick 2: Progressive Residual Quantization (The "Layered Cake" Strategy)

The Problem: Even after grouping, there are still some tiny details left over that need to be stored. If you try to squash them all at once, you lose quality.

The Solution:
Think of this like drawing a picture or streaming a video.

Layer 1 (The Sketch): First, you store the rough outline and main colors.
Layer 2 (The Details): Then, you store the small details that make the picture look real.
Layer 3 (The Polish): Finally, you store the super-fine textures.

QVG does this in stages. It compresses the "rough sketch" first, then compresses the "details" of the remaining errors, and so on. This allows the AI to keep the most important visual information and give up only the finest details that matter least.

Why This Matters: The Magic Results

Before this paper, making a long, high-quality video on a single consumer graphics card (like an RTX 4090) was impossible because the memory requirements were too high.

With Quant VideoGen:

7× Smaller: They reduced the memory needed by 7 times. A video that used to need a 34 GB memory chunk now only needs 5 GB.
Almost No Quality Loss: Even though they compressed the data so much, the video looks almost identical to the original. In fact, because they can now keep more history in the memory (since it's smaller), the video stays consistent for longer. The characters don't change faces, and the background doesn't warp.
Fast: It only slows down the video generation by about 4%, which is barely noticeable.
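The headline numbers can be sanity-checked with a few lines of arithmetic. The sketch below assumes the uncompressed cache stores 16-bit values (the technical summary later names BF16 as the baseline); the variable names are illustrative only.

```python
BASELINE_BITS = 16       # assumed 16-bit (BF16) storage per cached value
QUANT_BITS = 2           # QVG's target bit-width
baseline_gb = 34.0       # cache size before compression, as quoted above
reported_ratio = 7.0     # compression ratio, as quoted above

ideal_ratio = BASELINE_BITS / QUANT_BITS         # ratio if only 2-bit codes were stored
compressed_gb = baseline_gb / reported_ratio     # size implied by the reported ratio
effective_bits = BASELINE_BITS / reported_ratio  # average bits spent per cached value

print(f"ideal ratio:       {ideal_ratio:.1f}x")
print(f"compressed cache:  {compressed_gb:.1f} GB")
print(f"effective bits:    {effective_bits:.2f} per value")
```

Going from 16 bits to 2 bits would give 8× if nothing else were stored. The reported 7× implies a little over 2 bits per value on average. The extra is plausibly side information such as group averages and scale factors, though that attribution is an assumption here rather than something the summary states.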

The Bottom Line

Imagine you want to send a 1-hour movie to a friend, but your mailbox is too small to fit the DVD.

Old AI: Tries to mail the whole DVD, fails, and gives up.
Quant VideoGen: Takes the movie, realizes most of the frames are just "blue sky" or "green grass," groups them together, and sends a tiny, compressed file that reconstructs the movie almost perfectly on the other end.

This breakthrough means we can finally run powerful, long-duration video generators on regular computers, opening the door for AI to create movies, interactive games, and world simulations that were previously impossible.

1. Problem Statement

The paper addresses a critical bottleneck in auto-regressive video diffusion models: the KV-cache memory explosion.

The Bottleneck: Unlike bidirectional attention models (which output only after full denoising), auto-regressive models generate frames sequentially, requiring the retention of a growing Key-Value (KV) cache to maintain temporal consistency. For high-resolution, long-horizon generation (e.g., 5-minute videos), the KV-cache can consume >30 GB of GPU memory, exceeding the capacity of consumer GPUs (e.g., RTX 5090) and even some data-center GPUs.
The Capability Gap: To fit within memory limits, current systems truncate the context window (e.g., to 20–60 seconds). This truncation forces the model to "forget" early frames, leading to long-horizon drift where identity, layout, and motion consistency degrade over time.
The Challenge of Naive Quantization: Existing KV-cache quantization techniques designed for Large Language Models (LLMs) fail when applied directly to video. Video models exhibit highly heterogeneous activation distributions across tokens and channels, with extreme outliers that cause severe quality degradation when quantized to low bits (e.g., 2-bit).
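The effect of outliers on low-bit quantization can be seen in a toy example. The sketch below uses plain min-max round-to-nearest quantization to four levels; the function names and the sample values are hypothetical and are not taken from the paper or from any of the baselines it compares against.

```python
def quantize_2bit(values):
    """Round-to-nearest uniform quantization to 4 levels (2 bits), then dequantize."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # 4 levels span 3 steps; guard against constant input
    codes = [round((v - lo) / scale) for v in values]  # integer codes in 0..3
    return [lo + c * scale for c in codes]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical channel with small values clustered near zero.
typical = [-0.3, -0.1, 0.0, 0.1, 0.2, 0.3, -0.2, 0.15]
# The same values plus one large outlier, which stretches the quantization range.
with_outlier = typical + [12.0]

err_typical = mean_abs_error(typical, quantize_2bit(typical))
# Measure error only on the original small values so the comparison is like for like.
err_outlier = mean_abs_error(typical, quantize_2bit(with_outlier)[:len(typical)])

print(f"error without outlier: {err_typical:.3f}")
print(f"error with outlier:    {err_outlier:.3f}")
```

With only four levels available, a single outlier widens the step size so much that all the small values collapse onto the same level, which is the failure mode described above.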

2. Methodology: Quant VideoGen (QVG)

The authors propose Quant VideoGen (QVG), a training-free framework that exploits the unique spatiotemporal redundancy of video data to enable 2-bit KV-cache quantization without significant quality loss. The framework consists of two core components:

A. Semantic-Aware Smoothing (SAS)

This step transforms the irregular, outlier-heavy KV-cache distribution into a quantization-friendly format.

Semantic Grouping: Instead of quantizing tokens individually or in fixed blocks, QVG uses k-means clustering along the sequence length axis to group tokens with similar latent representations (semantic similarity).
Centroid Subtraction: For each group, the algorithm calculates the centroid (mean value) and subtracts it from the group's tokens.
- Effect: This removes the large-magnitude outliers (which are captured by the centroid) and leaves behind residuals with much smaller magnitudes and a more homogeneous distribution.
- Result: The residuals are significantly easier to quantize with low bit-widths, reducing the quantization error by approximately 6.9× for Keys and 2.6× for Values.
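The grouping and centroid-subtraction steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a small hand-written k-means with farthest-point initialization, per-token min-max 2-bit quantization, and synthetic "sky" and "tree" tokens. All names and values are hypothetical.

```python
import random


def quantize_2bit(values):
    """Uniform round-to-nearest quantization to 4 levels, returned dequantized."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # guard against a constant vector
    return [lo + round((v - lo) / scale) * scale for v in values]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=10):
    """Lloyd's k-means with farthest-point init; returns (centroids, labels)."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def sas_quantize(tokens, k):
    """Group tokens, subtract each group's centroid, quantize only the residuals."""
    centroids, labels = kmeans(tokens, k)
    recon = []
    for tok, lab in zip(tokens, labels):
        c = centroids[lab]
        residual = [x - y for x, y in zip(tok, c)]
        recon.append([y + r for y, r in zip(c, quantize_2bit(residual))])
    return recon


def mean_abs_error(a, b):
    n = sum(len(row) for row in a)
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


# Synthetic tokens: two "semantic" groups with large, different base values.
rng = random.Random(0)
sky, tree = [8.0, -6.0, 0.5, 3.0], [-7.0, 5.0, -0.5, -4.0]
tokens = [[v + rng.gauss(0, 0.1) for v in base] for base in [sky] * 20 + [tree] * 20]

err_naive = mean_abs_error(tokens, [quantize_2bit(t) for t in tokens])
err_sas = mean_abs_error(tokens, sas_quantize(tokens, k=2))
print(f"naive 2-bit error: {err_naive:.3f}")
print(f"with smoothing:    {err_sas:.3f}")
```

On this toy data the residuals span a much narrower range than the raw tokens, so the same four quantization levels cover them far more finely. The size of the improvement here is an artifact of the synthetic data and should not be read against the 6.9× and 2.6× figures above.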

B. Progressive Residual Quantization (PRQ)

Inspired by streaming video codecs, this module further compresses the residuals in a coarse-to-fine manner.

Multi-Stage Refinement: The process is iterative. After the first round of SAS and quantization, the remaining error (residual) is treated as a new input.
Iterative Compression: The algorithm repeats the grouping and centroid subtraction process on the residuals for $T$ stages.
Trade-off Control: By varying the number of stages ($T$), users can flexibly trade off between compression ratio and reconstruction quality.
Reconstruction: During inference, the model dequantizes the residuals and adds back the stored centroids (and previous stage residuals) to reconstruct the original KV-cache values.
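One way to read the multi-stage description above is sketched below: each stage clusters its input, subtracts centroids, quantizes what is left, and passes the leftover error to the next stage. This is an interpretation for illustration only; the exact staging in the paper may differ, and the helper names, the toy data, and the per-token min-max quantizer are all assumptions.

```python
import random


def quantize_2bit(values):
    """Uniform round-to-nearest quantization to 4 levels, returned dequantized."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # guard against a constant vector
    return [lo + round((v - lo) / scale) * scale for v in values]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=10):
    """Lloyd's k-means with farthest-point init; returns (centroids, labels)."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def prq_reconstruct(tokens, k, stages):
    """Reconstruction obtained after `stages` rounds of smooth-then-quantize."""
    current = tokens  # input to the current stage
    recon = [[0.0] * len(t) for t in tokens]
    for _ in range(stages):
        centroids, labels = kmeans(current, k)
        next_input = []
        for i, (vec, lab) in enumerate(zip(current, labels)):
            smoothed = sub(vec, centroids[lab])
            approx = add(centroids[lab], quantize_2bit(smoothed))  # this stage's contribution
            recon[i] = add(recon[i], approx)
            next_input.append(sub(vec, approx))  # leftover error feeds the next stage
        current = next_input
    return recon


def mean_abs_error(a, b):
    n = sum(len(row) for row in a)
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


rng = random.Random(0)
sky, tree = [8.0, -6.0, 0.5, 3.0], [-7.0, 5.0, -0.5, -4.0]
tokens = [[v + rng.gauss(0, 0.1) for v in base] for base in [sky] * 20 + [tree] * 20]

errors = {T: mean_abs_error(tokens, prq_reconstruct(tokens, k=2, stages=T)) for T in (1, 2, 3)}
for T, err in errors.items():
    print(f"T={T}: mean abs error {err:.5f}")
```

Each extra stage stores another set of centroids and 2-bit codes, so the reconstruction error shrinks while the memory cost grows. That is the trade-off the stage count $T$ controls.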

C. System-Algorithm Co-Design

To ensure low latency during streaming inference:

Streaming Centroid Caching: Instead of running k-means from scratch for every new video chunk, the system initializes centroids based on the previous chunk's assignment, reducing k-means overhead by 3×.
Fused Kernels: A custom CUDA/Triton kernel fuses dequantization and centroid addition, keeping intermediate results in registers to minimize global memory access.
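The centroid-caching idea can be illustrated with a warm-started k-means. In this sketch the previous chunk's centroids are passed in directly as the starting point, which is a simplification of the assignment-based initialization described above. The function names, the synthetic chunks, and the naive "first k tokens" cold start are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(tokens, init_centroids, max_iters=20):
    """Lloyd's k-means from the given starting centroids.

    Returns (centroids, labels, iterations_used). Stops as soon as an
    assignment pass leaves every label unchanged.
    """
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for iteration in range(1, max_iters + 1):
        new_labels = [min(range(k), key=lambda j: sq_dist(t, centroids[j])) for t in tokens]
        if new_labels == labels:
            return centroids, labels, iteration
        labels = new_labels
        for j in range(k):
            members = [t for t, lab in zip(tokens, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels, max_iters


def make_chunk(rng):
    """Synthetic chunk: 20 'sky' tokens followed by 20 'tree' tokens."""
    sky, tree = [8.0, -6.0, 0.5, 3.0], [-7.0, 5.0, -0.5, -4.0]
    return [[v + rng.gauss(0, 0.1) for v in base] for base in [sky] * 20 + [tree] * 20]


rng = random.Random(0)
chunk1 = make_chunk(rng)
chunk2 = make_chunk(rng)  # the next chunk looks similar, as adjacent video chunks often do

# Cluster the first chunk from a naive cold start (its first two tokens).
centroids1, _, _ = kmeans(chunk1, init_centroids=chunk1[:2])

# Second chunk: cold start versus warm start from the cached centroids.
_, _, cold_iters = kmeans(chunk2, init_centroids=chunk2[:2])
_, _, warm_iters = kmeans(chunk2, init_centroids=centroids1)

print(f"cold start iterations: {cold_iters}")
print(f"warm start iterations: {warm_iters}")
```

Because consecutive chunks tend to contain similar content, the cached centroids are already close to the right answer and the warm-started run has little left to do. The toy iteration counts are not meant to reproduce the 3× overhead reduction reported above.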

3. Key Contributions

First Training-Free 2-Bit KV Quantization for Video: QVG is the first framework to successfully enable 2-bit KV-cache quantization for auto-regressive video diffusion models without retraining.
Spatiotemporal Redundancy Exploitation: It introduces Semantic-Aware Smoothing, a novel technique that leverages the high similarity between adjacent video tokens (spatially and temporally) to smooth activation distributions, solving the heterogeneity problem that plagues LLM-based quantization methods.
Progressive Residual Quantization: A multi-stage compression scheme that allows for flexible quality-memory trade-offs, pushing the Pareto frontier of video generation.
Hardware Efficiency: The method reduces KV-cache memory footprint by up to 7.0× with minimal latency overhead (about 4% at worst), enabling long-video generation on consumer hardware (e.g., running an 8B model on a single RTX 4090).

4. Experimental Results

The authors evaluated QVG on three state-of-the-art models: LongCat-Video-13B, HY-WorldPlay-8B, and Self-Forcing-Wan-1.3B.

Memory Compression:
- Achieved up to 7.05× compression (INT2) and 3.75× (INT4) compared to BF16 baselines.
- Reduced KV-cache from ~34 GB to ~5 GB for a 5-second 480p video generation task.
Video Quality:
- PSNR: QVG achieved 28.7 PSNR (LongCat-Video) and 29.1 PSNR (HY-WorldPlay) at 2-bit quantization, significantly outperforming baselines like KIVI and RTN (which dropped to ~20–24 PSNR).
- Consistency: On long-horizon generation (up to 700 frames), QVG maintained near-lossless quality, whereas baselines suffered severe drift after ~100 frames.
- Metrics: Outperformed all baselines on VBench metrics (Background Consistency, Subject Consistency, Aesthetic Quality), and stayed near-lossless on fidelity metrics (e.g., SSIM > 0.9, LPIPS < 0.07).
Efficiency:
- Latency: End-to-end generation time increased by only 1.5%–4.3%, making it practical for real-time applications.
- Deployment: Enabled the first successful deployment of HY-WorldPlay-8B on a single RTX 4090, a task previously impossible due to memory constraints.

5. Significance

This work bridges the gap between algorithmic capability and system constraints in video generation.

Enabling Long-Horizon Generation: By solving the memory bottleneck, QVG allows models to retain full context history, directly improving the consistency and coherence of minute-level or hour-level video generation.
Democratization of Video AI: By reducing memory requirements by 7×, it makes high-quality, long-form video generation feasible on consumer-grade GPUs, lowering the barrier to entry for developers and researchers.
New Paradigm for Video Compression: The approach of exploiting spatiotemporal redundancy via semantic grouping offers a new direction for compressing not just video data, but the internal representations of generative models.

