DVD-Quant: Data-free Video Diffusion Transformers Quantization

Imagine you have a Hollywood-level movie studio inside your computer. This studio, called a "Video Diffusion Transformer" (or DiT), is incredibly talented at creating realistic videos from text descriptions. However, it's also a gluttonous beast. It eats up massive amounts of memory and takes a long time to cook up a single video, making it impractical to run on regular laptops or phones.

To fix this, scientists usually try to "shrink" the studio's brain (the model) by simplifying its math. This is called quantization. Think of it like translating a complex novel into a comic book: you keep the story, but you use fewer words and simpler drawings.

The problem? Most existing methods of shrinking these video studios are like trying to pack a suitcase by just throwing things in randomly.

They need a rehearsal period (calibration) where they watch hours of sample videos just to figure out how to shrink the model. This takes forever.
If they shrink the model too much (like going from a 4K movie to a blurry 144p), the video turns into static noise or a distorted mess.

Enter DVD-Quant (Data-free Video Diffusion Transformers Quantization). Think of DVD-Quant as a master packer who can shrink the studio's brain without needing a rehearsal, and without ruining the movie quality.

Here is how DVD-Quant works, using three clever tricks:

1. The "Smart Ruler" (Bounded-init Grid Refinement)

The Problem: Imagine you are measuring ingredients for a cake. Most people use a ruler that measures from 0 to 100 inches. But what if your ingredients are all tiny, clustered around the 1-inch mark? Using a 0-100 ruler wastes space and makes your measurements imprecise.
The DVD-Quant Solution: Instead of using a fixed ruler, DVD-Quant uses a smart, adjustable ruler.

It starts with a rough guess of where the ingredients are.
Then, it iteratively tightens the ruler's range, zooming in on the specific area where the important numbers live.
The Result: It captures the "flavor" of the video perfectly, even when the numbers are tiny, without needing to look at a sample video first.

2. The "Dynamic Camera" (Auto-scaling Rotated Quantization)

The Problem: Video generation is a process of "denoising"—starting with static snow and slowly revealing a clear image. The "loudness" (scale) of the data changes wildly at every single step.

Analogy: Imagine trying to take a photo of a concert. In the beginning, it's dark and quiet. Then, the band starts playing loud rock. If you set your camera's exposure once at the start, the beginning will be too dark, and the end will be blown out (white).
The DVD-Quant Solution: DVD-Quant acts like a smart camera that adjusts its settings in real-time.
Instead of pre-setting the exposure based on a rehearsal (calibration), it looks at the current frame and instantly adjusts the "volume" (scaling) and rotates the data to smooth out the loud spikes.
The Result: It handles the wild changes in the video process perfectly, keeping the picture clear from the first frame to the last, with zero rehearsal time.

3. The "Traffic Cop" (δ-Guided Bit Switching)

The Problem: Not every step of the denoising process is equally important.

Analogy: In a movie, there are boring scenes where nothing happens (a character walking slowly) and action scenes where everything explodes. If you spend the same amount of "computing power" on the boring walk as you do on the explosion, you are wasting energy.
The DVD-Quant Solution: DVD-Quant acts as a smart traffic cop.
It monitors how much the video-in-progress changes from one denoising step to the next. If little is changing (the boring scene), it says, "Okay, let's use a low-precision (4-bit) setting to save energy."
If the change is large (the car crash), it immediately switches to a higher-precision (8-bit) setting to capture the details.
The Result: It saves substantial time and compute by using high precision only when it's absolutely necessary.

The Grand Finale: What Does It Achieve?

Before DVD-Quant, trying to shrink a video AI to its smallest size (4-bit weights and 4-bit activations) was like trying to run a Ferrari on a bicycle chain—it just broke. The videos became unrecognizable noise.

DVD-Quant changed the game:

Speed: It makes video generation 2x faster.
Memory: It shrinks the memory needed by nearly 4x.
Quality: It is the first method to successfully run these models at W4A4 (4-bit weights, 4-bit activations) without the video quality falling apart. The videos look almost as good as those from the original, giant, slow version.

In short: DVD-Quant is the magic key that unlocks high-quality video generation on everyday devices, turning a supercomputer-sized studio into something that fits in your pocket, all without needing to "practice" first.

1. Problem Statement

Diffusion Transformers (DiTs) have become the state-of-the-art architecture for high-fidelity video generation (e.g., Sora, HunyuanVideo). However, their deployment is hindered by massive computational and memory demands due to:

Iterative Nature: Video generation requires 50–100 denoising steps, each processing long sequences.
Quantization Challenges: Existing Post-Training Quantization (PTQ) methods face two critical limitations when applied to Video DiTs:
1. Calibration Dependence: Most methods rely on heavy, inflexible calibration procedures using datasets to determine scaling factors. These fail to adapt to the timestep-dependent variations in activation distributions inherent in diffusion models.
2. Performance Collapse: Aggressive quantization (specifically W4A4, i.e., 4-bit weights and 4-bit activations) causes severe quality degradation. Baseline methods often fail completely or suffer >27% drops in visual quality metrics (e.g., VBench scores) at W4A4.

2. Methodology: DVD-Quant

The authors propose DVD-Quant, a comprehensive, data-free quantization framework designed specifically for Video DiTs. It addresses the unique characteristics of DiTs (Gaussian-like weight distributions, timestep-varying activations, and latent feature evolution) through three core innovations:

A. Bounded-init Grid Refinement (BGR) for Weights

Insight: DiT weights follow a Gaussian-like distribution. Standard MinMax quantization allocates excessive bins to outlier regions (tails) and creates suboptimal spacing around the zero-mean concentration, leading to high quantization error.
Mechanism:
- Bounded Initialization: Instead of using the full range of weights, the method starts with a "bounded search" that progressively clips outliers to find a tighter initial range.
- Iterative Grid Refinement: It treats quantization as an optimization problem. Starting from the bounded initialization, it iteratively refines the quantization step size ($\Delta$) and zero-point ($z$) to minimize the reconstruction error ($\|W - \Delta \odot (W_q - z)\|_F$).
- Result: This closed-form iterative approach significantly reduces quantization error for Gaussian-distributed weights without requiring gradient descent or calibration data.
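
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the clipping-ratio search in `bounded_init`, and the alternating least-squares update in `refine` are choices made for this sketch. It uses the same dequantization form as the text, $\Delta (W_q - z)$, on a single 1-D weight group with a real-valued zero-point, and assumes the weights span both negative and positive values.

```python
import numpy as np

def quantize(w, delta, z, qmax):
    """Nearest integer code in [0, qmax] on the grid delta * (q - z)."""
    return np.clip(np.round(w / delta + z), 0, qmax)

def recon_error(w, delta, z, qmax):
    """L2 norm of W - delta * (W_q - z) after re-quantizing on this grid."""
    q = quantize(w, delta, z, qmax)
    return np.linalg.norm(w - delta * (q - z))

def minmax_grid(lo, hi, qmax):
    """Step size and zero-point that map [lo, hi] onto codes 0..qmax."""
    delta = (hi - lo) / qmax
    return delta, -lo / delta

def bounded_init(w, qmax, ratios=None):
    """Hypothetical bounded search: shrink the min/max range, keep the best."""
    if ratios is None:
        ratios = np.linspace(1.0, 0.3, 36)  # 1.0 is plain MinMax
    best = None
    for r in ratios:
        delta, z = minmax_grid(r * w.min(), r * w.max(), qmax)
        err = recon_error(w, delta, z, qmax)
        if best is None or err < best[0]:
            best = (err, delta, z)
    return best[1], best[2]

def refine(w, delta, z, qmax, iters=20):
    """Alternate between re-assigning codes and a closed-form grid update.

    With codes q fixed, W ~ a*q + b is a 1-D least-squares fit, so the
    update needs no gradients: delta = a and z = -b / a.
    """
    for _ in range(iters):
        q = quantize(w, delta, z, qmax)
        q_centered = q - q.mean()
        var = np.dot(q_centered, q_centered)
        if var == 0:
            break  # all weights share one code; nothing to fit
        a = np.dot(q_centered, w - w.mean()) / var
        if a <= 0:
            break  # degenerate fit; keep the previous grid
        b = w.mean() - a * q.mean()
        delta, z = a, -b / a
    return delta, z

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)  # synthetic Gaussian-like weights
qmax = 2 ** 4 - 1                      # 4-bit unsigned codes

d_mm, z_mm = minmax_grid(w.min(), w.max(), qmax)
d0, z0 = bounded_init(w, qmax)
d1, z1 = refine(w, d0, z0, qmax)

err_minmax = recon_error(w, d_mm, z_mm, qmax)
err_init = recon_error(w, d0, z0, qmax)
err_bgr = recon_error(w, d1, z1, qmax)
```

Because the search includes the unclipped MinMax range and each refinement step is a least-squares fit followed by nearest-code assignment, the error in this sketch never increases from `err_minmax` to `err_init` to `err_bgr`. How large the reduction is depends on the weight distribution and bit-width.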

B. Auto-scaling Rotated Quantization (ARQ) for Activations

Insight: Activation scales in DiTs vary drastically across denoising timesteps. Offline calibration (pre-scaling) cannot capture this dynamic range, and standard rotation methods (like QuaRot) can inadvertently amplify errors or introduce high latency.
Mechanism:
- Hadamard Rotation: Applies a fast Hadamard transform to both activations and weights to suppress outliers and distribute values evenly across channels.
- Online Scaling: Unlike pre-scaling methods that transfer scaling factors to weights, ARQ computes per-channel scaling factors online during inference based on the current timestep's activation statistics.
- Data-Free: This eliminates the need for a calibration dataset, allowing the model to adapt dynamically to the specific input prompt and timestep.
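
The rotation-plus-online-scaling idea can be illustrated with a small NumPy sketch. It rests on several assumptions that are not from the source: a dense orthonormal Hadamard matrix stands in for a fast Hadamard transform, the scale is computed per row of the current activation tensor (the summary describes per-channel factors, and the exact granularity is not reproduced here), weights are left unquantized to isolate the activation path, and all function names are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction, n a power of 2)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def dynamic_quant(x, bits):
    """Symmetric fake-quantization; scales come from x itself at call time,
    so no calibration data is involved."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def rotated_linear(x, w, bits=4):
    """Compute x @ w.T with rotated, dynamically quantized activations."""
    h = hadamard(x.shape[1])
    x_rot = x @ h   # rotation spreads outlier channels across all channels
    w_rot = w @ h   # could be precomputed offline, since w is fixed
    return dynamic_quant(x_rot, bits) @ w_rot.T

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 64))
x[:, 3] += 100.0                 # one synthetic outlier channel
w = rng.normal(size=(32, 64))

h = hadamard(64)
y_ref = x @ w.T
y_rot = (x @ h) @ (w @ h).T      # same product, since h @ h.T is the identity
y_q = rotated_linear(x, w, bits=4)
```

Since the Hadamard matrix is orthogonal, rotating both operands leaves the full-precision product unchanged, while the outlier channel is spread across all 64 channels, which shrinks the range the quantizer has to cover.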

C. $\delta$-Guided Bit Switching ($\delta$-GBS) for Adaptive Precision

Insight: Not all denoising timesteps are equally critical. Some steps involve marginal feature changes (redundant), while others involve significant transformations (critical).
Mechanism:
- Feature Tracking: The method monitors the normalized $L_1$ distance between latent features of consecutive timesteps ($L_1(F, t) = \|F_t - F_{t-1}\|_1 / \|F_{t-1}\|_1$).
- Adaptive Switching:
  - If cumulative feature change is below a threshold $\delta$, the model switches to low-bit precision (e.g., 4-bit) to save compute.
  - If the change exceeds $\delta$, it switches to high-bit precision (e.g., 8-bit) to preserve critical details.
- Benefit: This enables mixed-precision inference (e.g., W4A6) that optimizes bit allocation based on content complexity, incurring negligible overhead.
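
The switching rule can be sketched as a small scheduling function. This is a simulation over an already recorded sequence of latent features, written under assumptions the source does not spell out: the first step runs at high precision because it has no predecessor, the accumulated change is reset after each high-precision step, and previous features are nonzero. The names `relative_l1` and `schedule_bits` are hypothetical.

```python
import numpy as np

def relative_l1(curr, prev):
    """Normalized L1 distance ||F_t - F_{t-1}||_1 / ||F_{t-1}||_1."""
    return np.abs(curr - prev).sum() / np.abs(prev).sum()

def schedule_bits(features, delta, low_bits=4, high_bits=8):
    """Pick an activation bit-width per step from accumulated feature change."""
    bits, acc, prev = [], 0.0, None
    for f in features:
        if prev is None:
            bits.append(high_bits)  # assumption: no history at the first step
        else:
            acc += relative_l1(f, prev)
            if acc < delta:
                bits.append(low_bits)   # features barely moved: low precision
            else:
                bits.append(high_bits)  # large change: high precision
                acc = 0.0               # assumption: reset after a high-bit step
        prev = f
    return bits

# Toy trajectory: two small changes, one large jump, then no change.
levels = [1.0, 1.01, 1.02, 2.0, 2.0]
features = [np.full(4, v) for v in levels]
bits = schedule_bits(features, delta=0.05)  # [8, 4, 4, 8, 4]
```

Averaging the chosen bit-widths over all steps gives a fractional effective precision, which is one way a mixed 4-bit and 8-bit schedule can be summarized by a single figure.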

3. Key Contributions

Systematic Analysis: Identified three key characteristics of Video DiTs: Gaussian-like weight distributions, substantial activation scale discrepancies across timesteps, and latent feature variations.
Novel Algorithms:
- BGR: A data-free weight quantization scheme that reduces error by ~88% compared to MinMax.
- ARQ: A calibration-free activation quantization method combining rotation and online scaling.
- $\delta$-GBS: The first adaptive temporal-wise mixed-precision mechanism for video diffusion.
First W4A4 Video DiT: Successfully enabled W4A4 Post-Training Quantization for Video DiTs without compromising video quality, a feat previously unattainable.

4. Experimental Results

Experiments were conducted on HunyuanVideo and Wan2.1 using the VBench benchmark suite.

Quantitative Performance:
- W4A6 (Mixed Precision): DVD-Quant achieves an Imaging Quality of 64.22, nearly matching the full-precision BF16 baseline (64.78) and significantly outperforming the best W4A8 baseline (ViDiT-Q: 59.74).
- W4A4 (Extreme Low Bit): DVD-Quant maintains an Imaging Quality of 61.82 and Aesthetic Quality of 61.96. In contrast, baseline methods (MinMax, SmoothQuant, ViDiT-Q) collapse, with Imaging Quality scores falling to values such as 24.78 and 40.10.
- Speedup: DVD-Quant achieves approximately 2× speedup on advanced DiT models. When combined with caching (TeaCache), the speedup reaches 4.85× for W4A4.
Qualitative Performance:
- Visual comparisons show DVD-Quant preserves fine details (e.g., textures, launch towers) and temporal coherence where baselines produce noise, washed-out textures, or incoherent motion.
Efficiency:
- Memory: Reduces memory usage by 3.68× compared to BF16.
- Latency: Achieves up to 2.12× latency reduction for W4A4.
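
As a rough sanity check on the memory figure, assume (this is a simplification not stated in the source) that memory is dominated by weights stored at one uniform bit-width. Going from 16-bit BF16 to 4-bit weights then gives the ideal ratio, and the reported ratio corresponds to a slightly larger effective bit-width:

$$\frac{16}{4} = 4\times \ \text{(ideal)}, \qquad \frac{16}{3.68} \approx 4.35 \ \text{bits per parameter (effective)}$$

A gap of this size would be consistent with overheads such as stored scales and zero-points, or with some layers kept at higher precision, though the source does not break the figure down.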

5. Significance

Deployment Feasibility: DVD-Quant removes the dependency on heavy calibration datasets, making it a "plug-and-play" solution for deploying large-scale video generation models on resource-constrained hardware (e.g., consumer GPUs).
Breaking the W4A4 Barrier: It proves that 4-bit quantization is viable for video generation, a domain previously thought to require higher precision (W8A8 or W8A4) to maintain quality.
Dynamic Adaptation: By leveraging the temporal nature of diffusion, the method introduces a new paradigm of content-aware, timestep-adaptive quantization, optimizing resources exactly where they are needed.
Open Source: The authors commit to releasing code and models, facilitating further research in efficient video generation.

In summary, DVD-Quant represents a significant leap in making high-fidelity video generation models practical for real-world deployment by solving the critical bottlenecks of calibration overhead and low-bit performance collapse.
