S²Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

The paper proposes S²Q-VDiT, a post-training quantization framework for video diffusion transformers. It achieves lossless performance under W4A6 quantization by combining Hessian-aware salient data selection with attention-guided sparse token distillation, overcoming the calibration variance and optimization challenges caused by long token sequences.

Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu

Published 2026-03-10

Imagine you have a super-talented, world-class chef (the AI model) who can cook up incredibly realistic videos of anything you can imagine—from a panda surfing at sunset to a robot DJ in Tokyo. This chef is amazing, but they are also huge. They carry a massive kitchen full of ingredients (billions of parameters), and cooking a single dish takes a long time and requires a giant, expensive stove (lots of computer power and memory).

Most people can't afford this giant kitchen. They want to put this chef in a small, portable lunchbox (like a smartphone or a standard laptop) so they can cook on the go.

The problem? If you try to shrink the chef's massive recipe book down to fit in a lunchbox, the food usually tastes terrible. The flavors get muddy, the textures disappear, and the video looks like a blurry mess. This is what happens when we try to "compress" these video AI models.

Enter S²Q-VDiT, the paper's new solution. Think of it as a master chef's assistant who knows exactly how to pack the lunchbox without ruining the meal. Here's how they do it, using two simple tricks:

1. The "Smart Shopping List" (Salient Data Selection)

Usually, when you try to shrink a recipe, you just grab a random handful of ingredients to test the new, smaller version. But for video AI, the "ingredients" (data samples) are huge and complex. If you pick the wrong ones, the whole lunchbox fails.

The authors realized that not all ingredients are created equal. Some moments in a video are boring (just a static sky), while others are critical (a sudden explosion or a character's face changing expression).

  • The Old Way: Picking random ingredients to test the shrinkage.
  • The S²Q-VDiT Way: They use a "Hessian-aware" scanner (a fancy math tool) to find the most important moments. They ask: "Which of these video frames will break the model if we shrink it?" and "Which frames teach the model the most?"
  • The Analogy: Instead of tasting 100 random spoonfuls of soup to see if it's salty, they only taste the one spoonful that has the most salt and the most flavor. This ensures the "shrunken" recipe is perfect because it was tested on the most critical parts.
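Stripping the analogy away, the selection step can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual algorithm: the function name and the diagonal-Fisher proxy (sum of squared gradients as a stand-in for Hessian sensitivity) are my assumptions. The idea it captures is the real one, though: score every candidate calibration sample by how sensitive the model is to it, then keep only the top scorers.

```python
import numpy as np

def select_salient_samples(grads, k):
    """Rank calibration samples by a diagonal-Fisher proxy for Hessian
    sensitivity (sum of squared gradients) and keep the top-k indices.
    NOTE: illustrative stand-in, not the paper's exact criterion."""
    scores = np.array([float(np.sum(g ** 2)) for g in grads])
    top = np.argsort(scores)[::-1][:k]          # most "salient" first
    return sorted(top.tolist())

# Toy example: 5 "samples" whose gradients have very different magnitudes.
# Samples 3 and 1 carry the most sensitivity, so they get picked.
grads = [np.full(16, s) for s in (0.1, 2.0, 0.5, 3.0, 0.2)]
print(select_salient_samples(grads, k=2))  # → [1, 3]
```

The payoff is a tiny calibration set that still exercises the layers where quantization hurts the most, instead of a random grab-bag of frames.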

2. The "Spotlight on the Stars" (Sparse Token Distillation)

Video AI models break a video down into thousands of tiny pieces called "tokens" (like pixels, but for time and space). When the model looks at a video, it pays attention to all of them. But here's the secret: The model only really cares about a few of them.

Imagine a movie scene with 1,000 people in the background and one main actor in the foreground. The model spends 90% of its energy on the main actor and barely glances at the crowd.

  • The Old Way: When shrinking the model, the old methods treated every single person in the crowd and the main actor equally. They tried to compress the background crowd just as hard as the main actor, wasting effort and ruining the focus.
  • The S²Q-VDiT Way: They look at the model's "attention map" (a spotlight) and realize, "Hey, only the top 10% of these tokens actually matter!"
  • The Analogy: Instead of trying to shrink the whole stadium equally, they put a spotlight on the main actor. They say, "We will keep the main actor's details crystal clear, but we can safely blur out the crowd in the back because nobody is looking at them anyway." This allows them to shrink the model massively without losing the quality of the important parts.
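The "spotlight" trick can likewise be sketched concretely. Again, this is a minimal illustration under my own assumptions (function name, mean-squared-error loss, and `keep_ratio=0.1` are placeholders, not the paper's exact setup): the quantized student is only asked to match the full-precision teacher on the tokens the attention map lights up.

```python
import numpy as np

def sparse_token_distill_loss(teacher, student, attn, keep_ratio=0.1):
    """Attention-guided sparse token distillation (illustrative sketch).

    teacher/student: (num_tokens, dim) features from the full-precision
    and quantized models; attn: per-token attention mass. Only the top
    `keep_ratio` fraction of tokens contributes to the loss."""
    num_tokens = teacher.shape[0]
    k = max(1, int(num_tokens * keep_ratio))
    salient = np.argsort(attn)[::-1][:k]        # the "main actors"
    diff = teacher[salient] - student[salient]
    return float(np.mean(diff ** 2))            # match only those tokens

# Toy example: 100 tokens, 8-dim features; the "student" is a slightly
# noisy copy of the teacher, and only the top 10% of tokens are matched.
rng = np.random.default_rng(1)
teacher = rng.normal(size=(100, 8))
student = teacher + 0.01 * rng.normal(size=(100, 8))
attn = rng.random(100)
loss = sparse_token_distill_loss(teacher, student, attn, keep_ratio=0.1)
print(loss)
```

Because the background "crowd" tokens are dropped from the loss entirely, the optimization budget during quantization is spent where viewers will actually notice.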

The Result?

By using these two tricks, the authors shrank the model's weights to 4 bits and its activations to 6 bits (the W4A6 setting), roughly a 4x smaller model (fitting a giant into a lunchbox) that runs about 1.3x faster, all while keeping the video quality looking almost identical to the original giant version.

In short:
They didn't just throw away half the ingredients; they picked the best ingredients and focused only on the stars of the show. This allows us to run super-smart video AI on devices that previously couldn't handle them, making high-quality video generation accessible to everyone, not just those with massive supercomputers.