S²Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

The paper proposes S²Q-VDiT, a post-training quantization framework for video diffusion transformers. It achieves lossless performance under W4A6 quantization by combining Hessian-aware salient data selection with attention-guided sparse token distillation, overcoming the calibration variance and optimization challenges caused by long token sequences.

Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu

Published 2026-03-10

Imagine you have a super-talented, world-class chef (the AI model) who can cook up incredibly realistic videos of anything you can imagine—from a panda surfing at sunset to a robot DJ in Tokyo. This chef is amazing, but they are also huge. They carry a massive kitchen full of ingredients (billions of parameters), and cooking a single dish takes a long time and requires a giant, expensive stove (lots of computer power and memory).

Most people can't afford this giant kitchen. They want to put this chef in a small, portable lunchbox (like a smartphone or a standard laptop) so they can cook on the go.

The problem? If you try to shrink the chef's massive recipe book down to fit in a lunchbox, the food usually tastes terrible. The flavors get muddy, the textures disappear, and the video looks like a blurry mess. This is what happens when we try to "compress" these video AI models.

Enter S²Q-VDiT, the paper's new solution. Think of it as a master chef's assistant who knows exactly how to pack the lunchbox without ruining the meal. Here's how they do it, using two simple tricks:

1. The "Smart Shopping List" (Salient Data Selection)

Usually, when you try to shrink a recipe, you just grab a random handful of ingredients to test the new, smaller version. But for video AI, the "ingredients" (data samples) are huge and complex. If you pick the wrong ones, the whole lunchbox fails.

The authors realized that not all ingredients are created equal. Some moments in a video are boring (just a static sky), while others are critical (a sudden explosion or a character's face changing expression).

  • The Old Way: Picking random ingredients to test the shrinkage.
  • The S²Q-VDiT Way: They use a "Hessian-aware" scanner (a fancy math tool) to find the most important moments. They ask: "Which of these video frames will break the model if we shrink it?" and "Which frames teach the model the most?"
  • The Analogy: Instead of tasting 100 random spoonfuls of soup to see if it's salty, they only taste the one spoonful that has the most salt and the most flavor. This ensures the "shrunken" recipe is perfect because it was tested on the most critical parts.
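Stripping the analogy away, the selection step can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual algorithm: the function name and the diagonal-Fisher proxy (sum of squared gradients as a stand-in for Hessian sensitivity) are my assumptions. The idea it captures is the real one, though: score every candidate calibration sample by how sensitive the model is to it, then keep only the top scorers.

```python
import numpy as np

def select_salient_samples(grads, k):
    """Rank calibration samples by a diagonal-Fisher proxy for Hessian
    sensitivity (sum of squared gradients) and keep the top-k indices.
    NOTE: illustrative stand-in, not the paper's exact criterion."""
    scores = np.array([float(np.sum(g ** 2)) for g in grads])
    top = np.argsort(scores)[::-1][:k]          # most "salient" first
    return sorted(top.tolist())

# Toy example: 5 "samples" whose gradients have very different magnitudes.
# Samples 3 and 1 carry the most sensitivity, so they get picked.
grads = [np.full(16, s) for s in (0.1, 2.0, 0.5, 3.0, 0.2)]
print(select_salient_samples(grads, k=2))  # → [1, 3]
```

The payoff is a tiny calibration set that still exercises the layers where quantization hurts the most, instead of a random grab-bag of frames.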

2. The "Spotlight on the Stars" (Sparse Token Distillation)

Video AI models break a video down into thousands of tiny pieces called "tokens" (like pixels, but for time and space). When the model looks at a video, it pays attention to all of them. But here's the secret: The model only really cares about a few of them.

Imagine a movie scene with 1,000 people in the background and one main actor in the foreground. The model spends 90% of its energy on the main actor and barely glances at the crowd.

  • The Old Way: When shrinking the model, the old methods treated every single person in the crowd and the main actor equally. They tried to compress the background crowd just as hard as the main actor, wasting effort and ruining the focus.
  • The S²Q-VDiT Way: They look at the model's "attention map" (a spotlight) and realize, "Hey, only the top 10% of these tokens actually matter!"
  • The Analogy: Instead of trying to shrink the whole stadium equally, they put a spotlight on the main actor. They say, "We will keep the main actor's details crystal clear, but we can safely blur out the crowd in the back because nobody is looking at them anyway." This allows them to shrink the model massively without losing the quality of the important parts.
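The "spotlight" trick can likewise be sketched concretely. Again, this is a minimal illustration under my own assumptions (function name, mean-squared-error loss, and `keep_ratio=0.1` are placeholders, not the paper's exact setup): the quantized student is only asked to match the full-precision teacher on the tokens the attention map lights up.

```python
import numpy as np

def sparse_token_distill_loss(teacher, student, attn, keep_ratio=0.1):
    """Attention-guided sparse token distillation (illustrative sketch).

    teacher/student: (num_tokens, dim) features from the full-precision
    and quantized models; attn: per-token attention mass. Only the top
    `keep_ratio` fraction of tokens contributes to the loss."""
    num_tokens = teacher.shape[0]
    k = max(1, int(num_tokens * keep_ratio))
    salient = np.argsort(attn)[::-1][:k]        # the "main actors"
    diff = teacher[salient] - student[salient]
    return float(np.mean(diff ** 2))            # match only those tokens

# Toy example: 100 tokens, 8-dim features; the "student" is a slightly
# noisy copy of the teacher, and only the top 10% of tokens are matched.
rng = np.random.default_rng(1)
teacher = rng.normal(size=(100, 8))
student = teacher + 0.01 * rng.normal(size=(100, 8))
attn = rng.random(100)
loss = sparse_token_distill_loss(teacher, student, attn, keep_ratio=0.1)
print(loss)
```

Because the background "crowd" tokens are dropped from the loss entirely, the optimization budget during quantization is spent where viewers will actually notice.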

The Result?

By using these two tricks, the authors shrank the model's weights to 4 bits and its activations to 6 bits (the W4A6 setting), roughly a 4x smaller model (fitting a giant into a lunchbox) that runs about 1.3x faster, all while keeping the video quality looking almost identical to the original giant version.

In short:
They didn't just throw away half the ingredients; they picked the best ingredients and focused only on the stars of the show. This allows us to run super-smart video AI on devices that previously couldn't handle them, making high-quality video generation accessible to everyone, not just those with massive supercomputers.