QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification

QuantSparse is a unified framework that combines model quantization and attention sparsification for video diffusion transformers. It introduces Multi-Scale Salient Attention Distillation and Second-Order Sparse Attention Reparameterization to mitigate the information loss that arises when the two are applied together, reducing storage and accelerating inference while substantially outperforming existing baselines in generation quality.

Weilun Feng, Chuanguang Yang, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu

Published 2026-03-10

🎬 The Problem: The "Hollywood Blockbuster" That Won't Fit in Your Pocket

Imagine you have a Hollywood blockbuster movie (a high-quality AI video generator like HunyuanVideo or Wan2.1). This movie is amazing, but it's so huge and heavy that:

  1. It needs a massive warehouse to store the film reels (it takes up 20GB+ of computer memory).
  2. It takes a whole day to project a single scene (it takes nearly an hour to generate a video).

Because of this, you can't run these movies on a normal laptop or phone. They are too expensive and slow for real life.

Scientists have tried two tricks to make these movies smaller and faster:

  • Trick A (Quantization): Compressing the film reel. Instead of using a rich, full-color palette, they use a simpler, lower-quality one (like turning a 16-bit color image into a 4-bit sketch). In the model itself, this means storing each number with far fewer bits. This saves space but can make the picture look grainy or weird.
  • Trick B (Sparsification): Ignoring parts of the movie. They decide that 85% of the pixels are "boring" and just delete them, only keeping the important parts. (Technically, the model skips most of the attention comparisons between parts of the video rather than deleting actual pixels.) This makes it fast, but if you delete too much, the movie looks like it's missing chunks.
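
For readers who like to see the two tricks in code, here is a minimal Python sketch of both. It is an illustration only, not the paper's implementation: the function names (quantize_4bit, sparse_attention_weights), the symmetric rounding scheme, and the "keep the highest scores" rule are assumptions chosen for simplicity. The 15% keep ratio mirrors the "85% deleted" figure above.

```python
import math

def quantize_4bit(values):
    """Hypothetical helper: symmetric uniform quantization to 4-bit codes in [-8, 7]."""
    # One shared scale maps the largest magnitude to code 7 (fall back to 1.0 for all zeros).
    scale = max(abs(v) for v in values) / 7 or 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    dequantized = [c * scale for c in codes]
    return dequantized, codes

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_attention_weights(scores, keep_ratio=0.15):
    """Hypothetical helper: keep only the highest-scoring keys, give the rest weight 0."""
    k = max(1, round(len(scores) * keep_ratio))
    kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    kept_probs = softmax([scores[i] for i in kept])
    weights = [0.0] * len(scores)
    for i, p in zip(kept, kept_probs):
        weights[i] = p
    return weights

# Trick A: four "weights" squeezed into 4-bit codes and expanded back.
approx, codes = quantize_4bit([0.7, -0.35, 0.1, 0.0])
print(codes, approx)

# Trick B: out of 20 attention scores, only 3 (15%) survive.
weights = sparse_attention_weights([i * 0.1 for i in range(20)])
print(sum(1 for w in weights if w > 0))
```

Each trick on its own loses a little information: the quantized values are only close to the originals, and the sparse weights ignore most of the scores entirely.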

The Catch: If you try to do both at the same time (compress the colors AND delete the pixels), the movie falls apart. The graininess from the compression makes the missing chunks look even worse. It's like trying to listen to a radio station that is both static-filled and missing half the broadcast.
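
A toy Python example can show one way the two tricks interfere. This is a hand-built illustration under simple assumptions (a hypothetical quantize helper and a top-1 selection rule), not a measurement from the paper: two scores that were distinguishable before quantization collapse to the same 4-bit value, so the sparsifier ends up keeping a different entry.

```python
def quantize(values, bits=4):
    """Hypothetical helper: symmetric uniform quantization, returned as floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def top_k_indices(scores, k):
    """Indices of the k largest scores; ties keep the lower index first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

scores = [0.50, 0.52, 0.10, 0.05]

kept_exact = top_k_indices(scores, 1)            # [1]: the true winner
kept_quant = top_k_indices(quantize(scores), 1)  # [0]: 0.50 and 0.52 became identical

print(kept_exact, kept_quant)
```

The sparsifier throws away the entry that actually mattered most, and it does so only because quantization blurred the scores first.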

🚀 The Solution: QuantSparse (The "Smart Editor")

The authors of this paper created a new tool called QuantSparse. Think of it as a super-smart film editor that knows exactly how to compress and cut the movie without ruining the story. It uses two special techniques to fix the problems mentioned above.

1. The "Map & Highlight" Trick (Multi-Scale Salient Attention Distillation)

The Problem: When you compress the movie (Trick A), the "map" of where the camera should look gets blurry. The AI gets confused about what is important.

The Solution: The editor uses a two-step guide to keep the AI on track:

  • The Global Map (Low-Res): The editor first looks at a tiny, blurry thumbnail of the whole movie. This helps the AI understand the big picture (e.g., "This is a beach scene"). It's cheap to compute and keeps the structure right.
  • The Highlighter (High-Res): The editor then zooms in on the most important parts of the movie (like a sea turtle swimming or a cliff edge). It tells the AI, "Ignore the boring water, but pay super close attention to the turtle."
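
The two-step guide can be sketched as a training loss that compares the compressed model's attention map (the "student") with the original model's (the "teacher"). This is a minimal Python sketch under stated assumptions, not the paper's formulation: average pooling for the thumbnail, column sums for picking salient tokens, mean squared error, and all function names are illustrative choices.

```python
def avg_pool(attn, factor):
    """Shrink a square attention map by averaging factor x factor blocks (the 'thumbnail')."""
    n = len(attn) // factor
    return [[sum(attn[i * factor + a][j * factor + b]
                 for a in range(factor) for b in range(factor)) / factor ** 2
             for j in range(n)] for i in range(n)]

def mse(a, b):
    """Mean squared error between two equally shaped 2D lists."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

def salient_keys(teacher, k):
    """Pick the k key tokens that receive the most attention in the teacher map."""
    n = len(teacher)
    mass = [sum(teacher[i][j] for i in range(n)) for j in range(n)]
    return sorted(sorted(range(n), key=lambda j: mass[j], reverse=True)[:k])

def select_columns(attn, cols):
    """Keep only the chosen key columns, at full resolution (the 'highlighter')."""
    return [[row[j] for j in cols] for row in attn]

def multi_scale_distill_loss(teacher, student, factor=2, k=1, local_weight=1.0):
    """Global term on the low-res maps plus a local term on the salient columns."""
    global_loss = mse(avg_pool(teacher, factor), avg_pool(student, factor))
    cols = salient_keys(teacher, k)
    local_loss = mse(select_columns(teacher, cols), select_columns(student, cols))
    return global_loss + local_weight * local_loss

# Toy maps: the teacher focuses on token 1, the student spreads attention evenly.
teacher = [[0.1, 0.7, 0.1, 0.1] for _ in range(4)]
student = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]
print(salient_keys(teacher, 1))
print(multi_scale_distill_loss(teacher, student))
```

The global term is cheap because it compares much smaller maps, while the local term spends full resolution only on the handful of tokens that matter.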

The Analogy: Imagine you are trying to memorize a map of a city.

  • Old Way: You try to memorize every single crack in the sidewalk (too much info) or just look at a blurry photo (not enough info).
  • QuantSparse Way: You look at a small map to see the main roads (Global), and then you use a highlighter to mark the specific coffee shops you need to visit (Local). You get the best of both worlds without memorizing the whole city.

2. The "Time-Traveling Glue" (Second-Order Sparse Attention Reparameterization)

The Problem: Videos are made of frames that happen one after another. If you delete pixels (Sparsification), you lose some "glue" that holds the movement together. Usually, the AI tries to guess the missing glue by looking at the previous frame. But because the movie is also compressed (Quantization), the "glue" changes slightly every second, making the guess wrong.

The Solution: The editor realizes that while the "glue" changes, the way it changes is actually very stable.

  • First-Order (The Old Way): "The glue was here yesterday, so it's probably here today." (Often wrong because of compression noise).
  • Second-Order (The New Way): "The glue moved this specific direction yesterday, and the change in that movement is very predictable."
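
The difference between the two guesses can be sketched in a few lines of Python. This is a toy illustration under assumptions, not the paper's algorithm: the "glue" is modeled as a residual (full attention output minus sparse attention output), the residual drifts by a constant amount per step, and the function names are hypothetical.

```python
def first_order_fill(sparse_out, prev_residual):
    """Old way: reuse the last known residual unchanged."""
    return [s + r for s, r in zip(sparse_out, prev_residual)]

def second_order_fill(sparse_out, prev_residual, prev_delta):
    """New way: also add the last observed change of the residual."""
    return [s + r + d for s, r, d in zip(sparse_out, prev_residual, prev_delta)]

# Residuals observed at two earlier steps (assumed values for the toy example).
residual_step0 = [1.0, 2.0]
residual_step1 = [1.5, 2.5]
delta = [b - a for a, b in zip(residual_step0, residual_step1)]  # how the glue moved

# At the next step we only have the cheap sparse output; the true full output is hidden.
sparse_step2 = [10.0, 20.0]
full_step2 = [12.0, 23.0]  # ground truth, used here only to measure error

guess_first = first_order_fill(sparse_step2, residual_step1)
guess_second = second_order_fill(sparse_step2, residual_step1, delta)

error_first = max(abs(g - f) for g, f in zip(guess_first, full_step2))
error_second = max(abs(g - f) for g, f in zip(guess_second, full_step2))
print(error_first, error_second)
```

In this toy setup the residual keeps drifting, so reusing it as-is always lags behind, while tracking its change catches up. Real residuals are noisier, which is why the stability of that change matters.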

The Analogy: Imagine you are walking down a hallway.

  • First-Order: You guess your next step based on where you are now. If the floor is slippery (compression noise), you might slip.
  • Second-Order: You notice that even though the floor is slippery, your tendency to slip to the left is very consistent. So, you adjust your walk based on that consistent pattern. You aren't just guessing your position; you are predicting the change in your position, which is much more stable.

By using this "Time-Traveling Glue," QuantSparse can fill in the missing pieces of the video so accurately that the final result looks almost identical to the original, huge movie.

🏆 The Results: The Magic Numbers

When they tested this on the biggest video models (like HunyuanVideo and Wan2.1):

  • Storage: They shrank the model size by 3.8 times (like turning a 50GB hard drive into a roughly 13GB one).
  • Speed: They made the video generation 1.8 times faster.
  • Quality: The video quality stayed close to that of the original model. In fact, on some tests, it looked better than the output of other compression methods because it focused so well on the important details.
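
To make the magic numbers concrete, here is the arithmetic as a short Python snippet. The 50GB starting size comes from the hard drive comparison above and the 60-minute baseline rounds up the "nearly an hour" mentioned earlier. Both are illustrative inputs, not measurements, and the 16-bit to 4-bit ratio assumes the bit widths hinted at by the color analogy.

```python
original_gb = 50.0        # illustrative starting size from the analogy
size_reduction = 3.8      # reported storage reduction
speedup = 1.8             # reported generation speedup
baseline_minutes = 60.0   # assumed: "nearly an hour" rounded up

compressed_gb = original_gb / size_reduction
faster_minutes = baseline_minutes / speedup

# Upper bound if every number went from 16 bits to 4 bits (an assumption).
ideal_reduction = 16 / 4

print(f"{compressed_gb:.1f} GB after compression")
print(f"{faster_minutes:.1f} minutes per video")
print(f"ideal 16-bit to 4-bit reduction: {ideal_reduction:.1f}x")
```

That works out to about 13.2GB and about 33 minutes per video, with the 3.8x storage figure sitting just under the 4x ceiling of a pure 16-bit to 4-bit conversion.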

🎯 The Takeaway

QuantSparse is like a master chef who can take a giant, expensive, slow-to-cook feast and turn it into a quick, portable meal while keeping almost all of the flavor.

  • Before: You needed a supercomputer to make AI videos.
  • Now: With QuantSparse, these high-quality video generators can run on much smaller, cheaper hardware, bringing them closer to real-world use (eventually even on devices like your phone or in a web browser).

It tackles the "impossible triangle" of AI: Fast + Small + High Quality. Usually, you can only pick two. QuantSparse shows you can get much closer to having all three.