BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching

This paper proposes BWCache, a training-free method that accelerates DiT-based video generation by dynamically caching and reusing block features across diffusion timesteps based on a similarity threshold, achieving up to a 2.6× speedup while maintaining visual fidelity.

Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng, Weijia Jia

Published 2026-03-03

Imagine you are directing a massive, high-tech movie production. The script is your text prompt (e.g., "A cat riding a skateboard at sunset"), and the actors are the pixels on your screen.

In the world of AI video generation, the current stars are called Diffusion Transformers (DiTs). Think of them as incredibly talented but extremely slow actors. To create a single second of video, they have to perform a long, repetitive dance called "denoising." They start with a screen full of static (like TV snow) and slowly, step-by-step, clean it up until a clear image appears.

The Problem:
This dance is sequential: each step starts from the result of the previous one, so the actor has to do Step 1, then Step 2, then Step 3, all the way to, say, Step 30. That adds up to a long wait (high latency), which makes the approach hard to use for real-time applications.

The Old Solutions:

  • The "Architect" Approach: Some tried to build a smaller, faster stage (pruning or distilling the model). But this often made the actors forget their lines, resulting in blurry or weird videos.
  • The "Lazy" Approach: Others tried to skip steps entirely. But if you skip too many, the movie looks like a glitchy mess.

The New Solution: BWCache (Block-Wise Caching)
The authors of this paper, Hanshuai Cui and colleagues, noticed something useful while watching these actors rehearse: in the middle of the performance, the actors barely move.

The "U-Shape" Discovery

Imagine the actor's energy levels over the course of the 30 steps:

  1. Start: The actor is frantic, making huge changes to the static noise. Everything is chaotic.
  2. Middle: The scene is mostly set. The actor is just making tiny, almost invisible adjustments. They are essentially standing still, holding the pose.
  3. End (the finale): The actor goes into overdrive again to add the final, crisp details (the sparkle in the eye, the texture of the fur).

Plot how much the actor changes from one step to the next and the curve is high, then low, then high again: a "U".

The authors found that during that Middle phase, the computer is doing a lot of work for almost no change in the picture. It's like a chef chopping vegetables for a soup that has already been simmering for an hour; the chopping is redundant.
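To make "how much did the actor move" concrete, here is a minimal sketch that scores the change between consecutive steps. It assumes a relative L1 distance as the measure, and the feature values are made-up toy numbers, not measurements from the paper. The function name `relative_l1` is hypothetical.

```python
def relative_l1(current, previous):
    """Relative L1 distance between two equally sized feature vectors."""
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous)
    return num / den if den else float("inf")

# Toy features for 8 denoising steps (hypothetical values chosen to
# mimic the fast-slow-fast pattern described above).
features = [
    [1.00, 2.00],
    [1.60, 2.90],
    [1.90, 3.30],
    [1.92, 3.32],
    [1.93, 3.33],
    [1.94, 3.34],
    [2.30, 3.90],
    [2.90, 4.60],
]

# Change between each step and the one before it.
changes = [
    relative_l1(features[i], features[i - 1])
    for i in range(1, len(features))
]

for step, change in enumerate(changes, start=1):
    print(f"step {step}: change = {change:.4f}")
```

With these toy numbers the first and last changes are large and the middle ones are close to zero, which is the "U" the analogy describes.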

How BWCache Works: The "Freeze-Frame" Trick

Instead of forcing the computer to recalculate every single step, BWCache introduces a smart manager with a clipboard.

  1. The Check-In: At every step, the manager looks at the "actors" (the internal parts of the AI called DiT blocks).
  2. The Similarity Test: The manager asks, "Did the actor move much since the last step?"
    • If YES (Big Change): The manager says, "Keep working! We need to see this." The computer runs the blocks again and saves the fresh result.
    • If NO (Tiny Change): The manager says, "Hold that pose! You're just repeating yourself." The computer skips the calculation and reuses the block outputs it saved at an earlier step.
  3. The Safety Net: To make sure the video doesn't get "stuck" or blurry from reusing the same image too long, the manager forces a "rehearsal" every few steps to refresh the details.
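The three steps above can be sketched as a small "manager" class. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name `BlockCacheManager`, the relative L1 measure, the `threshold` and `max_reuse` values, and the toy stand-in for the DiT blocks are all hypothetical.

```python
def relative_l1(current, previous):
    """Relative L1 distance between two equally sized feature vectors."""
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous)
    return num / den if den else float("inf")


class BlockCacheManager:
    """Decides, step by step, whether to recompute or reuse block outputs."""

    def __init__(self, threshold=0.05, max_reuse=2):
        self.threshold = threshold    # assumed similarity threshold
        self.max_reuse = max_reuse    # assumed cap on consecutive reuses
        self.cached = None            # outputs of the last computed step
        self.last_change = None       # change between the last two computed steps
        self.reuse_count = 0

    def should_reuse(self):
        return (
            self.cached is not None
            and self.last_change is not None
            and self.last_change < self.threshold   # the similarity test
            and self.reuse_count < self.max_reuse   # the safety net
        )

    def step(self, compute_blocks, step_index):
        """Return (features, reused) for one denoising step."""
        if self.should_reuse():
            self.reuse_count += 1
            return self.cached, True
        fresh = compute_blocks(step_index)
        if self.cached is not None:
            self.last_change = relative_l1(fresh, self.cached)
        self.cached = fresh
        self.reuse_count = 0
        return fresh, False


# Toy stand-in for running the DiT blocks: fast change early and late,
# almost none in the middle (hypothetical values).
values = [1.0, 1.6, 1.9, 1.92, 1.93, 1.94, 2.3, 2.9]

def toy_blocks(step_index):
    v = values[step_index]
    return [v, 2 * v]

manager = BlockCacheManager()
log = [manager.step(toy_blocks, i)[1] for i in range(len(values))]
print(log)
```

In this toy run, steps 4 and 5 reuse the cache and every other step recomputes. The `max_reuse` cap is what forces the recomputation at step 6, where the manager sees that the features have started moving again and keeps computing.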

Why This is a Game-Changer

Think of it like watching a movie on a streaming service.

  • Without BWCache: Your internet has to download every single frame, even the ones where the camera is just panning slowly across a static wall.
  • With BWCache: The service realizes, "Hey, this wall isn't moving." It sends you the same image data for the next few seconds, saving massive amounts of data and time.

The Results:

  • Speed: Video generation became up to 2.6 times faster.
  • Quality: Because they are smart about when to skip (only when the image is truly stable), the video looks just as good as the slow version. It doesn't look blurry or glitchy.
  • No Training Needed: You don't have to teach the AI a new way to act. You just give it this new "manager" (the caching system) to run the show. It works with existing models like Open-Sora, HunyuanVideo, and others.

In a Nutshell

BWCache is like a smart traffic light for AI video generation. It knows exactly when the traffic (computation) is heavy and needs to flow, and when the road is clear enough to let cars (frames) coast without an engine running. It saves time and energy without causing a traffic jam (quality loss).