Adaptive Hybrid Caching for Efficient Text-to-Video Diffusion Model Acceleration

This paper proposes MixCache, a training-free framework that accelerates video DiT inference by employing a context-aware triggering mechanism and an adaptive hybrid strategy to dynamically select optimal caching granularities, thereby significantly improving both generation speed and quality.

Yuanxin Wei, Lansong Diao, Bujiao Chen, Shenggan Cheng, Zhengping Qian, Wenyuan Yu, Nong Xiao, Wei Lin, Jiangsu Du

Published 2026-02-27

Imagine you are an artist trying to paint a beautiful, complex movie scene frame by frame. You start with a blank canvas covered in static noise (like TV snow). To get the final image, you have to slowly "denoise" the picture, refining it step-by-step.

In the world of AI video generation (specifically using models called Video DiTs), this process is like the AI taking 50 to 100 tiny steps to turn that static noise into a clear video. The problem? It's incredibly slow. If you want to generate a 5-second video, it might take the computer 50 minutes to think through every single step.

This paper introduces MixCache, a clever "smart shortcut" system that makes this process nearly twice as fast without ruining the quality of the movie.

Here is how it works, using some everyday analogies:

1. The Problem: The "Over-Thinker"

Imagine you are walking through a foggy forest. To find your way, you stop every few feet to check your map, look at the trees, and confirm your direction.

  • The AI's current method: It stops at every single step to do a full, detailed calculation, even when the scenery hasn't changed much.
  • The result: You get to the destination safely, but it takes forever because you are doing unnecessary work.

2. The Solution: The "Smart Shortcut" (MixCache)

The researchers realized that in the middle of the journey, the scenery doesn't change much between steps. Sometimes, the "conditional" instructions (what you asked for) produce results very similar to the "unconditional" ones (what happens if you ask for nothing). Sometimes, the middle parts of the painting process look nearly identical to the previous step.

MixCache is like a smart guide who knows when to stop and think, and when to just keep walking using the last known information.

It uses three types of shortcuts (Granularities):

  1. Step Level: "Hey, the whole picture looks the same as the last frame. Let's just copy the last frame instead of painting a new one."
  2. CFG Level: "The instructions for 'a cat' and 'no instructions' are giving us very similar results right now. Let's just reuse the 'no instructions' result."
  3. Block Level: "The middle layers of the painting process haven't changed. Let's skip re-painting the middle and just use the old paint."
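The three shortcuts above can be sketched as a single cache-reuse check applied at different granularities. This is a minimal illustrative sketch, not the paper's implementation; the names (`Cache`, `relative_change`) and the threshold values are assumptions.

```python
import math

def relative_change(prev, curr):
    """L2 distance between two feature vectors, normalized by the old norm."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(prev, curr)))
    norm = math.sqrt(sum(a * a for a in prev)) or 1.0
    return diff / norm

class Cache:
    def __init__(self, step_tau=0.05, cfg_tau=0.05, block_tau=0.05):
        # One similarity threshold per granularity: step, CFG, block.
        self.taus = {"step": step_tau, "cfg": cfg_tau, "block": block_tau}
        self.saved = {}  # granularity -> last fully computed features

    def reuse(self, level, features):
        """Return True if the cached result at `level` is still fresh enough to reuse."""
        prev = self.saved.get(level)
        if prev is not None and relative_change(prev, features) < self.taus[level]:
            return True  # outputs barely changed: skip recomputation, reuse the cache
        self.saved[level] = features  # cache miss: recompute and store fresh features
        return False
```

The same check drives all three shortcuts; only what gets cached differs (a whole step's output, the unconditional CFG branch, or a run of middle transformer blocks).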

3. The Magic: Why "Hybrid" Matters

Previous methods were like a driver who only knew one trick: either they always skipped steps, or they always skipped middle layers.

  • The Flaw: If you skip too much too early (when the fog is thick), you might paint a monster instead of a cat. If you don't skip enough when the fog clears, you waste time.

MixCache is the "Adaptive Hybrid" driver. It doesn't stick to one rule. It constantly asks:

  • "Is the fog clearing?" (Context-aware Triggering)
  • "Which shortcut is safest right now? Should I skip the whole step, or just the middle part?" (Adaptive Decision)

It uses a Penalty System to avoid over-relying on any single shortcut. If it took the "Step Level" shortcut three times in a row, small errors could quietly pile up. So MixCache penalizes repeated use of the same shortcut, nudging itself to switch strategies and keeping the video high quality.
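One way to picture the penalty idea: each shortcut's attractiveness drops every time it is chosen, and recovers while it rests. The scoring rule below is a hedged sketch of that behavior, not the paper's exact formula; `savings` and the decay constant are illustrative.

```python
class PenaltyScheduler:
    """Pick a cache granularity, penalizing whichever one was just used."""

    def __init__(self, levels=("step", "cfg", "block"), decay=0.5):
        self.decay = decay
        self.penalty = {lvl: 0.0 for lvl in levels}

    def choose(self, savings):
        """Pick the level with the best (estimated savings - penalty) score.

        `savings` maps each level to the compute it would save this step.
        """
        best = max(savings, key=lambda lvl: savings[lvl] - self.penalty[lvl])
        for lvl in self.penalty:
            if lvl == best:
                self.penalty[lvl] += 1.0   # chosen level becomes less attractive
            else:
                self.penalty[lvl] *= self.decay  # resting levels recover
        return best
```

With fixed savings of `{"step": 3.0, "cfg": 2.5, "block": 1.0}`, the scheduler alternates between "step" and "cfg" rather than greedily picking "step" every time.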

4. The "Warm-Up" Phase

Just like a car engine needs to warm up before you can drive fast, the AI needs to do the first few steps of the video generation without any shortcuts. This is the "Warm-Up Phase."

  • Why? At the very beginning, the AI is deciding the entire structure of the video (the skeleton). If you skip steps here, the video might look like a glitchy mess.
  • MixCache's trick: It waits until the "fog" clears (the image stabilizes) before it starts using its shortcuts.
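The warm-up gate above can be sketched as a simple predicate: shortcuts stay off during the first few steps, and turn on only once the step-to-step change has settled. The step count, window size, and threshold here are illustrative assumptions, not values from the paper.

```python
def caching_enabled(step, change_history, warmup_steps=5, stable_tau=0.1):
    """Allow shortcuts only after the warm-up window AND once changes settle.

    `change_history` holds the relative step-to-step change of recent outputs.
    """
    if step < warmup_steps:
        return False                 # video structure still forming: no shortcuts
    recent = change_history[-3:]     # look at the last few step-to-step changes
    return bool(recent) and max(recent) < stable_tau  # "fog" has cleared
```

In other words, even after the warm-up window, caching waits for evidence that the denoising trajectory has stabilized.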

5. The Results: Speed vs. Quality

The paper tested this on massive AI models (like Wan 14B and HunyuanVideo).

  • Without MixCache: Generating a video takes a long time (e.g., 50 minutes).
  • With MixCache: It cuts that time almost in half (e.g., down to 25 minutes), which is a 1.9x to 2x speedup.
  • The Quality: The videos look almost identical to the slow, perfect versions. The "smart guide" didn't take a wrong turn; it just took the highway instead of the dirt road.

Summary

Think of MixCache as a smart traffic controller for AI video generation. Instead of forcing the AI to stop and calculate every single detail at every intersection, the controller looks at the traffic, sees that the road is clear, and says, "Okay, you can skip this intersection and the next one, but let's slow down for the tricky turn ahead."

This allows the AI to generate high-quality videos twice as fast, making it possible to create movies and animations in real-time or near real-time, rather than waiting hours.
