CHAI: CacHe Attention Inference for text2video

CHAI accelerates text-to-video diffusion inference by introducing Cache Attention, a mechanism that reuses cached latents across semantically related prompts to achieve 1.65x–3.35x speedups over OpenSora 1.2 while maintaining high video quality with as few as 8 denoising steps.

Joel Mathew Cherian, Ashutosh Muralidhara Bharadwaj, Vima Gupta, Anand Padmanabha Iyer

Published 2026-02-19

Imagine you are an artist tasked with painting a movie scene from scratch, frame by frame. You start with a blank canvas covered in static noise (like TV snow). To get a clear picture of a "sunny beach," you have to slowly wipe away the noise, refining the image step-by-step.

The Problem:
Current AI video generators (like OpenSora) are like incredibly talented but slow painters. To make a smooth, high-quality video, they need to wipe away the noise 30 to 50 times for every single frame. This takes a long time, making it hard to use these tools for real-time applications or quick projects.

The Old Solutions:

  1. Retraining: Teaching the painter a new, faster way to paint. This is expensive and takes months.
  2. Skipping Steps: Telling the painter, "Just skip the middle steps, I'll guess what happens." This is fast, but the painting often looks blurry, glitchy, or the objects melt into each other.

The New Solution: CHAI (CacHe Attention Inference)
The authors of the CHAI paper came up with a clever trick. Instead of forcing the painter to start from total noise every time, they say: "Let's look at what we painted before."

Here is how CHAI works, explained with simple analogies:

1. The "Whole Prompt" vs. "Entity" Mistake

Imagine you ask an AI to draw "A tiger running in a jungle."

  • The Old Way (NIRVANA): The system looks at your entire sentence. It asks, "Have I ever drawn a tiger running in a jungle before?" If the answer is "No, I only drew a tiger sitting in a cave," it says, "No match! Start from scratch."
  • The CHAI Way: CHAI is smarter. It breaks the sentence down into entities (objects and scenes). It asks, "Have I ever drawn a tiger?" Yes. "Have I ever drawn a jungle?" Yes.
    • Analogy: It's like a chef. If you order "Spicy Chicken Tacos," and the chef has already made "Spicy Chicken Burritos," a normal chef might say, "That's a different dish, I'll start from zero." A CHAI chef says, "I already have the spicy chicken sauce and the tortillas ready! I just need to change the shape."
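The difference between whole-prompt caching and entity-level caching can be sketched in a few lines of code. This is an illustrative toy, not the paper's actual system: the class and function names are made up, the "entity extractor" is a hard-coded word list standing in for a real NLP model, and the cached "latents" are just placeholder strings.

```python
# Hypothetical sketch of entity-level cache lookup (names are illustrative,
# not CHAI's actual API). A whole-prompt cache misses unless the exact
# prompt was seen before; an entity-level cache can reuse partial work.

def extract_entities(prompt):
    """Toy entity extractor: a real system would use a proper NLP model."""
    known = {"tiger", "jungle", "cave", "beach", "ocean"}
    return {word.strip(".,").lower() for word in prompt.split()} & known

class EntityCache:
    def __init__(self):
        self.store = {}  # entity -> cached work (placeholder values here)

    def insert(self, prompt, latents):
        # Index the cached work under each entity, not the whole prompt.
        for entity in extract_entities(prompt):
            self.store[entity] = latents

    def lookup(self, prompt):
        # Return which entities we can reuse and which we must compute fresh.
        entities = extract_entities(prompt)
        hits = {e: self.store[e] for e in entities if e in self.store}
        misses = entities - hits.keys()
        return hits, misses

cache = EntityCache()
cache.insert("A tiger sitting in a cave", "latents_v1")

hits, misses = cache.lookup("A tiger running in a jungle")
print(sorted(hits))    # ['tiger'] -- partial reuse despite a new prompt
print(sorted(misses))  # ['jungle']
```

A whole-prompt cache keyed on the full sentence would have returned nothing here, since "A tiger running in a jungle" was never seen verbatim.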

2. The "Magic Attention" Lens (Cache Attention)

This is the secret sauce. You can't just copy-paste the old "tiger" into the new video, because the new video might need the tiger to be running, not sitting.

CHAI introduces a mechanism called Cache Attention.

  • Analogy: Imagine the old video is a library of reference photos. When the AI starts painting the new video, it doesn't just grab the whole photo. Instead, it puts on special glasses (the attention mechanism).
  • These glasses let the AI look at the old "tiger" photo and say, "Okay, I'll use the shape of the tiger and the color of the fur from the old photo, but I will ignore the 'sitting' pose."
  • It selectively "picks out" the useful parts (the entities) from the old work and mixes them with the new instructions. This allows the AI to skip the boring, repetitive early steps of figuring out "what a tiger looks like" and jump straight to "how to make this tiger run."
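The "special glasses" above are ordinary attention arithmetic: queries from the new generation score each cached feature, and the output is a weighted blend that favors the relevant parts. The following is a minimal, self-contained caricature in plain Python; the real Cache Attention operates on high-dimensional diffusion latents inside the model, and the 2-D "tiger shape" and "sitting pose" vectors are invented for illustration.

```python
import math

# Minimal sketch of attending over cached features (illustrative only).
# Queries come from the new generation; keys/values are cached features.

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cache_attention(query, cached_keys, cached_values):
    """Blend cached values, weighted by each key's relevance to the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in cached_keys]
    weights = softmax(scores)
    dim = len(cached_values[0])
    return [sum(w * v[i] for w, v in zip(weights, cached_values))
            for i in range(dim)]

# Toy 2-D features: the query matches the "tiger shape" key far better than
# the "sitting pose" key, so the output leans almost entirely on the first
# cached value -- the useful part is picked out, the stale pose is ignored.
tiger_shape, sitting_pose = [4.0, 0.0], [0.0, 4.0]
query = [4.0, 0.0]
out = cache_attention(query, [tiger_shape, sitting_pose],
                      [[1.0, 0.0], [0.0, 1.0]])
print(out)  # first component near 1.0, second near 0.0
```

The softmax is what makes the selection soft: nothing is copy-pasted wholesale, every cached entry contributes in proportion to its relevance.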

3. The Result: Speed without the Blur

Because CHAI reuses the "skeleton" of previous videos (the objects and scenes), it doesn't need to start from scratch.

  • The Baseline: Needs 30 steps to paint a perfect video.
  • CHAI: Only needs 8 steps.
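The arithmetic behind this can be sketched with a caricature of a diffusion sampler. This is not the paper's actual scheduler; `denoise_step` is a stand-in that just counts work, and the step indices are chosen to match the 30-vs-8 numbers above.

```python
# Hypothetical sketch of why reusing cached work cuts denoising steps.
# Starting from a cached, partially denoised latent lets the sampler run
# only the final refinement steps instead of the whole schedule.

TOTAL_STEPS = 30

def denoise_step(latent, step):
    """Stand-in for one denoising update; here it only tallies work done."""
    return latent + 1

def generate(start_latent, start_step):
    latent = start_latent
    for step in range(start_step, TOTAL_STEPS):
        latent = denoise_step(latent, step)
    return latent, TOTAL_STEPS - start_step

# Baseline: start from pure noise and run the full schedule.
_, baseline_steps = generate(start_latent=0, start_step=0)

# Cached start: resume from a latent already refined through step 22,
# leaving only the last 8 steps of detail work.
_, cached_steps = generate(start_latent=22, start_step=22)

print(baseline_steps, cached_steps)  # 30 8
```

The speedup comes from skipping the early steps that only establish coarse structure, which the cache already provides.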

The Analogy:

  • Standard AI: Like building a house from the ground up every time you want a new room. You have to pour the concrete, lay the bricks, and frame the walls from scratch.
  • CHAI: Like having a warehouse of pre-built walls and windows. If you want a new room, you just grab the pre-made walls (the cached entities), assemble them, and paint the final details. You skip the heavy lifting.

Why is this a big deal?

  • Speed: It makes video generation 1.65x to 3.35x faster.
  • Quality: Unlike other "fast" methods that make videos look like glitchy messes, CHAI keeps the quality almost identical to the slow, perfect version.
  • Efficiency: It works even if you only have a small amount of computer memory (cache), because it's smart enough to know that "beach" and "ocean" are similar enough to reuse parts of the work.

In a nutshell:
CHAI is like a super-efficient assistant who remembers the details of your past projects. Instead of making you start over every time you ask for a video, it says, "I've already done the hard part of figuring out what a 'beach' looks like. Let's just tweak that for your new request." This saves hours of waiting time while keeping the video looking crisp and clear.
