WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching

WorldCache is a novel caching framework that accelerates diffusion-based world models by up to 3.7× without retraining, overcoming challenges of token heterogeneity and non-uniform temporal dynamics through curvature-guided prediction and chaotic-prioritized adaptive skipping to maintain high rollout quality.

Weilun Feng, Guoxin Fan, Haotong Qin, Chuanguang Yang, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Dingrui Wang, Longlong Liao, Michele Magno, Yongjun Xu

Published 2026-03-09

Imagine you are trying to predict the future of a complex, moving world—like a video game where you can walk around, look at objects, and see how the sun moves across the sky. This is what "World Models" do. They are AI systems that try to simulate reality.

However, there's a big problem: these simulations are incredibly slow and expensive to run. To generate just a few seconds of video, the AI has to take hundreds of tiny steps, recalculating the entire scene from scratch every single time. It's like trying to paint a masterpiece by repainting the entire canvas from the first brushstroke to the last, every time you want to add one new detail.

WorldCache is a new method that speeds this up by up to 3.7 times, without retraining the model and with almost no loss in picture quality. It does this by being selective about which calculations it skips.

Here is how it works, using a simple analogy:

The Problem: The "Lazy" vs. The "Chaotic"

Imagine you are watching a movie.

  • The Sky: The clouds move slowly and predictably. If you know where they were 5 seconds ago, you can guess where they are now with almost 100% accuracy.
  • The Car Crash: Suddenly, a car swerves, crashes, and sparks fly. This is chaotic. If you try to guess what happens next based on the last 5 seconds, you will be completely wrong.

Old AI methods treated the whole movie the same way. They either:

  1. Reused everything: They just copied the last frame. This worked for the sky but made the car crash look like a blurry, frozen mess.
  2. Calculated everything: They recalculated every single pixel for every step. This was accurate but took forever.
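To make the trade-off concrete, here is a minimal sketch of what "just copy the last value" costs on a calm signal versus an erratic one. The two number sequences are made-up illustrations (not data from the paper), and `reuse_error` is a hypothetical helper name.

```python
def reuse_error(trajectory):
    """Mean absolute error when each value is predicted by copying the previous one."""
    errors = [abs(curr - prev) for prev, curr in zip(trajectory, trajectory[1:])]
    return sum(errors) / len(errors)

# Hypothetical 1-D feature values over six steps (illustrative only).
sky = [1.00, 1.01, 1.02, 1.03, 1.04, 1.05]      # slow, steady change
crash = [1.00, 0.20, 1.50, -0.40, 0.90, -1.10]  # abrupt direction changes

print("sky:  ", reuse_error(sky))    # small error: copying is nearly free
print("crash:", reuse_error(crash))  # large error: copying breaks down
```

Copying is almost harmless for the slow signal and badly wrong for the erratic one, which is why treating every part of the scene the same way forces a choice between speed and accuracy.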

The Solution: WorldCache

WorldCache acts like a smart director who knows exactly which parts of the scene need attention and which parts can be ignored. It uses two main tricks:

1. The "Curvature" Compass (Spotting the Chaos)

The AI looks at every tiny piece of the image (called a "token") and asks: "Is this piece moving in a straight line, or is it twisting and turning wildly?"

  • Stable Tokens (The Sky): These barely change from one step to the next. The AI says, "I know what you're doing. I'll just copy your last value." (This is Reuse).
  • Linear Tokens (A walking person): These move predictably but are changing speed slightly. The AI says, "I can guess your next move by drawing a straight line from your last two positions." (This is Linear Extrapolation).
  • Chaotic Tokens (The car crash): These are twisting, turning, and changing direction instantly. The AI says, "Oh no, this is dangerous! A straight-line guess would overshoot here, so I'll only take a smaller, more cautious step." (This is the Damped Update).
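The three-way split above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the curvature score (change in velocity relative to motion), the thresholds `stable_tol` and `chaos_tol`, and the `damping` factor are all hypothetical, and the chaotic branch uses a scaled-down extrapolation step as one possible reading of "damped update", since the article does not give the formula.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def classify_and_predict(h2, h1, h0, stable_tol=1e-2, chaos_tol=0.5, damping=0.5):
    """Guess a token's next feature vector from its last three values.

    h2, h1, h0 are the token's features at the last three computed steps,
    oldest first. All thresholds are assumed values for illustration.
    """
    v_prev = sub(h1, h2)   # earlier "velocity"
    v_last = sub(h0, h1)   # most recent "velocity"
    motion = max(norm(v_prev), norm(v_last))

    if motion < stable_tol:
        return "stable", list(h0)  # Reuse: copy the last value

    # Curvature proxy: how much the velocity changed, relative to the motion.
    curvature = norm(sub(v_last, v_prev)) / (motion + 1e-8)

    if curvature < chaos_tol:
        # Linear extrapolation: continue along the last velocity.
        return "linear", [x + v for x, v in zip(h0, v_last)]

    # Damped update (assumed form): take only a fraction of the last step.
    return "chaotic", [x + damping * v for x, v in zip(h0, v_last)]

print(classify_and_predict([1.0, 1.0], [1.001, 1.0], [1.002, 1.0]))  # sky
print(classify_and_predict([0.0, 0.0], [1.0, 0.0], [2.0, 0.0]))      # walker
print(classify_and_predict([0.0, 0.0], [1.0, 0.0], [0.0, 1.0]))      # crash
```

In a real model each token would be a high-dimensional feature vector and the classification would run over all tokens at once; plain Python lists are used here only to keep the sketch dependency-free.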

2. The "Drift" Alarm (Knowing When to Stop Guessing)

Even for the chaotic parts, the AI doesn't want to calculate every step if it doesn't have to. So, it sets up a Drift Alarm.

Imagine you are walking on a tightrope. As long as you are steady, you don't need a safety net. But if you start to wobble (drift), the net needs to catch you.

  • WorldCache watches the "chaotic" parts closely.
  • It measures how much the AI's guess is "drifting" away from reality.
  • Crucially: It ignores the stable parts (the sky) because they aren't drifting. It only listens to the chaotic parts.
  • As soon as the chaotic parts start to wobble too much, the alarm rings, and the AI stops guessing and does the hard math again.
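A minimal sketch of such an alarm, assuming a per-token drift estimate is already available at each skipped step. The class name `DriftAlarm`, the accumulation rule, and the threshold value are hypothetical choices made for illustration; the article does not specify how drift is measured or combined.

```python
class DriftAlarm:
    """Accumulates drift over skipped steps, listening only to chaotic tokens."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed value, not from the paper
        self.accumulated = 0.0

    def step(self, drift_per_token, is_chaotic):
        """Return True if the model should do a full recomputation now."""
        chaotic = [d for d, c in zip(drift_per_token, is_chaotic) if c]
        if chaotic:
            # Stable tokens are left out so they cannot dilute the signal.
            self.accumulated += sum(chaotic) / len(chaotic)
        if self.accumulated > self.threshold:
            self.accumulated = 0.0  # a full recomputation resets the drift
            return True
        return False

# Eight calm tokens and two chaotic ones (made-up numbers).
drift = [0.0] * 8 + [0.2, 0.2]
chaotic_mask = [False] * 8 + [True, True]

alarm = DriftAlarm(threshold=0.3)
decisions = [alarm.step(drift, chaotic_mask) for _ in range(3)]
print(decisions)  # [False, True, False]
```

Averaging over all ten tokens would report a drift of only 0.04 per step, so the alarm would stay quiet for much longer; restricting the average to the chaotic tokens makes it ring on the second step.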

Why is this a big deal?

Most previous methods were like a student who tries to memorize the whole textbook to answer one question, or a student who guesses randomly and fails.

WorldCache is like a tutor who knows exactly which questions are hard.

  • It spends most of its time quickly glancing at the easy stuff (the sky, the walls).
  • It only stops to do the heavy lifting when the "chaotic" stuff (moving cars, complex interactions) starts to get messy.

The Result

By doing this, WorldCache makes the AI up to 3.7 times faster (like turning a 10-minute wait into a wait of under 3 minutes) while keeping the video quality almost identical to the slow, expensive version. That brings these complex "world simulations" closer to running on standard computers, making interactive AI and virtual reality much more practical for everyone.
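As a quick check on the waiting-time comparison, the arithmetic is just the original time divided by the speedup factor. The helper name `accelerated_minutes` is made up for this illustration, and 3.7 is the best-case figure quoted in the summary, so real runs may gain less.

```python
def accelerated_minutes(baseline_minutes, speedup):
    """Time a job takes after a given speedup factor."""
    return baseline_minutes / speedup

print(round(accelerated_minutes(10, 3.7), 1))  # 2.7
```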

In short: WorldCache stops the AI from wasting energy on things that are easy to predict, so it can focus its brainpower on the things that are actually changing and difficult.