WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching

Imagine you are trying to predict the future of a complex, moving world—like a video game where you can walk around, look at objects, and see how the sun moves across the sky. This is what "World Models" do. They are AI systems that try to simulate reality.

However, there's a big problem: these simulations are incredibly slow and expensive to run. To generate just a few seconds of video, the AI has to take hundreds of tiny steps, recalculating the entire scene from scratch every single time. It's like trying to paint a masterpiece by repainting the entire canvas from the first brushstroke to the last, every time you want to add one new detail.

WorldCache is a new method that speeds this up by as much as 3.7 times while keeping the picture almost exactly as good. It does this by being "smart" about what it skips.

Here is how it works, using a simple analogy:

The Problem: The "Lazy" vs. The "Chaotic"

Imagine you are watching a movie.

The Sky: The clouds move slowly and predictably. If you know where they were 5 seconds ago, you can guess where they are now with almost 100% accuracy.
The Car Crash: Suddenly, a car swerves, crashes, and sparks fly. This is chaotic. If you try to guess what happens next based on the last 5 seconds, you will be completely wrong.

Old AI methods treated the whole movie the same way. They either:

Reused everything: They just copied the last frame. This worked for the sky but made the car crash look like a blurry, frozen mess.
Calculated everything: They recalculated every single pixel for every step. This was accurate but took forever.

The Solution: WorldCache

WorldCache acts like a smart director who knows exactly which parts of the scene need attention and which parts can be ignored. It uses two main tricks:

1. The "Curvature" Compass (Spotting the Chaos)

The AI looks at every tiny piece of the image (called a "token") and asks: "Is this piece moving in a straight line, or is it twisting and turning wildly?"

Stable Tokens (The Sky): These barely change from one step to the next. The AI says, "I know what you're doing. I'll just copy your last value." (This is Reuse).
Linear Tokens (A walking person): These change at a steady, predictable pace. The AI says, "I can guess your next move by drawing a straight line through your last two positions." (This is Linear Extrapolation).
Chaotic Tokens (The car crash): These are twisting, turning, and changing direction abruptly. The AI says, "A straight-line guess would overshoot here, so I'll make a cautious guess that blends your current motion with your previous motion." (This is the Damped Update).

2. The "Drift" Alarm (Knowing When to Stop Guessing)

Even for the chaotic parts, the AI doesn't want to calculate every step if it doesn't have to. So, it sets up a Drift Alarm.

Imagine you are walking on a tightrope. As long as you are steady, you don't need a safety net. But if you start to wobble (drift), the net needs to catch you.

WorldCache watches the "chaotic" parts closely.
It measures how much the AI's guess is "drifting" away from reality.
Crucially: It ignores the stable parts (the sky) because they aren't drifting. It only listens to the chaotic parts.
As soon as the chaotic parts start to wobble too much, the alarm rings, and the AI stops guessing and does the hard math again.

Why is this a big deal?

Most previous methods were like a student who tries to memorize the whole textbook to answer one question, or a student who guesses randomly and fails.

WorldCache is like a tutor who knows exactly which questions are hard.

It spends most of its time quickly glancing at the easy stuff (the sky, the walls).
It only stops to do the heavy lifting when the "chaotic" stuff (moving cars, complex interactions) starts to get messy.

The Result

By doing this, WorldCache makes the AI up to about 3.7 times faster (like turning a 10-minute wait into a wait of under 3 minutes) while keeping the video quality almost identical to the slow, expensive version. It allows us to run these complex "world simulations" on a single GPU, making interactive AI and virtual reality much more practical.

In short: WorldCache stops the AI from wasting energy on things that are easy to predict, so it can focus its brainpower on the things that are actually changing and difficult.

1. Problem Statement

Diffusion-based world models (e.g., HunyuanVoyager, Aether) are powerful tools for simulating spatiotemporal dynamics, enabling long-horizon planning and interactive agents. However, their practical deployment is hindered by high inference costs due to the iterative denoising process requiring repeated backbone evaluations.

While feature caching (reusing or predicting intermediate features to skip backbone computations) has successfully accelerated single-modal image/video diffusion, it fails when directly applied to world models due to two unique challenges:

Token Heterogeneity & Long-Tailed Difficulty: World models process coupled multi-modal tokens (e.g., RGB appearance and 3D depth) with distinct physical dynamics. Most tokens evolve smoothly (easy to predict), but a small subset exhibits sharp, non-linear changes (e.g., motion boundaries, depth discontinuities). Uniform caching policies either waste computation on easy tokens or cause global drift by failing on these "hard" chaotic tokens.
Non-Stationary Temporal Dynamics: The difficulty of denoising steps varies significantly over time. A few "bottleneck" tokens often dominate error accumulation during specific chaotic intervals. Global-threshold heuristics either react too late (missing critical updates) or trigger too early (due to benign changes in easy tokens), leading to poor speed-quality trade-offs.
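
For context, the uniform feature caching that these two challenges break can be sketched as follows. This is a minimal illustration, not code from the paper: `run_with_uniform_cache`, `toy_backbone`, the fixed `interval`, and the toy update rule are all assumptions made for the example.

```python
import numpy as np

def run_with_uniform_cache(backbone, x, num_steps, interval):
    """Denoising-style loop that calls `backbone` only every `interval` steps.

    On skipped steps the last backbone output is reused for every token alike,
    which is the "uniform" policy described in the text.
    """
    cached = None
    full_calls = 0
    for step in range(num_steps):
        if cached is None or step % interval == 0:
            cached = backbone(x, step)  # expensive full evaluation
            full_calls += 1
        # Toy update rule (an assumption): move x a small step along the output.
        x = x - 0.1 * cached
    return x, full_calls

def toy_backbone(x, step):
    # Hypothetical stand-in for a diffusion backbone.
    return 0.5 * x

x0 = np.ones((4, 8))  # 4 tokens, 8 feature channels
_, calls_full = run_with_uniform_cache(toy_backbone, x0, num_steps=12, interval=1)
_, calls_cached = run_with_uniform_cache(toy_backbone, x0, num_steps=12, interval=3)
print(calls_full, calls_cached)  # 12 4
```

The saving comes entirely from the skipped calls, but every token gets the same stale output, so a handful of fast-changing tokens can corrupt the whole rollout.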

2. Methodology: WorldCache

The authors propose WorldCache, a training-free acceleration framework designed specifically for the heterogeneous nature of world models. It consists of two core components:

A. Curvature-Guided Heterogeneous Token Prediction (CHTP)

Instead of applying a single prediction rule to all tokens, WorldCache dynamically partitions tokens based on their curvature score ($\kappa$), which measures the non-linearity of their temporal trajectory.

Curvature Calculation: Using the last three full backbone outputs, the method computes discrete velocity ($v$) and acceleration ($a$) for each token. The curvature score is defined as $\kappa = \|a\|^2 / (\|v\|^2 + \epsilon)$.
Token Grouping: Tokens are categorized into three groups based on curvature percentiles:
- Stable ($I_{stable}$): Low curvature. Strategy: Direct Reuse (0th-order).
- Linear ($I_{linear}$): Moderate curvature. Strategy: Linear Extrapolation (1st-order).
- Chaotic ($I_{chaotic}$): High curvature (abrupt direction changes). Strategy: Hermite-guided Damped Update. This predictor blends the current velocity with the previous velocity using a cubic Hermite schedule to prevent overshooting and drift in chaotic regions.
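
The curvature score and the three prediction rules can be sketched in NumPy as below. This is a rough illustration under stated assumptions, not the authors' implementation: the percentile cut-offs, the blend parameter `s`, and the exact form of the damped update (a smoothstep-weighted mix of the current and previous velocity) are guesses made for the example, since the summary does not give them.

```python
import numpy as np

def curvature_scores(y0, y1, y2, eps=1e-8):
    """Per-token curvature from three consecutive outputs (oldest first).

    Each array has shape (num_tokens, dim).
    """
    v_prev = y1 - y0          # previous discrete velocity
    v = y2 - y1               # current discrete velocity
    a = v - v_prev            # discrete acceleration
    kappa = np.sum(a**2, axis=-1) / (np.sum(v**2, axis=-1) + eps)
    return kappa, v, v_prev

def partition_tokens(kappa, low_pct=50.0, high_pct=90.0):
    """Split tokens by curvature percentile (cut-offs are assumptions)."""
    lo, hi = np.percentile(kappa, [low_pct, high_pct])
    stable = kappa <= lo
    chaotic = kappa > hi
    linear = ~stable & ~chaotic
    return stable, linear, chaotic

def hermite_weight(s):
    """Cubic Hermite (smoothstep) weight 3s^2 - 2s^3 on [0, 1]."""
    s = np.clip(s, 0.0, 1.0)
    return 3.0 * s**2 - 2.0 * s**3

def predict_next(y2, v, v_prev, stable, linear, chaotic, s=0.5):
    pred = y2.copy()                                  # stable: direct reuse
    pred[linear] = y2[linear] + v[linear]             # linear extrapolation
    w = hermite_weight(s)                             # assumed blend schedule
    blended = (1.0 - w) * v[chaotic] + w * v_prev[chaotic]
    pred[chaotic] = y2[chaotic] + blended             # damped update
    return pred

# Three 1-D toy tokens: static, steady drift, and a direction reversal.
y0 = np.array([[0.0], [0.0], [0.0]])
y1 = np.array([[0.0], [1.0], [1.0]])
y2 = np.array([[0.0], [2.1], [0.0]])

kappa, v, v_prev = curvature_scores(y0, y1, y2)
stable, linear, chaotic = partition_tokens(kappa, low_pct=33.0, high_pct=66.0)
pred = predict_next(y2, v, v_prev, stable, linear, chaotic)
print(kappa, pred.ravel())
```

In this toy run the reversing token gets the highest curvature, and its damped prediction stays at its last value rather than following the latest velocity further in the new direction, which is the overshoot the damped rule is meant to avoid.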

B. Chaotic-Prioritized Adaptive Skipping (CAS)

To determine when to trigger a full backbone evaluation, WorldCache avoids global averaging, which dilutes critical errors.

Dimensionless Drift Indicator: The authors introduce a scale-normalized drift metric. Since raw feature differences vary in magnitude across modalities and timesteps, they define a dimensionless score: $e_i(t) = \kappa_i \cdot \|\Delta y_{t,i}\|$. The authors state that this product is invariant to global feature rescaling.
Chaotic Prioritization: The system only monitors the accumulated drift of the Chaotic token subset.
Adaptive Triggering: An accumulator $E_{acc}$ sums the normalized drift of chaotic tokens. A full computation is triggered only when $E_{acc}$ exceeds a unified threshold $\eta$. This ensures resources are allocated precisely when the most difficult tokens begin to diverge.
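
The triggering logic can be sketched as a small accumulator. This is an illustrative sketch only: the class name `ChaoticDriftTrigger`, the reset of the accumulator after a full computation, and the toy numbers are assumptions, not details confirmed by the paper.

```python
import numpy as np

class ChaoticDriftTrigger:
    """Accumulates drift over chaotic tokens and signals when to recompute."""

    def __init__(self, eta):
        self.eta = eta      # threshold on the accumulated drift
        self.e_acc = 0.0    # running accumulator E_acc

    def step(self, kappa, delta_y, chaotic_mask):
        """Return True if a full backbone evaluation should run at this step.

        kappa:        (num_tokens,) curvature scores
        delta_y:      (num_tokens, dim) feature change at this step
        chaotic_mask: (num_tokens,) boolean mask of chaotic tokens
        """
        drift = kappa * np.linalg.norm(delta_y, axis=-1)  # e_i(t) per token
        self.e_acc += float(np.sum(drift[chaotic_mask]))  # chaotic tokens only
        if self.e_acc > self.eta:
            self.e_acc = 0.0  # reset after a full computation (assumption)
            return True
        return False

# Token 0 is stable with a large raw change; token 1 is chaotic.
kappa = np.array([0.01, 2.0])
delta_y = np.array([[3.0, 4.0], [0.3, 0.4]])
chaotic_mask = np.array([False, True])

trigger = ChaoticDriftTrigger(eta=2.5)
decisions = [trigger.step(kappa, delta_y, chaotic_mask) for _ in range(4)]
print(decisions)  # [False, False, True, False]
```

The stable token's large raw change never reaches the accumulator, so only the chaotic token's drift decides when the full computation runs.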

3. Key Contributions

Problem Identification: The paper identifies that existing caching methods fail in world models due to multi-modal token heterogeneity and non-stationary temporal regimes where bottleneck tokens drive failure.
Curvature-Guided Prediction: A novel mechanism that assigns different approximation rules (Reuse, Linear, Damped) to tokens based on their trajectory non-linearity, specifically stabilizing chaotic tokens with a damped predictor.
Chaotic-Prioritized Skipping: A unified, scale-normalized skipping strategy that focuses on the "hard" tokens, enabling aggressive skipping without destabilizing the multi-modal rollout.
Training-Free Efficiency: The framework requires no retraining, making it immediately applicable to existing diffusion world models.

4. Experimental Results

The authors evaluated WorldCache on two state-of-the-art models: HunyuanVoyager-13B and Aether-5B.

Speedup:
- Voyager-13B: Achieved up to 3.65× end-to-end speedup (reducing latency from ~1054s to ~289s).
- Aether-5B: Achieved up to 2.61× speedup (reducing latency from ~55.4s to ~21.2s).
Quality Preservation:
- Maintained 98% of rollout quality compared to the baseline.
- Outperformed all training-free baselines (e.g., TeaCache, EasyCache, DuCa) in perceptual metrics (PSNR, SSIM, LPIPS) and WorldScore benchmarks.
- 3D Reconstruction: Preserved geometry-aware capabilities, achieving near-lossless depth estimation and camera pose accuracy (e.g., matching baseline Abs Rel of 0.340).
Memory Efficiency: Unlike layer-wise caching methods that often exceed single-GPU memory limits (>100GB), WorldCache adds negligible memory overhead, with total usage of ~50GB, similar to the baseline.
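
As a quick sanity check, the reported speedups follow directly from the approximate latencies quoted above. The helper name `speedup` is just for illustration.

```python
def speedup(baseline_s, accelerated_s):
    """Ratio of baseline latency to accelerated latency."""
    return baseline_s / accelerated_s

voyager = speedup(1054.0, 289.0)   # HunyuanVoyager-13B latencies in seconds
aether = speedup(55.4, 21.2)       # Aether-5B latencies in seconds
print(f"Voyager-13B: {voyager:.2f}x, Aether-5B: {aether:.2f}x")
# Voyager-13B: 3.65x, Aether-5B: 2.61x
```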

5. Significance

WorldCache addresses a critical bottleneck in the deployment of generative world models. By recognizing that not all tokens are created equal, it moves beyond "one-size-fits-all" caching strategies.

Practicality: It enables interactive, long-horizon simulations on resource-constrained hardware (single GPU) without sacrificing fidelity.
Generalizability: The concept of curvature-guided heterogeneous prediction and chaotic-prioritized monitoring offers a new paradigm for accelerating any diffusion model where multi-modal or complex dynamics exist.
Impact: It bridges the gap between theoretical world model capabilities and real-time application, facilitating faster iteration for robotics, autonomous driving, and interactive virtual environments.

The code is publicly available at: https://github.com/FofGofx/WorldCache.