The Big Problem: The "Slow Motion" Video Maker
Imagine you have a magical robot artist (a Diffusion Model) that can draw or create videos. But this robot is incredibly slow. To create a single 5-second video, it has to take 50 tiny steps. In each step, it looks at a blurry, noisy mess, thinks hard, and redraws the picture a little more clearly.
Doing this 50 times takes a long time and uses a lot of computer power. It's like trying to paint a masterpiece by making 50 tiny, separate brushstrokes, stopping to mix new paint for every single one.
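The 50-step process described above can be sketched in a few lines. This is a toy illustration only: `model(x, t)` is a hypothetical stand-in for the expensive denoiser, not any real diffusion library's API.

```python
import numpy as np

def generate(model, shape, steps=50, seed=0):
    """The slow baseline: start from pure noise and call the big,
    expensive model once per step, gradually denoising.
    `model(x, t)` is a hypothetical denoiser stand-in."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # the blurry, noisy mess
    for t in range(steps, 0, -1):
        x = model(x, t)  # one tiny brushstroke: redraw slightly clearer
    return x
```

Every one of those 50 calls is a full pass through a huge network, which is exactly the cost that caching methods try to avoid.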
The Old Solution: The "Guessing Game"
To speed this up, researchers tried a trick called Caching.
Think of it like this: If the robot is drawing a blue sky, and the picture looks almost the same from step 10 to step 11, why bother recalculating the whole sky? Just reuse the picture from step 10 for step 11.
However, the old methods (like TeaCache or MagCache) were like bad guessers. They used simple rules of thumb:
- "If the time step number changed by 2, reuse the picture."
- "If the difference between the last two pictures is small, reuse it."
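A rule of thumb like the second one can be sketched as below. This is a simplified, hypothetical version of the idea, not the actual TeaCache or MagCache logic: reuse the last output whenever the two most recent outputs barely differed.

```python
import numpy as np

def heuristic_cache_step(model, x, t, cache, threshold=0.05):
    """One denoising step with a fixed rule-of-thumb cache.
    If the relative change between the last two outputs was small,
    skip the expensive model call and reuse the cached picture.
    (Illustrative sketch; real methods use fancier indicators.)"""
    if cache.get("prev_delta") is not None and cache["prev_delta"] < threshold:
        return cache["prev_out"]  # skip the expensive forward pass

    out = model(x, t)  # full (slow) computation
    if cache.get("prev_out") is not None:
        cache["prev_delta"] = (np.linalg.norm(out - cache["prev_out"])
                               / (np.linalg.norm(cache["prev_out"]) + 1e-8))
    cache["prev_out"] = out
    return out
```

Note that `threshold` is fixed ahead of time for every video, which is precisely the weakness described next.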
The Problem: These rules are "one-size-fits-all." Sometimes the robot is drawing a calm ocean (easy to skip steps), and sometimes it's drawing a chaotic explosion (hard to skip steps). The old guessers didn't know the difference. They would either:
- Skip too much: The video gets blurry or weird (like a glitchy video game).
- Skip too little: The video is perfect, but it still takes forever to make.
The New Solution: SenCache (The "Sensitivity Sensor")
The authors of this paper, Yasaman Haghighi and Alexandre Alahi, came up with a smarter way. They call it SenCache.
Instead of guessing, SenCache asks the robot a specific question before reusing a picture:
"How much would your drawing change if I nudged the input just a tiny bit?"
This is called Sensitivity.
The Creative Analogy: The Tightrope vs. The Trampoline
Imagine the robot's brain is a landscape:
- The Flat Field (Low Sensitivity): Imagine walking on a flat, grassy field. If you take a step forward or backward, your view doesn't change much. You can walk fast without looking down. This is when SenCache says: "Safe to reuse the old picture!"
- The Tightrope (High Sensitivity): Imagine walking on a wobbly tightrope. If you move even a millimeter, your balance shifts wildly. You need to look down and adjust constantly. This is when SenCache says: "Stop! Do the hard work. Do not reuse the old picture."
How SenCache Works:
- It measures the "wobble." It calculates how sensitive the robot is to changes in the noise (the blurry picture) and the time (which step we are on).
- It decides dynamically.
- If the robot is in a "flat field" zone (calm parts of the video), SenCache skips the calculation and reuses the cached result.
- If the robot is on a "tightrope" (complex parts of the video), SenCache forces the robot to do the full calculation to ensure quality.
- It adapts per video. Unlike the old methods that used the same rules for every video, SenCache looks at this specific video and decides, "Okay, this scene is easy, let's skip. This scene is hard, let's work."
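The decision rule above can be sketched as follows. Everything here is a hypothetical illustration of the idea, not the paper's actual estimator: the naive finite-difference probe below calls the model extra times, whereas the real method must estimate sensitivity cheaply to actually save compute.

```python
import numpy as np

def estimate_sensitivity(model, x, t, dt, eps=1e-3, seed=0):
    """Probe how much the output moves when the noisy input and the
    timestep are each nudged a little (finite differences).
    Naive and costly -- only an illustration of the 'wobble' idea."""
    rng = np.random.default_rng(seed)
    base = model(x, t)
    dx = eps * rng.standard_normal(x.shape)
    sens_x = np.linalg.norm(model(x + dx, t) - base) / eps       # noise wobble
    sens_t = np.linalg.norm(model(x, t + dt) - base) / abs(dt)   # time wobble
    return sens_x + sens_t

def sencache_step(model, x, t, dt, cache, threshold):
    """Reuse the cached output only in low-sensitivity 'flat field'
    zones; recompute on the 'tightrope'."""
    if cache.get("out") is not None:
        s = estimate_sensitivity(model, x, t, dt)
        if s < threshold:
            return cache["out"]  # flat field: safe to reuse
    cache["out"] = model(x, t)   # tightrope: do the hard work
    return cache["out"]
```

The key contrast with the old heuristic is that the skip decision depends on the model's measured behavior at this step of this video, not on a fixed schedule.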
Why This is a Big Deal
- No Training Required: You don't need to retrain the robot. You just give it a new set of instructions (the sensitivity rule) to follow while it works.
- Better Quality at the Same Speed: In tests with top video models (like Wan 2.1 and CogVideoX), SenCache produced clearer, sharper videos than the old guessing methods at comparable speedups.
- The "Why" is Clear: The paper explains why the old methods failed. They only looked at one thing (like just the time step or just the picture difference). SenCache looks at both the time and the picture together, realizing that sometimes the time matters more, and sometimes the picture matters more.
The Bottom Line
SenCache is like giving the robot artist a pair of smart glasses. Instead of blindly following a schedule, the robot can now "see" which parts of the video are boring and safe to skip, and which parts are exciting and need full attention.
The result? You get high-quality videos much faster, without the robot getting tired or making mistakes. It turns a slow, rigid process into a smart, adaptive one.