Imagine you are an artist trying to paint a masterpiece, but you have to repaint the entire canvas about 50 times before a single image is finished. That's roughly how modern AI image generators (called Diffusion Transformers) work. They start with a noisy, static-filled screen and slowly "denoise" it step by step until a clear picture appears.
The problem? Doing all 50 steps is incredibly slow and expensive for computers.
The Old Way: "Guessing the Next Step"
To speed things up, researchers tried a trick called Feature Caching.
Think of the AI's brain as a factory with many workers (modules). Every time the AI takes a step, these workers do a heavy calculation.
- The Idea: "Hey, the picture didn't change much between step 10 and step 11. Let's just copy the work from step 10 and skip the calculation for step 11!"
- The Problem: Sometimes the picture does change drastically. If you just copy the old work, you get blurry or weird artifacts.
- The Previous Fix: Some smart researchers tried to predict the next step using math (like looking at the last two steps and drawing a straight line to guess the third). They called this "Taylor extrapolation."
- The Flaw: Imagine trying to predict the path of a drunk person walking. If you just draw a straight line based on their last two steps, you'll be wrong because they might suddenly stumble or turn. The AI's "steps" are just as unpredictable. The math was too rigid.
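To make the "straight line" idea concrete, here is a minimal sketch of first-order (linear) extrapolation, the kind of guess the old caching methods made. The function name and toy numbers are mine, purely for illustration; the paper's actual Taylor caching operates on large feature tensors inside the model.

```python
import numpy as np

def taylor_extrapolate(feat_prev2, feat_prev1):
    """First-order (linear) extrapolation: assume the feature keeps
    changing by the same amount it changed in the last step."""
    return feat_prev1 + (feat_prev1 - feat_prev2)

# A smooth trajectory extrapolates well...
pred = taylor_extrapolate(np.array([1.0]), np.array([2.0]))
print(pred)  # [3.] -- the straight line lands on the true next value

# ...but a "stumbling" (chaotic) trajectory does not.
actual_next = np.array([0.5])  # the walk suddenly turned
print(float(abs(pred - actual_next)[0]))  # 2.5 -- a large error
```

If the real step jumps off the line, as diffusion steps often do, the extrapolated guess is badly wrong, which is exactly the flaw described above.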
The New Solution: "Relational Feature Caching" (RFC)
The authors of this paper (from Yonsei University) realized that while the steps themselves are chaotic, there is a secret relationship between what goes into a worker and what comes out.
They introduced a framework called RFC with two main tools:
1. RFE: The "Input-Output Translator" (Relational Feature Estimation)
Instead of just guessing the next step based on time (like the old methods), this tool looks at the input.
- The Analogy: Imagine a chef (the AI module). If you give the chef a slightly spicier ingredient (Input Change), you know the soup will taste slightly spicier (Output Change).
- How it works: The old methods tried to guess the soup's taste just by looking at the clock. RFE looks at the ingredient bowl. It calculates: "If the input changed by X amount, the output will likely change by Y amount."
- The Result: Because the relationship between input and output is very stable (even if the steps are chaotic), this guess is much more accurate than just guessing based on time.
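The chef analogy can be sketched in a few lines. This is a simplified illustration, not the paper's actual estimator: the function name `rfe_estimate` and the single scalar `gain` (standing in for the assumed-stable input-to-output relationship) are my inventions.

```python
import numpy as np

def rfe_estimate(cached_output, cached_input, current_input, gain=1.0):
    """Estimate a module's output from how much its INPUT moved,
    instead of extrapolating along the timestep axis.
    `gain` is a toy stand-in for the stable input-output relation."""
    input_change = current_input - cached_input
    return cached_output + gain * input_change

# Toy "module" whose input-output relation really is stable.
module = lambda x: 2.0 * x + 1.0

x_old = np.array([1.0, 2.0])   # input at the last fully computed step
x_new = np.array([1.3, 1.9])   # input at the current (skipped) step

est = rfe_estimate(module(x_old), x_old, x_new, gain=2.0)
print(np.allclose(est, module(x_new)))  # True for this linear toy module
```

The point of the toy: the estimate is exact not because the timesteps are smooth, but because the input-to-output relationship is stable, which is the relationship RFE exploits.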
2. RCS: The "Smart Alarm Clock" (Relational Cache Scheduling)
Even with a good translator, sometimes the chef gets overwhelmed and makes a mistake. You don't want to check the soup every single second (too slow), but you don't want to wait until it burns (too late).
- The Analogy: Instead of checking the soup on a fixed schedule (e.g., every 5 minutes), RCS listens to the steam. If the steam (the error in the input) starts rising fast, the alarm goes off, and the chef stops guessing and actually tastes the soup (does the full calculation).
- How it works: It monitors the "input prediction error." If the input is changing wildly, it knows the output will be wrong too, so it triggers a full calculation. If things are calm, it keeps skipping calculations to save time.
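The "smart alarm clock" boils down to a threshold check on the input prediction error. A minimal sketch, with a hypothetical function name and an arbitrary relative-error threshold that I chose for illustration:

```python
import numpy as np

def should_recompute(predicted_input, actual_input, threshold=0.1):
    """The 'smart alarm clock': compare the input we predicted for this
    step against the input we actually received. A large gap means the
    cached estimate is unsafe, so trigger the full calculation."""
    error = np.linalg.norm(actual_input - predicted_input) / (
        np.linalg.norm(actual_input) + 1e-8)
    return error > threshold

# Calm step: input barely drifted, keep skipping.
print(should_recompute(np.array([1.0, 1.0]), np.array([1.01, 0.99])))  # False

# Wild step: input changed a lot, sound the alarm and recompute.
print(should_recompute(np.array([1.0, 1.0]), np.array([2.0, 0.0])))    # True
```

So instead of a fixed schedule ("recompute every 5 steps"), the decision adapts to how turbulent each step actually is.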
Why This Matters
- Speed: It skips the heavy lifting whenever it's safe to do so.
- Quality: It doesn't skip when the picture is changing fast, so the final image stays sharp and detailed.
- The Analogy Summary:
- Old Method: Driving a car by looking at the rearview mirror and guessing where the road goes next. You might crash if the road curves.
- RFC: Driving a car while looking at the steering wheel (the input). You know exactly how much the car will turn based on how much you turn the wheel, so you can anticipate the curve perfectly without crashing.
The Bottom Line
The researchers tested this on various AI models (for images and videos) and found that RFC produces much higher quality images and videos than previous methods, while using the same amount of computer power. It's like getting a Ferrari engine upgrade for free just by changing how you look at the road.