TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration

The Big Problem: The "Slow Motion" Movie

Imagine you are trying to turn a rough, blurry sketch into a beautiful, high-definition painting. A Diffusion Model (the AI doing the painting) does this by taking many small steps. In each step, it looks at the current image, guesses what the next step should look like, and makes a tiny improvement.

The Catch: To get a perfect picture, the AI usually needs to take 50 to 100 steps. This is like watching a movie in extreme slow motion. It takes a long time and uses a lot of computer power.
The Industry Need: In the real world (like making a video for TikTok or a game), we can't wait that long. We need the AI to finish in 20 steps or fewer.
The Old Solution (The Broken Shortcut): To speed things up, previous methods tried to "cheat." They said, "Hey, the image didn't change much in the last step, so let's just copy the last step's work!" or "Let's guess the next step using a simple straight line."
- The Failure: When you force the AI to take big jumps (fewer steps), these simple guesses fail. The image starts to glitch, colors get weird, and the structure falls apart. It's like trying to drive a car at 100 mph by only looking at the road once every mile; you'll crash.

The New Solution: TC-Padé (The "Smart Navigator")

The authors of this paper created TC-Padé. Think of it as a Smart Navigator for the AI's painting process. Instead of just copying the past or drawing a straight line, it uses a special mathematical tool called Padé Approximation to predict the future.

Here is how it works, broken down into three simple concepts:

1. The "Rational Function" vs. The "Straight Line"

The Old Way (Taylor Series): Imagine you are walking up a hill. The old AI methods assume the hill is a straight ramp. If you take a small step, that's fine. But if you take a giant leap, the "straight line" guess will miss the curve of the hill, and you'll fall off a cliff.
The TC-Padé Way: TC-Padé knows the hill might curve, twist, or have a sudden drop. It uses a Rational Function (a fancy fraction of two polynomials).
- Analogy: If the old method is a ruler (straight line), TC-Padé is a flexible measuring tape that can bend to fit the shape of the terrain. It can handle sudden changes and curves much better, allowing the AI to take giant leaps without falling off the cliff.

2. Watching the "Changes" Instead of the "Picture"

The Old Way: The old AI tried to predict the entire next picture. That's like trying to predict the exact position of every single grain of sand in a sandcastle. It's too much data, and small errors add up fast.
The TC-Padé Way: TC-Padé only predicts the difference (the residual) between the current picture and the next one.
- Analogy: Instead of describing the whole new painting, the AI just says, "Add a little blue here, and darken the shadow there."
- Why it helps: The "changes" are much smaller and more predictable than the whole image. It's like predicting the wind rather than predicting the entire weather system. This makes the prediction much more accurate, even when taking big steps.

3. The "Traffic Light" System (Adaptive Strategy)

The paper realizes that not all parts of the painting process are the same.

Early Stage (High Noise): The image is just a blur. The AI needs to make big, structural changes. TC-Padé uses a simple, fast guess here.
Middle Stage: The image is forming. TC-Padé uses its "flexible tape" (the Padé math) to navigate the complex curves.
Late Stage (Fine Details): The image is almost done. TC-Padé gets very careful, looking at tiny speed changes to add the final polish.
The Traffic Light: The system has a "Stability Indicator" (a traffic light).
- Green Light: The path is smooth? Skip the heavy math! Just use the prediction.
- Red Light: The path is getting bumpy or unstable? Stop! Do the full calculation to make sure we don't mess up.

The Results: Speed Without the Crash

The researchers tested this on powerful AI models (like FLUX.1 and Wan2.1 for video).

Speed: They managed to make the AI 2.88 times faster on image generation and 1.72 times faster on video generation.
Quality: Usually, when you speed up an AI this much, the quality drops like a stone. But with TC-Padé, the quality stayed almost exactly the same.
- Visual: The images didn't get blurry or weird.
- Video: The videos didn't glitch or lose their shape.

Summary

TC-Padé is like upgrading from a car that drives in a straight line and crashes on curves, to a self-driving car with a flexible suspension. It knows when to speed up and when to slow down, and it predicts the road ahead using a flexible map rather than a rigid ruler. This allows us to generate high-quality AI art and videos up to roughly three times faster, without sacrificing the beauty of the final result.

1. Problem Statement

Diffusion models, particularly Diffusion Transformers (DiTs), have achieved state-of-the-art generation quality but suffer from high computational costs due to their iterative sampling process (often requiring 50–100 steps). While feature caching techniques exist to accelerate inference, they face critical limitations in low-step regimes (20–30 steps), which are standard for industrial deployment:

Trajectory Drift: As the interval between computation steps increases, the temporal similarity between features decays exponentially. Reuse-based methods (e.g., DeepCache) fail because cached activations no longer align with the current state.
Error Accumulation: Prediction-based methods relying on polynomial extrapolation (e.g., TaylorSeer) suffer from the limited radius of convergence of Taylor series. When the timestep gap is large, these methods accumulate errors, leading to significant visual degradation (artifacts, color shifts, texture loss).
Uniform Strategy Failure: Existing methods apply a single prediction strategy across the entire denoising process, ignoring that feature dynamics evolve differently during early (structural formation), mid, and late (detail refinement) stages.

2. Methodology: TC-Padé

The authors propose TC-Padé, a trajectory-consistent residual prediction framework grounded in Padé approximation. Instead of predicting raw features, it predicts residuals (the incremental updates between layers) using rational functions.

Core Components:

Padé-Inspired Residual Prediction:
- Rational vs. Polynomial: Unlike Taylor series (polynomials), Padé approximants use ratios of polynomials ($P_m(x)/Q_n(x)$). This allows them to model asymptotic behaviors and rapid nonlinear transitions more accurately, which is crucial for the highly nonlinear dynamics of diffusion models at large timestep intervals.
- Residual Focus: The method caches and predicts residuals ($R_t = x_{layer\_r} - x_{layer\_l}$) rather than raw features. Empirical evidence shows residuals maintain higher temporal cosine similarity than raw features, reducing the prediction error.
- Low-Order Approximation: The authors utilize a $[2/1]$ Padé approximant ($k=3$ history steps, $m=1$ numerator order) to balance accuracy and memory overhead.
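To make the rational-versus-polynomial point concrete, here is a small sketch that is not taken from the paper. It compares the cubic Taylor polynomial of $\ln(1+x)$ with the $[2/1]$ Padé approximant built from the same series coefficients. The test function and the names `taylor3` and `pade_2_1` are illustrative assumptions; the paper applies the idea to residual trajectories, not to a scalar function.

```python
import math

def taylor3(x: float) -> float:
    """Degree-3 Taylor polynomial of ln(1 + x) around x = 0."""
    return x - x**2 / 2 + x**3 / 3

def pade_2_1(x: float) -> float:
    """[2/1] Pade approximant of ln(1 + x).

    Numerator degree 2, denominator degree 1; its series expansion
    agrees with the Taylor series through the x^3 term.
    """
    return (x + x**2 / 6) / (1 + 2 * x / 3)

# x = 0.5 lies inside the Taylor series' interval of convergence (-1, 1];
# x = 2.0 and x = 4.0 lie outside it.
for x in (0.5, 2.0, 4.0):
    exact = math.log1p(x)
    print(f"x={x}: exact={exact:.4f} taylor={taylor3(x):.4f} pade={pade_2_1(x):.4f}")
```

At $x = 2$ the cubic Taylor value is about 2.67 against an exact value of about 1.10, while the Padé value is about 1.14. Both approximants use the same three series coefficients; only the functional form differs.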
Adaptive Coefficient Modulation:
- To handle the stochastic nature of diffusion, coefficients are not fixed analytically but are adaptively modulated based on a stability factor ($\sigma_{stab}$).
- A Trajectory Stableness Indicator (TSI) measures the magnitude of recent residual changes. If the trajectory is unstable (TSI < threshold $\theta$), full computation is performed. If stable, the Padé predictor is used.
- The stability factor ensures coefficients dampen rapidly when residuals change abruptly, preventing unstable extrapolation.
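The gating decision can be sketched as follows. The exact TSI formula is not given in this summary, so the score below is an assumption: it maps the relative change between two consecutive residuals into $(0, 1]$, so that lower values mean a less stable trajectory, matching the rule that TSI below $\theta$ triggers full computation. The names `tsi` and `should_recompute` and the default `theta=0.5` are hypothetical.

```python
import math

def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))

def tsi(prev_residual, curr_residual, eps=1e-8):
    """Hypothetical stability score in (0, 1]; 1.0 means no change.

    Assumption: stability is derived from the relative magnitude of the
    change between consecutive residuals. This is not the paper's formula.
    """
    change = l2_norm([c - p for c, p in zip(curr_residual, prev_residual)])
    scale = l2_norm(prev_residual) + eps
    return 1.0 / (1.0 + change / scale)

def should_recompute(prev_residual, curr_residual, theta=0.5):
    """Run the full network when the score falls below the threshold."""
    return tsi(prev_residual, curr_residual) < theta

r_prev = [1.0, -2.0, 0.5, 3.0]
smooth = [1.01, -1.98, 0.5, 3.02]   # small drift: treated as stable
abrupt = [-3.0, 6.0, -1.5, -9.0]    # sign flip and rescale: treated as unstable

print(should_recompute(r_prev, smooth))
print(should_recompute(r_prev, abrupt))
```

With these toy inputs the smooth case keeps the cheap predictor and the abrupt case falls back to full computation.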
Step-Aware Prediction Strategies:
The denoising process is divided into three phases, each with a tailored prediction strategy:
- Early Stage ($t < 0.2T$): Structural information evolves rapidly. The method uses a first-order difference term augmented with Padé prediction to capture subtle velocity changes.
- Middle Stage ($0.2T \le t \le 0.7T$): The full Padé approximant is used to exploit long-range dependencies in the residual trajectory.
- Late Stage ($t > 0.7T$): Focuses on fine-grained refinement using a weighted combination of the two most recent residuals.
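The three-phase split can be expressed as a small dispatch function. This is a minimal sketch under the assumption that $t$ counts sampling progress from the start of denoising up to $T$; the function name `denoising_phase` and the strategy labels are hypothetical, and only the 0.2 and 0.7 cut points come from the text.

```python
def denoising_phase(t: float, total_steps: float) -> str:
    """Map sampling progress t in [0, T] to a phase name.

    Assumption: small t is the high-noise end of sampling. The 0.2 and
    0.7 cut points are the ones quoted in the phase list.
    """
    if t < 0.2 * total_steps:
        return "early"
    if t <= 0.7 * total_steps:
        return "middle"
    return "late"

# Hypothetical labels for the per-phase predictors described in the text.
STRATEGY = {
    "early": "first-order difference + Pade term",
    "middle": "full Pade approximant",
    "late": "weighted blend of the two most recent residuals",
}

for step in (2, 10, 18):
    phase = denoising_phase(step, 20)
    print(step, phase, STRATEGY[phase])
```

Keeping the phase decision separate from the predictors makes it easy to change the cut points without touching the prediction code.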

3. Key Contributions

Rational Function Forecasting: Introduction of a Padé-based predictor for diffusion residuals, overcoming the convergence limitations of Taylor-series-based methods in low-step scenarios.
Trajectory Consistency: A framework that maintains sampling trajectory consistency even with large timestep intervals by combining adaptive stability checks with phase-specific prediction strategies.
Residual-Based Caching: Demonstrating that caching and predicting residuals yields significantly higher temporal similarity and lower error accumulation compared to caching raw features.
Plug-and-Play Acceleration: The method is training-free and compatible with existing DiT architectures (e.g., FLUX.1, Wan2.1).

4. Experimental Results

The method was evaluated on FLUX.1-dev (Text-to-Image), Wan2.1 (Text-to-Video), and DiT-XL/2 (Class-conditional Image) with a fixed budget of 20 steps.

FLUX.1-dev (Image):
- Achieved 2.88× speedup (latency) and 2.94× FLOPs reduction.
- Maintained high quality: FID increased only slightly (23.38 $\to$ 24.14), with strong PSNR (21.96) and SSIM (0.78).
- Outperformed TaylorSeer, which showed severe quality degradation (marked as "†" in the paper's tables) at similar speeds.
Wan2.1 (Video):
- Achieved 1.72× speedup with a VBench-2.0 score of 60.38% (only ~3.8 points below the 20-step baseline).
- Significantly better pixel-level metrics (PSNR, SSIM) compared to TaylorSeer and TeaCache.
DiT-XL/2 (ImageNet):
- Achieved 1.46× speedup with an FID of 6.93 (vs. baseline 3.56), outperforming reuse-based methods (ToCa, $\Delta$-DiT), which suffered FID degradation to >8.0.
Deployment Efficiency:
- When combined with quantization, TC-Padé achieved up to 6× latency reduction and 2.5× throughput improvement over the baseline, demonstrating orthogonality to other acceleration techniques.
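The relative quality cost behind the FID figures in this section can be checked with a few lines of arithmetic. The helper name `relative_increase` is hypothetical; the inputs are the values quoted above.

```python
def relative_increase(baseline: float, accelerated: float) -> float:
    """Fractional increase of a lower-is-better metric such as FID."""
    return (accelerated - baseline) / baseline

# FID values quoted in this section (baseline, with TC-Pade).
flux = relative_increase(23.38, 24.14)
dit_xl2 = relative_increase(3.56, 6.93)

print(f"FLUX.1-dev FID increase: {flux:.1%}")
print(f"DiT-XL/2 FID increase: {dit_xl2:.1%}")
```

This puts the FLUX.1-dev change at roughly 3% and the DiT-XL/2 change at roughly 95% relative to the respective baselines.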

5. Significance

TC-Padé addresses a critical bottleneck in the practical deployment of diffusion models: the trade-off between inference speed and generation quality in low-step regimes. By shifting from polynomial to rational function approximation and focusing on residual dynamics, it enables high-fidelity generation with significantly fewer network evaluations. This makes high-quality generative AI more viable for latency-sensitive applications (e.g., real-time video generation, interactive tools) without requiring model retraining or architectural changes.