Training-free Latent Inter-Frame Pruning with Attention Recovery

This paper introduces LIPAR, a training-free framework that accelerates video generation by pruning redundant latent patches and recovering attention values to maintain quality, thereby achieving a 1.45x throughput increase without compromising visual fidelity.

Dennis Menn, Yuedong Yang, Bokun Wang, Xiwen Wei, Mustafa Munir, Feng Liang, Radu Marculescu, Chenfeng Xu, Diana Marculescu

Published 2026-03-09

Imagine you are a director filming a scene with a very expensive, slow-moving camera. You are filming a cartoon Santa Claus walking through a room.

The Problem:
The camera is so powerful that it takes a photo of every single pixel in the room, even the parts that aren't moving.

  • Frame 1: Santa walks in. The camera snaps a picture of Santa, the floor, the walls, and the ceiling.
  • Frame 2: Santa takes one step. The camera snaps another picture. But wait! The floor, the walls, and the ceiling look exactly the same as in Frame 1. Only Santa moved a tiny bit.
  • Frame 3: Santa takes another step. The camera snaps a third picture. Again, the background is identical.

The current AI video generators are like this camera. They waste a massive amount of time and battery power re-calculating the "floor" and "walls" for every single frame, even though they haven't changed. This makes generating videos slow and expensive.

The Solution: LIPAR (The Smart Editor)
The authors of this paper created a new method called LIPAR (Latent Inter-Frame Pruning with Attention Recovery). Think of it as a super-smart editor who watches the footage and says, "Hey, we don't need to film the wall again!"

Here is how it works, broken down into three simple steps:

1. The "Lazy" Pruning (Skipping the Boring Parts)

Instead of re-filming the whole scene, LIPAR looks at the previous frame. If a patch of the image (like the background wall) hasn't changed, it skips calculating it for the new frame.

  • Analogy: Imagine you are writing a story. In Chapter 1, you describe the room in great detail. In Chapter 2, the room is the same. Instead of rewriting the description of the room, you just say, "The room remained the same," and only write about the new action (Santa moving).
  • Result: The computer does way less work, making the video generation much faster.
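The skipping step above can be sketched in a few lines. This is a minimal illustration, not the paper's actual code: the patch representation, the mean-absolute-difference criterion, and the `threshold` value are all assumptions made for the example.

```python
import numpy as np

def prune_static_patches(prev_latent, curr_latent, threshold=0.05):
    """Sketch of inter-frame pruning: skip patches that barely changed.

    prev_latent, curr_latent: arrays of shape (num_patches, dim).
    Returns the indices of patches that still need computation and a
    boolean mask marking the pruned (reused) patches.
    The change metric and threshold are illustrative, not from the paper.
    """
    # Mean absolute change per patch between consecutive frames
    diff = np.abs(curr_latent - prev_latent).mean(axis=1)
    keep = diff >= threshold   # patches that actually changed ("Santa")
    pruned = ~keep             # static patches: reuse the old result
    return np.nonzero(keep)[0], pruned

# Toy example: 4 patches, only patch 2 moves between frames
prev = np.zeros((4, 8))
curr = prev.copy()
curr[2] += 1.0  # "Santa" moved inside patch 2
active, pruned_mask = prune_static_patches(prev, curr)
print(active)       # only the moving patch gets recomputed
print(pruned_mask)  # the rest are copied from the previous frame
```

Only the `active` patches go through the expensive model; the `pruned_mask` patches just reuse last frame's cached values, which is where the speedup comes from.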

2. The "Glitch" Problem (Why Skipping is Dangerous)

If you just skip the calculation and copy-paste the old background, you might run into a problem.

  • The Analogy: Imagine a musician playing a song. If you just copy-paste a note from the previous measure without changing the volume or the "vibe," the music starts to sound robotic and flat. In AI video, if you just copy the old data, the AI gets confused because it was trained to expect new data every time. This causes weird visual glitches, like the background shimmering or looking "noisy."

3. The "Attention Recovery" (The Magic Fix)

This is the paper's secret sauce. LIPAR doesn't just skip the work; it uses a clever trick to fake the calculation so the AI doesn't notice the difference.

  • The Analogy: Think of the AI as a chef making a soup. The chef is used to stirring the pot with a specific rhythm. If you suddenly stop stirring (pruning), the soup burns.
    • LIPAR's trick: The chef stops stirring the whole pot but uses a special spoon to gently mimic the stirring motion just enough to keep the soup perfect.
    • The "Noise" Secret: The paper discovered that AI video generation adds a little bit of random "static" (noise) to every frame, like grain in a photo. If you just copy the old frame, you accidentally copy the same static, which makes the video look weird. LIPAR's "Attention Recovery" ensures that even though it's reusing old data, it adds the right kind of new static so the AI thinks it's seeing a fresh, natural frame.
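The recovery idea can be sketched as follows. To be clear, this is a hedged toy version of the *concept* (refresh the noise on reused patches instead of copying it verbatim), not the paper's actual attention-recovery math; the function name, the additive-noise blend, and `noise_scale` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def recover_pruned_outputs(cached_out, pruned_mask, noise_scale=0.1):
    """Toy sketch: instead of copying cached outputs verbatim for pruned
    patches (which also copies last frame's stale "static"), re-inject
    freshly sampled noise so the reused values look statistically like a
    fresh computation. The blending scheme is illustrative only.
    """
    out = cached_out.copy()
    fresh = rng.standard_normal(cached_out.shape) * noise_scale
    # Only the pruned (reused) patches get new noise; recomputed
    # patches already carry fresh values and are left untouched.
    out[pruned_mask] = cached_out[pruned_mask] + fresh[pruned_mask]
    return out

# Toy example: patches 0, 1, 3 were pruned; patch 2 was recomputed
cached = np.zeros((4, 8))
mask = np.array([True, True, False, True])
recovered = recover_pruned_outputs(cached, mask)
```

In this sketch, the recomputed patch comes through unchanged while every reused patch picks up new randomness, which is the "right kind of new static" the analogy above describes.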

The Results: Speed vs. Quality

Before LIPAR, generating these videos was like driving a car at 8 miles per hour.

  • With LIPAR: The car speeds up to about 11.6 miles per hour (a 45% increase in speed, matching the 1.45x throughput gain).
  • Memory: It uses 29% less computer memory (like needing a smaller gas tank).
  • Quality: The video looks just as good as the slow version. Humans couldn't tell the difference in blind tests!

Summary

LIPAR is like a smart video editor that knows when to stop working. It realizes that if the background isn't moving, it doesn't need to re-calculate it. But unlike a lazy editor who would just copy-paste and ruin the quality, LIPAR uses a mathematical "magic trick" to make the AI think it did the work, keeping the video smooth, fast, and high-quality.

It bridges the gap between old-school video compression (which skips static pixels) and modern AI video generation (which usually calculates everything), making real-time AI video a reality.