LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation

LinVideo is a data-free post-training framework that brings attention in video generation down to O(n) complexity. It automatically selects which layers to convert to linear attention by framing the choice as a binary classification problem, and it trains with an anytime distribution matching objective, yielding significant speedups while preserving generation quality.

Yushi Huang, Xingtong Ge, Ruihao Gong, Chengtao Lv, Jun Zhang

Published 2026-02-24

Imagine you have a brilliant, world-class chef (the Video Diffusion Model) who can cook up incredibly realistic, high-definition movies from a simple description like "a dragon flying over a castle." This chef is amazing, but there's a catch: they are incredibly slow and expensive to run.

Why? Because to cook a 10-second video, the chef has to look at every single frame and compare it to every other frame to make sure the dragon's wings move smoothly and the clouds don't flicker. If the video is long, this "comparing everything to everything" task grows explosively large. It's like trying to introduce every single guest at a massive wedding to every other guest; the number of handshakes becomes impossible to manage. In tech terms, this is called quadratic complexity, written O(n²).
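The handshake problem can be made concrete with a minimal NumPy sketch of standard softmax attention (an illustration, not the paper's code). The giveaway is the (n, n) score matrix: every token is compared with every other token.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Every token is compared with every other token: the score matrix S
    # has shape (n, n), so memory and compute grow as O(n^2).
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    S = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable softmax
    A = S / S.sum(axis=-1, keepdims=True)
    return A @ V

n, d = 1024, 64  # n video tokens, d feature channels
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
out = softmax_attention(Q, K, V)  # materializes a 1024 x 1024 score matrix
```

Doubling the video length quadruples the size of S, which is why long videos become prohibitively expensive.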

The paper introduces LINVIDEO, a new "kitchen renovation" that makes this chef faster without hiring a new one or buying new ingredients. Here is how they did it, explained simply:

1. The Problem: The "Slow Chef" vs. The "Fast Assistant"

Scientists already knew about a "Fast Assistant" (called Linear Attention) who can cook much faster, in O(n) time, by using a shortcut. Instead of introducing every guest to everyone, the assistant just remembers the general vibe of the room.
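The shortcut can be sketched with one common kernelized formulation of linear attention (not necessarily the exact variant used in the paper). Reordering the multiplication so that phi(K)ᵀV is computed first means the (n, n) matrix is never built:

```python
import numpy as np

def linear_attention(Q, K, V):
    # Kernel trick: approximate softmax(Q K^T) V by phi(Q) (phi(K)^T V).
    # phi(K)^T V is a small (d, d) summary — the "general vibe of the room" —
    # so the cost is O(n * d^2): linear in the number of tokens n.
    phi = lambda x: np.maximum(x, 0.0) + 1e-6    # a simple positive feature map
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                                # (d, d), independent of n
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T     # (n, 1) normalizer
    return (Qp @ KV) / Z

n, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
out = linear_attention(Q, K, V)  # no n x n score matrix is ever built
```

The (d, d) summary discards the exact pairwise comparisons, which is precisely why naively swapping it in degrades quality.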

However, there's a big problem: If you just swap the Master Chef for the Fast Assistant, the food tastes terrible. The video becomes blurry, the motion is jerky, and the dragon looks like a blob. Usually, to fix this, you'd have to send the Fast Assistant back to culinary school for years (called pre-training) to learn the Master Chef's secrets. That takes too much time and money.

The Question: Can we teach the Master Chef to use the Fast Assistant's shortcuts right now, without sending them back to school?

2. The Solution: LINVIDEO (The Smart Renovation)

The authors created a framework called LINVIDEO. Think of it as a smart renovation crew that upgrades the kitchen while the chef is still working, without needing a new recipe book (data-free).

They used two clever tricks:

Trick A: The "Selective Transfer" (Don't Fire Everyone)

The team realized that not all parts of the chef's brain are equally important.

  • The Analogy: Imagine the chef has 30 different stations (layers). Some stations handle the basic chopping (early layers), while others handle the complex plating and final garnish (deep layers).
  • The Mistake: If you replace the plating station with a fast-but-dumb robot, the dish looks ugly. If you replace the chopping station, the robot might mess up the knife skills.
  • The LINVIDEO Fix: Instead of guessing which stations to upgrade, they gave the kitchen a "smart switch" for every station. This switch learns, through trial and error, which stations can safely be swapped for the Fast Assistant and which ones must stay as the Master Chef.
  • The Result: They automatically found the perfect mix: "Keep the Master Chef on the complex plating, but let the Fast Assistant handle the chopping." This minimizes the drop in quality.
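The "smart switch" idea can be sketched as a learnable gate per layer that blends the two attention types during post-training and is rounded to a hard keep/swap decision afterwards. All names here are illustrative, not the paper's API:

```python
import numpy as np

def quadratic_attn(Q, K, V):
    # Standard softmax attention — the "Master Chef".
    S = np.exp(Q @ K.T / np.sqrt(Q.shape[-1]))
    return (S / S.sum(axis=-1, keepdims=True)) @ V

def linear_attn(Q, K, V):
    # Kernelized linear attention — the "Fast Assistant".
    Qp = np.maximum(Q, 0.0) + 1e-6
    Kp = np.maximum(K, 0.0) + 1e-6
    return (Qp @ (Kp.T @ V)) / (Qp @ Kp.sum(axis=0)[:, None])

class HybridLayer:
    """Illustrative per-layer switch: a gate g in [0, 1] mixes the two
    attentions during post-training, then collapses to a hard 0/1 choice."""
    def __init__(self, gate=0.5):
        self.gate = gate  # learnable in the real method; a plain float here

    def __call__(self, Q, K, V):
        g = self.gate
        return (1.0 - g) * quadratic_attn(Q, K, V) + g * linear_attn(Q, K, V)

layers = [HybridLayer() for _ in range(4)]
# After training, each gate settles to 0 (keep the chef) or 1 (use the assistant):
for i, layer in enumerate(layers):
    layer.gate = 1.0 if i < 2 else 0.0  # e.g. early layers go linear
```

Because each gate is trained rather than hand-picked, the mix of "chef" and "assistant" layers is discovered automatically.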

Trick B: The "Anytime Distribution Matching" (The Real-Time Taste Test)

Usually, when you try to speed up a model, you only check if the final video looks good. But in video generation, if the middle of the video is weird, the end will be weird too.

  • The Analogy: Imagine a student taking a test. If you only grade them on the final answer, they might have guessed their way there. But if you check their work at every step of the problem, you can correct them immediately.
  • The LINVIDEO Fix: They created a new "taste test" called Anytime Distribution Matching (ADM). Instead of waiting until the video is finished to see if it's good, they check the "flavor" of the video at every single second of the cooking process. They force the Fast Assistant to match the Master Chef's style at every moment, not just at the end.
  • The Result: This prevents the video from getting "jittery" or flickering, ensuring the whole movie feels smooth and natural.
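A heavily simplified sketch of the "check their work at every step" idea (the real ADM objective matches distributions; a per-timestep prediction error stands in as a proxy here, and all names are illustrative): the student is compared with the teacher at every sampled noise level, not just on the final output.

```python
import numpy as np

def anytime_matching_loss(student, teacher, x0, timesteps, rng):
    # Compare student and teacher at EVERY noise level t along the
    # trajectory — grading each step of the work, not just the final answer.
    loss = 0.0
    for t in timesteps:
        noise = rng.standard_normal(x0.shape)
        x_t = np.sqrt(1.0 - t) * x0 + np.sqrt(t) * noise  # noised input at level t
        loss += np.mean((student(x_t, t) - teacher(x_t, t)) ** 2)
    return loss / len(timesteps)

teacher = lambda x, t: 0.5 * x   # stand-in for the original model
student = lambda x, t: 0.4 * x   # stand-in for the linearized model
rng = np.random.default_rng(0)
x0 = np.ones((2, 16))            # stand-in for clean video latents
loss = anytime_matching_loss(student, teacher, x0, [0.1, 0.5, 0.9], rng)
```

Penalizing the mismatch at every noise level, rather than only at the end, is what keeps intermediate frames from drifting and the final video from flickering.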

3. The Results: Fast, Cheap, and Tasty

After this renovation, the results were impressive:

  • Speed: The video generation became 1.4 to 1.7 times faster just by swapping the attention layers.
  • Super Speed: When they combined this with a technique to skip steps (distillation), they created a "4-step" model that is 16 to 21 times faster than the original!
  • Quality: The videos still looked amazing. The "Master Chef" quality was preserved, even though they were using the "Fast Assistant" for most of the work.

Summary

LINVIDEO is like taking a slow, expensive luxury car and installing a high-performance engine in just the right parts of the chassis. You don't need to rebuild the whole car from scratch (pre-training), and you don't need a new driver (new data). You just tweak the existing machine so it drives far faster while still getting you to the destination in style.

This is a huge step forward because it means we can generate high-quality AI videos on regular computers much faster, making creative tools accessible to everyone, not just big tech companies with massive servers.
