OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot

Imagine you have a master chef (a massive AI model like Stable Diffusion) who can cook up incredibly delicious, photorealistic images from a simple text recipe. This chef is brilliant, but they are also huge, expensive, and slow. They need a giant kitchen (lots of computer memory) and take a long time to prepare every dish because they follow a very detailed, step-by-step process.

The problem? Most people don't have a giant kitchen or the time to wait. We want a chef who is just as good but smaller, faster, and fits in a regular home kitchen.

Enter OBS-Diff. Think of it as a super-smart slimming coach for these AI chefs. It doesn't teach the chef new recipes (no training required); instead, it surgically removes the parts of the chef's brain that aren't actually needed, making them faster without ruining their cooking skills.

Here is how OBS-Diff works, explained with some everyday analogies:

1. The Problem: Why Old Methods Fail

Imagine the AI chef doesn't just cook a meal in one go. They cook it in 28 tiny steps, starting with a blurry blob and slowly refining it into a clear picture.

Old pruning methods are like a chef who only looks at the final plate to decide what ingredients to throw away. They might accidentally throw away a secret spice that was crucial for the first step of the cooking process. By the time the dish is done, it tastes terrible.
The Diffusion Challenge: Because the AI builds the image step-by-step, a mistake made in the first step ruins the whole dish. Old methods didn't understand this "step-by-step" nature.

2. The Solution: OBS-Diff's Three Superpowers

A. The "Time-Travel" Weighting System (Timestep-Aware Hessian)

OBS-Diff realizes that early steps are more important than late steps.

The Analogy: Imagine building a house. If you mess up the foundation (Step 1), the whole house collapses, no matter how pretty the paint is on the roof (Step 28). If you mess up the paint (Step 28), the house is still standing.
How it works: OBS-Diff puts a "magnifying glass" on the early steps. It says, "We must be super careful not to cut any weights (ingredients) used in the first few steps." It uses a special mathematical formula (a logarithmic schedule) to prioritize the beginning of the process, ensuring the foundation stays solid.
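
To make the "magnifying glass" concrete, here is a minimal sketch of what a logarithmically decreasing schedule could look like. The exact formula is not given in this summary, so the function `log_decreasing_weights` and its `log(T - t + 2)` shape are assumptions chosen purely for illustration; only the idea (earlier steps get larger weights) comes from the text.

```python
import math
from typing import List


def log_decreasing_weights(num_steps: int) -> List[float]:
    """Hypothetical schedule: the weight of step t (1-based) shrinks like log(T - t + 2)."""
    raw = [math.log(num_steps - t + 2) for t in range(1, num_steps + 1)]
    total = sum(raw)
    return [r / total for r in raw]  # normalise so the weights sum to 1


weights = log_decreasing_weights(28)
print(f"step 1: {weights[0]:.4f}  step 14: {weights[13]:.4f}  step 28: {weights[27]:.4f}")
```

With this shape, step 1 always receives the largest weight and step 28 the smallest, but no step is ignored entirely.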

B. The "Group Surgery" Strategy (Module Packages)

Usually, to trim a giant AI, you have to test it layer by layer. For a diffusion model, testing one layer means running the entire 28-step cooking process. Doing this for every single layer would take forever (like testing a single spice by cooking a whole meal 1,000 times).

The Analogy: Instead of testing one spice at a time, OBS-Diff groups the spices into batches (called "Module Packages").
How it works: It runs the cooking process once, but while it's cooking, it collects data on all the spices in that batch simultaneously. Then, it trims the whole batch at once. This is like a surgeon operating on a whole group of organs at once rather than one by one, saving massive amounts of time and energy.

C. The "One-Shot" Miracle (Training-Free)

Most ways to make an AI smaller require re-teaching it (fine-tuning), which takes days and huge amounts of electricity.

The Analogy: Imagine you have a library of books. Most methods say, "Let's delete some pages, then hire a teacher to re-write the whole book to make sense again."
OBS-Diff says: "No teacher needed." It uses a classic mathematical trick called Optimal Brain Surgeon (OBS). It calculates exactly which pages (weights) can be removed and how to slightly adjust the remaining pages so the story still makes sense. It does this in one shot, with no retraining afterward.
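
For readers who want to see the "remove a page, adjust the others" trick in numbers, here is a minimal sketch of a single classic OBS step on three weights. The weights and the Hessian below are made-up toy values, and the helper names (`invert`, `obs_prune_one`, `loss_increase`) are hypothetical; this illustrates the textbook OBS rule, not the paper's implementation.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def invert(matrix: Matrix) -> Matrix:
    """Gauss-Jordan inverse for a small, well-conditioned square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def obs_prune_one(weights: List[float], hessian: Matrix) -> Tuple[int, List[float]]:
    """Remove the weight with the smallest OBS saliency and compensate the others."""
    h_inv = invert(hessian)
    saliency = [w * w / (2.0 * h_inv[i][i]) for i, w in enumerate(weights)]
    q = min(range(len(weights)), key=saliency.__getitem__)
    step = weights[q] / h_inv[q][q]
    updated = [w - step * h_inv[i][q] for i, w in enumerate(weights)]
    updated[q] = 0.0  # zero by construction; set explicitly to avoid round-off
    return q, updated


def loss_increase(delta: List[float], hessian: Matrix) -> float:
    """Second-order estimate 0.5 * delta^T H delta of the error increase."""
    n = len(delta)
    return 0.5 * sum(delta[i] * hessian[i][j] * delta[j] for i in range(n) for j in range(n))


weights = [0.9, -0.1, 0.5]                      # toy values
hessian = [[2.0, 0.6, 0.0],
           [0.6, 1.0, 0.2],
           [0.0, 0.2, 1.5]]                     # toy symmetric positive-definite Hessian
pruned_index, new_weights = obs_prune_one(weights, hessian)

naive = list(weights)
naive[pruned_index] = 0.0                       # zero the weight without compensation
obs_cost = loss_increase([a - b for a, b in zip(new_weights, weights)], hessian)
naive_cost = loss_increase([a - b for a, b in zip(naive, weights)], hessian)
print(pruned_index, new_weights, obs_cost, naive_cost)
```

In this toy case the compensated update costs less than simply zeroing the same weight, which is the whole point of adjusting the remaining "pages".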

3. The Results: What Happens?

The paper tested this on some of the world's most famous image generators (like Stable Diffusion 3 and Flux).

The Test: They tried to shrink the models by 50% to 70% (removing half or more of the brain).
The Outcome:
- Old methods: The images became garbage—blurry, distorted, or nonsensical.
- OBS-Diff: The images looked almost identical to the original giant models! The chef could still cook a perfect "portrait of a human growing flowers from their hair" even after losing half their brain.
- Speed: Because the model is smaller, it generates images faster (up to about 1.3x faster in some cases).

Summary

OBS-Diff is like a master sculptor who knows exactly which parts of a giant stone statue to chip away to make it lighter and faster, without breaking the statue's face. It understands that the beginning of the creation process is the most critical, groups its work to save time, and does it all in a single pass without needing to retrain the AI.

It allows us to run these powerful, high-quality AI image generators on smaller, cheaper computers, making them accessible to everyone.

1. Problem Statement

Large-scale text-to-image diffusion models (e.g., Stable Diffusion 3, Flux) offer high-quality generation but suffer from prohibitive computational costs and memory requirements due to their massive parameter counts (often billions). While model compression techniques like pruning exist, they face significant challenges when applied to diffusion models:

Iterative Nature: Unlike standard feed-forward networks, diffusion models operate via an iterative denoising process where parameters are shared across multiple timesteps. Existing one-shot pruning methods (designed for LLMs) fail to account for the compounding effect of errors introduced in early timesteps.
Architectural Diversity: Modern diffusion models use diverse architectures (e.g., Multimodal Diffusion Transformers or MMDiT), rendering architecture-specific pruning methods (like those for U-Nets) non-generalizable.
Training Overhead: Many existing pruning methods require expensive gradient-based fine-tuning or retraining, negating the efficiency benefits of compression.
Granularity Limitations: There is a lack of effective methods for unstructured and semi-structured (N:M) pruning in large-scale diffusion models.

2. Methodology: OBS-Diff

The authors propose OBS-Diff, a novel, one-shot, training-free pruning framework that adapts the classic Optimal Brain Surgeon (OBS) algorithm to the unique dynamics of diffusion models. The framework consists of three core components:

A. Timestep-Aware Hessian Construction

Standard layer-wise pruning minimizes reconstruction error for a single forward pass. OBS-Diff reformulates this to account for the iterative denoising trajectory.
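
For reference, the classic layer-wise OBS quantities can be written as follows. This is the standard textbook formulation, not a transcription of the paper: it assumes $w$ is one row of the layer's weight matrix (written as a column vector), $\hat{w}$ is its pruned version, $X$ stacks the calibration inputs column-wise, and $e_q$ is the $q$-th standard basis vector.

$E(\hat{w}) = \lVert w^{\top} X - \hat{w}^{\top} X \rVert_2^2, \qquad H = \nabla^2_{\hat{w}} E = 2 X X^{\top}$

$\rho_q = \frac{w_q^2}{2\,[H^{-1}]_{qq}}, \qquad \delta w = -\frac{w_q}{[H^{-1}]_{qq}}\, H^{-1} e_q$

Here $\rho_q$ is the increase in $E$ caused by removing weight $q$ after optimal compensation, and $\delta w$ is the update applied to the weights so that $w_q$ becomes exactly zero. Because $H$ depends only on the layer inputs, changing which inputs are collected (and how they are weighted) changes which weights are judged expendable.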

Error Accumulation: The authors recognize that errors introduced at early timesteps $t$ propagate and compound through subsequent steps, causing severe degradation in the final output.
Weighting Scheme: They introduce a logarithmically decreasing weighting scheme $\alpha_t$ for the Hessian matrix construction. This assigns higher importance to activations from earlier timesteps.
Formulation: The Hessian matrix $H_l$ for layer $l$ is computed as a weighted sum over all timesteps:
$H_l = 2 \sum_{t=1}^{T} \alpha_t \, \mathbb{E}\left[X_{l,t} X_{l,t}^{\top}\right]$
where $\alpha_t$ decreases as $t$ increases, ensuring the pruning criteria prioritize preserving the network's function during critical early generation stages.
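
The weighted sum above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions: the function name `weighted_hessian`, the tiny activation vectors, and the three `alphas` values are hypothetical, and the expectation is approximated by a per-step mean over calibration samples.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def weighted_hessian(activations_per_step: Sequence[Sequence[Sequence[float]]],
                     alphas: Sequence[float]) -> Matrix:
    """Accumulate H = 2 * sum_t alpha_t * mean_over_samples(x x^T).

    activations_per_step[t] holds the input vectors one layer saw at denoising step t.
    """
    dim = len(activations_per_step[0][0])
    hessian = [[0.0] * dim for _ in range(dim)]
    for samples, alpha in zip(activations_per_step, alphas):
        scale = 2.0 * alpha / len(samples)
        for x in samples:
            for i in range(dim):
                for j in range(dim):
                    hessian[i][j] += scale * x[i] * x[j]
    return hessian


# Toy data: 3 denoising steps, 2 calibration samples per step, 2 input features.
acts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, -1.0]],
    [[2.0, 0.0], [0.0, 0.0]],
]
alphas = [0.5, 0.3, 0.2]  # hypothetical decreasing weights, not the paper's values
H = weighted_hessian(acts, alphas)
print(H)
```

In practice the outer products would be accumulated on the fly while the denoising loop runs, so the activations themselves never need to be stored.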

B. Module Packages (Group-Wise Sequential Pruning)

To address the computational bottleneck of calibrating every layer individually in an iterative model (which would require running full denoising trajectories for every layer), OBS-Diff introduces a Group-Wise Sequential Pruning Strategy.

Module Packages: The model is partitioned into "Module Packages" (groups of layers with independent inputs, e.g., QKV projections).
Amortized Calibration: Instead of calibrating layer-by-layer, the framework runs the full denoising trajectory once per package. It uses forward hooks to collect activation statistics for all layers within the package simultaneously.
Efficiency: This drastically reduces the number of forward passes required, balancing memory usage (storing multiple Hessians) with computational time.
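
The amortization idea can be illustrated with a toy simulation. Everything here is a stand-in: `ToyDenoiser`, the layer names, the package grouping, and the scalar "activations" are hypothetical and only mimic the control flow described above (register hooks for one package, run the trajectory once, collect statistics for every layer in that package).

```python
from typing import Callable, Dict, List

Hook = Callable[[float], None]


class ToyDenoiser:
    """Stand-in for a diffusion model: every step passes a value through named layers."""

    def __init__(self, layer_names: List[str], num_steps: int) -> None:
        self.layer_names = layer_names
        self.num_steps = num_steps
        self.hooks: Dict[str, Hook] = {}
        self.trajectory_runs = 0

    def run_trajectory(self, x: float = 1.0) -> float:
        """Run all denoising steps once, calling any registered hook on each layer input."""
        self.trajectory_runs += 1
        for _ in range(self.num_steps):
            for name in self.layer_names:
                if name in self.hooks:
                    self.hooks[name](x)
                x = 0.5 * x + 1.0  # placeholder for the real layer computation
        return x


def make_hook(stats: Dict[str, float], name: str) -> Hook:
    """Return a hook that accumulates a second-moment statistic for one layer."""
    def hook(layer_input: float) -> None:
        stats[name] = stats.get(name, 0.0) + layer_input * layer_input
    return hook


def calibrate_by_package(model: ToyDenoiser, packages: List[List[str]]) -> Dict[str, float]:
    """One full trajectory per package; all layers in the package are observed together."""
    stats: Dict[str, float] = {}
    for package in packages:
        for name in package:
            model.hooks[name] = make_hook(stats, name)
        model.run_trajectory()
        for name in package:
            del model.hooks[name]
        # A real pipeline would prune the layers in `package` here, so that later
        # packages are calibrated on the outputs of the already-pruned layers.
    return stats


layers = ["q", "k", "v", "out", "ffn_up", "ffn_down"]
packages = [["q", "k", "v"], ["out"], ["ffn_up"], ["ffn_down"]]  # hypothetical grouping
model = ToyDenoiser(layers, num_steps=28)
stats = calibrate_by_package(model, packages)
print(model.trajectory_runs, "trajectory runs for", len(layers), "layers")
```

The saving grows with package size: larger packages mean fewer full trajectories, at the cost of holding more per-layer statistics in memory at once.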

C. Adaptability to Sparsity Granularities

OBS-Diff is designed to be versatile across different pruning granularities:

Unstructured: Direct application of OBS to individual weights.
Semi-Structured (N:M): For patterns like 2:4, the method prunes the $N$ weights with the lowest saliency scores within every block of $M$ weights.
Structured:
- FFN Neurons: Aggregates saliency scores of weights within a neuron to prune entire columns.
- MHA Heads: For Multi-Head Attention, it aggregates saliency across head weights. Crucially, for MMDiT architectures where shared attention heads feed into separate modality-specific paths, the authors use Reciprocal Rank Fusion (RRF) to combine importance rankings from different modalities into a single decision list for pruning.
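
Two of the selection rules above are simple enough to sketch directly. The code below is a minimal illustration, not the authors' implementation: `nm_keep_mask` follows the convention stated in this section (drop the lowest-saliency weights in each block), and `reciprocal_rank_fusion` uses the commonly cited constant `k = 60`, which is an assumption here since the value used in the paper is not stated. The saliency scores and head names are made up.

```python
from typing import Dict, List, Sequence


def nm_keep_mask(saliency: Sequence[float], n_prune: int, m: int) -> List[bool]:
    """Within every block of m consecutive weights, drop the n_prune lowest-saliency ones."""
    if len(saliency) % m != 0:
        raise ValueError("length must be a multiple of m")
    mask = [True] * len(saliency)
    for start in range(0, len(saliency), m):
        block = range(start, start + m)
        for idx in sorted(block, key=lambda i: saliency[i])[:n_prune]:
            mask[idx] = False
    return mask


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)), with rank starting at 1."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# 2:4 pattern on eight toy saliency scores.
mask = nm_keep_mask([0.9, 0.1, 0.4, 0.3, 0.2, 0.8, 0.7, 0.05], n_prune=2, m=4)

# Head importance as ranked by two modality-specific paths (most important first).
text_ranking = ["h2", "h0", "h1", "h3"]
image_ranking = ["h2", "h3", "h0", "h1"]
fused = reciprocal_rank_fusion([text_ranking, image_ranking])
print(mask, fused)
```

Heads at the end of the fused list are the first candidates for removal, since both modalities rank them low.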

3. Key Contributions

Revitalized OBS for Diffusion: Successfully adapted the Optimal Brain Surgeon framework to handle the complex, iterative nature of modern diffusion models (specifically MMDiT) without requiring retraining.
Timestep-Aware Hessian: Proposed a novel Hessian construction that weights early timesteps more heavily, effectively mitigating error accumulation during the denoising process.
Efficient Calibration Strategy: Introduced the "Module Package" approach to amortize the expensive data collection process, making one-shot pruning feasible for billion-parameter models.
Universal Granularity Support: Demonstrated the ability to perform unstructured, semi-structured (2:4), and structured (head/neuron) pruning within a single unified framework.

4. Experimental Results

The authors evaluated OBS-Diff on a diverse range of models: Stable Diffusion v2.1, SD 3-Medium, SD 3.5-Large, and Flux.1-dev, as well as a smaller DDPM on CIFAR-10.

Performance Metrics: Evaluated using FID, CLIP Score, and ImageReward on the MS-COCO 2014 validation set.
Unstructured Pruning: OBS-Diff significantly outperformed baselines (Magnitude, DSnoT, Wanda).
- At 50% sparsity on SD 3-Medium, OBS-Diff achieved an ImageReward of 0.6468, while Wanda dropped to -0.1076 and Magnitude to -2.2719.
- Baseline methods often produced "totally destroyed" images at high sparsity, whereas OBS-Diff maintained visual coherence.
Semi-Structured Pruning (2:4): On SD 3.5-Large, OBS-Diff achieved a CLIP score of 0.3129 and ImageReward of 0.4493, surpassing Wanda and DSnoT.
Structured Pruning: OBS-Diff showed remarkable resilience. At 30% sparsity on SDXL, the L1-norm baseline failed completely (FID > 170), while OBS-Diff maintained a strong FID of 29.75 (close to the dense model's 29.21).
Efficiency: The pruning process for the 2B parameter SD 3-Medium model took under 15 minutes on a single NVIDIA RTX 4090.
Inference Speedup: Achieved up to 1.31x speedup in wall-clock time for structured pruning at 30% sparsity.
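
As a quick sanity check on what the speedup figure means in terms of time saved: a speedup factor of $s$ implies each image takes $1/s$ of the original wall-clock time, so for the reported $s = 1.31$,

$\frac{t_{\text{pruned}}}{t_{\text{dense}}} = \frac{1}{1.31} \approx 0.763$

which corresponds to roughly 24% less time per image (equivalently, about 31% more images in the same time).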

5. Significance

State-of-the-Art (SOTA): OBS-Diff sets a new benchmark for training-free, one-shot pruning of diffusion models, outperforming existing methods across diverse architectures and sparsity levels.
Accessibility: By enabling accurate compression without fine-tuning, it lowers the barrier for deploying large-scale diffusion models on resource-constrained hardware.
Generalizability: The framework's ability to handle MMDiT, U-Net, and various sparsity patterns suggests it is a robust solution for the evolving landscape of generative AI.
Theoretical Insight: The work highlights the critical importance of temporal dynamics in diffusion model compression, showing empirically that treating the denoising steps as a unified trajectory, rather than as independent forward passes, is important for high-fidelity pruning.