Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers

The paper introduces Amber-Image, a cost-efficient compression framework that transforms the massive 60-layer Qwen-Image into lightweight 10B and 6B variants through depth pruning, hybrid-stream architecture, and distillation, achieving high-fidelity text-to-image generation with significantly reduced parameters and training costs.

Chaojie Yang, Tian Li, Yue Zhang, Jun Gao

Published 2026-02-20

Imagine you have a giant, world-class chef (the original 60-layer AI model) who can cook absolutely anything. This chef is incredibly talented, but they are also huge: they need a massive kitchen, a team of 50 assistants, and a fortune in ingredients to make a single meal. Most people can't afford to hire them or even fit them in their home kitchen.

The paper introduces Amber-Image, which is like a brilliant culinary school that teaches this giant chef how to downsize into a compact, efficient home cook without losing their ability to make gourmet meals.

Here is how they did it, broken down into simple steps:

1. The Problem: The "Over-Engineered" Kitchen

Current top-tier AI image generators (like the one they started with, called Qwen-Image) are like those giant chefs. They have 60 layers of "thinking" steps. To make an image, the AI has to pass the idea through all 60 layers. This takes a massive amount of computer power (GPU hours) and money. It's like using a nuclear reactor to boil an egg.

2. The Solution: The "Smart Downsize"

The researchers didn't build a new chef from scratch (which would take years and millions of dollars). Instead, they took the existing giant chef and compressed them. They created two smaller versions: Amber-Image-10B and Amber-Image-6B.

They used three clever tricks to do this:

Trick A: The "Redundant Assistant" Audit (Depth Pruning)

Imagine the 60 layers of the chef's brain as 60 assistants passing a recipe down a line. The researchers realized that some assistants were just whispering the same thing the previous assistant said. They didn't add much new value.

  • What they did: They identified the 30 "least important" assistants and let them go.
  • The Magic: Instead of just deleting them and leaving a gap, they took the knowledge of the fired assistants and blended it into the remaining ones. It's like taking the notes from the fired assistants and pasting them into the notebooks of the remaining staff so the team still knows everything. This cut the model size in half immediately (a rough code sketch of this idea follows below).
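In code, the idea looks roughly like the sketch below. This is a minimal illustration under assumptions, not the paper's actual procedure: the redundancy score (cosine similarity between a block's input and output) and the 50/50 weight blend are common heuristics chosen for demonstration, and the helper names `block_redundancy` and `prune_and_blend` are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def block_redundancy(block: nn.Module, x: torch.Tensor) -> float:
    """Score how little a block changes its input (higher = more redundant)."""
    with torch.no_grad():
        y = block(x)
        return F.cosine_similarity(x.flatten(1), y.flatten(1), dim=-1).mean().item()


def prune_and_blend(blocks: nn.ModuleList, x: torch.Tensor, keep: int) -> nn.ModuleList:
    """Drop the most redundant blocks, folding each one's weights into the
    previous surviving block so its "notes" are not simply thrown away."""
    scores, h = [], x
    for i, blk in enumerate(blocks):
        scores.append((block_redundancy(blk, h), i))
        with torch.no_grad():
            h = blk(h)
    # Most redundant (highest input/output similarity) are pruned first.
    pruned = {i for _, i in sorted(scores, reverse=True)[: len(blocks) - keep]}

    kept = []
    for i, blk in enumerate(blocks):
        if i not in pruned:
            kept.append(blk)
        elif kept:  # blend the pruned block's weights into the last kept block
            with torch.no_grad():
                for p_keep, p_drop in zip(kept[-1].parameters(), blk.parameters()):
                    p_keep.mul_(0.5).add_(0.5 * p_drop)
    return nn.ModuleList(kept)


# Toy usage: six identically shaped blocks pruned down to three.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(6)]
)
small = prune_and_blend(blocks, torch.randn(4, 64), keep=3)
print(len(small))  # 3
```

The detail the analogy is pointing at is the `elif kept:` branch: a pruned block's weights are folded into a surviving neighbor rather than discarded outright.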

Trick B: The "Hybrid Kitchen" (Single-Stream Conversion)

In the original chef's kitchen, there were two separate teams: one for "Text" (reading the recipe) and one for "Image" (cooking the food). They worked in parallel.

  • What they did: For the first part of the cooking process, they kept both teams separate because they need to focus on different things. But for the later stages (when the food is actually being plated), they realized the teams were doing very similar things.
  • The Magic: They merged the two teams into one super-team for the final 20 steps. This saved even more space and energy, creating the even smaller Amber-Image-6B (a simplified sketch of this hybrid layout follows below).
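A highly simplified sketch of this hybrid layout is below. The module names (`DualStreamBlock`, `SingleStreamBlock`), the toy block internals, and the layer counts are assumptions for illustration; the actual architecture is far larger and more detailed.

```python
import torch
import torch.nn as nn


class DualStreamBlock(nn.Module):
    """Early layers: shared attention over the joint sequence, but separate
    text and image feed-forward paths (two "teams")."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.txt_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.img_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU())

    def forward(self, txt, img):
        joint = torch.cat([txt, img], dim=1)
        joint = joint + self.attn(joint, joint, joint)[0]
        t, i = joint[:, : txt.size(1)], joint[:, txt.size(1):]
        return t + self.txt_mlp(t), i + self.img_mlp(i)


class SingleStreamBlock(nn.Module):
    """Later layers: one shared feed-forward path for both modalities
    (the merged "super-team"), so fewer parameters per block."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU())

    def forward(self, joint):
        joint = joint + self.attn(joint, joint, joint)[0]
        return joint + self.mlp(joint)


def hybrid_forward(txt, img, dual_blocks, single_blocks):
    for blk in dual_blocks:                 # early layers: two streams
        txt, img = blk(txt, img)
    joint = torch.cat([txt, img], dim=1)    # merge the streams once
    for blk in single_blocks:               # later layers: one stream
        joint = blk(joint)
    return joint[:, txt.size(1):]           # return the image tokens


txt, img = torch.randn(2, 8, 64), torch.randn(2, 16, 64)
out = hybrid_forward(
    txt, img,
    nn.ModuleList([DualStreamBlock(64) for _ in range(2)]),
    nn.ModuleList([SingleStreamBlock(64) for _ in range(2)]),
)
print(out.shape)  # torch.Size([2, 16, 64])
```

In this sketch the saving comes from the single-stream blocks carrying one shared feed-forward path instead of two modality-specific ones, which is what lets the merged version shrink further into the 6B variant.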

Trick C: The "Shadow Training" (Knowledge Distillation)

When you fire half the staff and merge the teams, the kitchen might get chaotic. The food might taste wrong at first.

  • What they did: They didn't throw away the giant chef. They kept the original 60-layer chef in the room as a teacher.
  • The Magic: The new, smaller team worked on a few thousand high-quality recipes while the giant chef watched. Whenever the small team made a mistake, the giant chef corrected them. This "shadow training" happened very quickly and didn't require millions of new recipes. It just required the small team to mimic the big team's style (a minimal sketch of one training step follows below).
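A minimal sketch of one "shadow training" update is below. The plain mean-squared-error loss on the teacher's outputs, and the names `distill_step`, `teacher_net`, and `student_net`, are assumptions for illustration; the paper's distillation objective may combine several terms.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def distill_step(student, teacher, optimizer, latents, timesteps, text_emb):
    """One "shadow training" update: the small model mimics the big one."""
    with torch.no_grad():                        # the teacher only demonstrates
        target = teacher(latents, timesteps, text_emb)
    pred = student(latents, timesteps, text_emb)
    loss = F.mse_loss(pred, target)              # copy the big team's "style"
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy stand-ins so the sketch runs end to end (the real models are diffusion
# transformers conditioned on a timestep and text embeddings).
teacher_net = nn.Linear(64, 64)   # stands in for the frozen 60-layer teacher
student_net = nn.Linear(64, 64)   # stands in for the pruned student
opt = torch.optim.AdamW(student_net.parameters(), lr=1e-4)
z, t, c = torch.randn(4, 64), torch.rand(4), torch.randn(4, 64)
print(distill_step(lambda z, t, c: student_net(z),
                   lambda z, t, c: teacher_net(z), opt, z, t, c))
```

Because the teacher only supplies targets (no gradients flow through it), each update is cheap, which is why the whole process needs only a small, high-quality dataset rather than millions of new examples.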

3. The Results: A Tiny Chef with a Giant's Skill

The results were shocking.

  • Speed & Cost: The whole process of shrinking the model took less than 2,000 GPU hours. To put that in perspective, training a comparable model from scratch usually takes tens of thousands of GPU hours. It's the difference between a weekend project and a decade-long construction job.
  • Quality: The new "small" chefs (Amber-Image) could cook meals that were just as delicious as the giant one. In fact, on many tests (like following complex instructions or drawing specific objects), they actually did better than the original giant chef and even beat some of the most expensive, closed-source systems in the world.
  • Text: They are particularly good at writing words inside images (like drawing a sign that says "Open"), which is usually very hard for AI.

The Bottom Line

The paper proves that you don't need a supercomputer and a billion dollars to make amazing AI art. By being smart about cutting out the fluff and teaching the small model to copy the big one, you can get 90% of the performance with 30% of the cost.

It's like taking a Formula 1 race car, stripping out the parts you don't need on a public road, and turning it into a sleek, fast sports car that you can actually drive on the street, all while keeping the engine's power intact.
