Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers

The paper introduces Amber-Image, a cost-efficient compression framework that transforms the massive 60-layer Qwen-Image into lightweight 10B and 6B variants through depth pruning, hybrid-stream architecture, and distillation, achieving high-fidelity text-to-image generation with significantly reduced parameters and training costs.

Chaojie Yang, Tian Li, Yue Zhang, Jun Gao

Published 2026-02-20

Imagine you have a giant, world-class chef (the original 60-layer AI model) who can cook absolutely anything. This chef is incredibly talented, but they are also huge: they need a massive kitchen, a team of 50 assistants, and a fortune in ingredients to make a single meal. Most people can't afford to hire them or even fit them in their home kitchen.

The paper introduces Amber-Image, which is like a brilliant culinary school that teaches this giant chef how to downsize into a compact, efficient home cook without losing their ability to make gourmet meals.

Here is how they did it, broken down into simple steps:

1. The Problem: The "Over-Engineered" Kitchen

Current top-tier AI image generators (like the one they started with, called Qwen-Image) are like those giant chefs. They have 60 layers of "thinking" steps. To make an image, the AI has to pass the idea through all 60 layers. This takes a massive amount of computer power (GPU hours) and money. It's like using a nuclear reactor to boil an egg.

2. The Solution: The "Smart Downsize"

The researchers didn't build a new chef from scratch (which would take years and millions of dollars). Instead, they took the existing giant chef and compressed them. They created two smaller versions: Amber-Image-10B and Amber-Image-6B.

They used three clever tricks to do this:

Trick A: The "Redundant Assistant" Audit (Depth Pruning)

Imagine the 60 layers of the chef's brain as 60 assistants passing a recipe down a line. The researchers realized that some assistants were just whispering the same thing the previous assistant said. They didn't add much new value.

  • What they did: They identified the 30 "least important" assistants and let them go.
  • The Magic: Instead of just deleting them and leaving a gap, they took the knowledge of the fired assistants and blended it into the remaining ones. It's like taking the notes from the fired assistants and pasting them into the notebooks of the remaining staff so the team still knows everything. This cut the model size in half immediately (a rough code sketch of this idea follows below).
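In code, the idea looks roughly like the sketch below. This is a minimal illustration under assumptions, not the paper's actual procedure: the redundancy score (cosine similarity between a block's input and output) and the 50/50 weight blend are common heuristics chosen for demonstration, and the helper names `block_redundancy` and `prune_and_blend` are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def block_redundancy(block: nn.Module, x: torch.Tensor) -> float:
    """Score how little a block changes its input (higher = more redundant)."""
    with torch.no_grad():
        y = block(x)
        return F.cosine_similarity(x.flatten(1), y.flatten(1), dim=-1).mean().item()


def prune_and_blend(blocks: nn.ModuleList, x: torch.Tensor, keep: int) -> nn.ModuleList:
    """Drop the most redundant blocks, folding each one's weights into the
    previous surviving block so its "notes" are not simply thrown away."""
    scores, h = [], x
    for i, blk in enumerate(blocks):
        scores.append((block_redundancy(blk, h), i))
        with torch.no_grad():
            h = blk(h)
    # Most redundant (highest input/output similarity) are pruned first.
    pruned = {i for _, i in sorted(scores, reverse=True)[: len(blocks) - keep]}

    kept = []
    for i, blk in enumerate(blocks):
        if i not in pruned:
            kept.append(blk)
        elif kept:  # blend the pruned block's weights into the last kept block
            with torch.no_grad():
                for p_keep, p_drop in zip(kept[-1].parameters(), blk.parameters()):
                    p_keep.mul_(0.5).add_(0.5 * p_drop)
    return nn.ModuleList(kept)


# Toy usage: six identically shaped blocks pruned down to three.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(6)]
)
small = prune_and_blend(blocks, torch.randn(4, 64), keep=3)
print(len(small))  # 3
```

The detail the analogy is pointing at is the `elif kept:` branch: a pruned block's weights are folded into a surviving neighbor rather than discarded outright.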

Trick B: The "Hybrid Kitchen" (Single-Stream Conversion)

In the original chef's kitchen, there were two separate teams: one for "Text" (reading the recipe) and one for "Image" (cooking the food). They worked in parallel.

  • What they did: For the first part of the cooking process, they kept both teams separate because they need to focus on different things. But for the later stages (when the food is actually being plated), they realized the teams were doing very similar things.
  • The Magic: They merged the two teams into one super-team for the final 20 steps. This saved even more space and energy, creating the even smaller Amber-Image-6B (a simplified sketch of this hybrid layout follows below).
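A highly simplified sketch of this hybrid layout is below. The module names (`DualStreamBlock`, `SingleStreamBlock`), the toy block internals, and the layer counts are assumptions for illustration; the actual architecture is far larger and more detailed.

```python
import torch
import torch.nn as nn


class DualStreamBlock(nn.Module):
    """Early layers: shared attention over the joint sequence, but separate
    text and image feed-forward paths (two "teams")."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.txt_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.img_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU())

    def forward(self, txt, img):
        joint = torch.cat([txt, img], dim=1)
        joint = joint + self.attn(joint, joint, joint)[0]
        t, i = joint[:, : txt.size(1)], joint[:, txt.size(1):]
        return t + self.txt_mlp(t), i + self.img_mlp(i)


class SingleStreamBlock(nn.Module):
    """Later layers: one shared feed-forward path for both modalities
    (the merged "super-team"), so fewer parameters per block."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU())

    def forward(self, joint):
        joint = joint + self.attn(joint, joint, joint)[0]
        return joint + self.mlp(joint)


def hybrid_forward(txt, img, dual_blocks, single_blocks):
    for blk in dual_blocks:                 # early layers: two streams
        txt, img = blk(txt, img)
    joint = torch.cat([txt, img], dim=1)    # merge the streams once
    for blk in single_blocks:               # later layers: one stream
        joint = blk(joint)
    return joint[:, txt.size(1):]           # return the image tokens


txt, img = torch.randn(2, 8, 64), torch.randn(2, 16, 64)
out = hybrid_forward(
    txt, img,
    nn.ModuleList([DualStreamBlock(64) for _ in range(2)]),
    nn.ModuleList([SingleStreamBlock(64) for _ in range(2)]),
)
print(out.shape)  # torch.Size([2, 16, 64])
```

In this sketch the saving comes from the single-stream blocks carrying one shared feed-forward path instead of two modality-specific ones, which is what lets the merged version shrink further into the 6B variant.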

Trick C: The "Shadow Training" (Knowledge Distillation)

When you fire half the staff and merge the teams, the kitchen might get chaotic. The food might taste wrong at first.

  • What they did: They didn't throw away the giant chef. They kept the original 60-layer chef in the room as a teacher.
  • The Magic: The new, smaller team worked on a few thousand high-quality recipes while the giant chef watched. Whenever the small team made a mistake, the giant chef corrected them. This "shadow training" happened very quickly and didn't require millions of new recipes. It just required the small team to mimic the big team's style (a minimal sketch of one training step follows below).
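A minimal sketch of one "shadow training" update is below. The plain mean-squared-error loss on the teacher's outputs, and the names `distill_step`, `teacher_net`, and `student_net`, are assumptions for illustration; the paper's distillation objective may combine several terms.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def distill_step(student, teacher, optimizer, latents, timesteps, text_emb):
    """One "shadow training" update: the small model mimics the big one."""
    with torch.no_grad():                        # the teacher only demonstrates
        target = teacher(latents, timesteps, text_emb)
    pred = student(latents, timesteps, text_emb)
    loss = F.mse_loss(pred, target)              # copy the big team's "style"
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy stand-ins so the sketch runs end to end (the real models are diffusion
# transformers conditioned on a timestep and text embeddings).
teacher_net = nn.Linear(64, 64)   # stands in for the frozen 60-layer teacher
student_net = nn.Linear(64, 64)   # stands in for the pruned student
opt = torch.optim.AdamW(student_net.parameters(), lr=1e-4)
z, t, c = torch.randn(4, 64), torch.rand(4), torch.randn(4, 64)
print(distill_step(lambda z, t, c: student_net(z),
                   lambda z, t, c: teacher_net(z), opt, z, t, c))
```

Because the teacher only supplies targets (no gradients flow through it), each update is cheap, which is why the whole process needs only a small, high-quality dataset rather than millions of new examples.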

3. The Results: A Tiny Chef with a Giant's Skill

The results were shocking.

  • Speed & Cost: The whole process of shrinking the model took less than 2,000 GPU hours. To put that in perspective, training a comparable model from scratch usually takes tens of thousands of GPU hours. It's the difference between a weekend project and a decade-long construction job.
  • Quality: The new "small" chefs (Amber-Image) could cook meals that were just as delicious as the giant one. In fact, on many tests (like following complex instructions or drawing specific objects), they actually did better than the original giant chef and even beat some of the most expensive, closed-source systems in the world.
  • Text: They are particularly good at writing words inside images (like drawing a sign that says "Open"), which is usually very hard for AI.

The Bottom Line

The paper proves that you don't need a supercomputer and a billion dollars to make amazing AI art. By being smart about cutting out the fluff and teaching the small model to copy the big one, you can get 90% of the performance with 30% of the cost.

It's like taking a Formula 1 race car, stripping out the parts you don't need on a public road, and turning it into a sleek, fast sports car that you can actually drive on the street, all while keeping the engine's power intact.
