ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework

Imagine you want to teach a robot artist to paint exactly what you have in mind. You tell it, "Change the sky to sunset," or "Remove that ugly sign," or "Turn this photo into a comic book."

For a long time, teaching these robots was like trying to train a dog using only a few scraps of food. The data (the "scraps") was either too small or too messy, or it was too expensive to get, because making it meant paying huge fees to "super-robots" (commercial AI) to generate the examples.

ScaleEdit-12M is a new project that solves this by building a massive, free, and high-quality training kitchen using a team of robot helpers.

Here is the simple breakdown of how they did it:

1. The Problem: The "Expensive Chef" Dilemma

Previously, to get good training data, researchers had two bad options:

- Option A: Use a cheap, open-source robot to make the data. But this robot was clumsy, making photos that looked weird or didn't follow instructions well.
- Option B: Pay a "Super Chef" (like GPT-4o) to make the data. The results were delicious, but the bill was so high you could only cook a tiny meal. You couldn't feed the whole world.

2. The Solution: The "Robot Assembly Line" (ScaleEditor)

The authors built a new system called ScaleEditor. Think of this not as one robot, but as a factory assembly line run by a team of specialized robot workers who never get tired and, because they are built from open-source tools, charge no commercial fees.

Here is how their "factory" works in three steps:

Step 1: The Scavengers (Source Expansion)
Imagine a team of robots scouring the entire internet, libraries, and even generating new pictures from scratch. They gather over 10 million different images—beaches, cities, people, animals—so the training data isn't just about cats and dogs. It's about everything.
Step 2: The Editors (Multi-Agent Synthesis)
This is the magic part. Instead of one robot trying to do everything, they use a Task Router (like a foreman).
- If the image is a building, the foreman sends it to the "Architecture Robot."
- If the image has text, it goes to the "Typo-Fixing Robot."
- If the instruction is complex (like "Make the sun look like a melting clock"), it goes to the "Logic Robot."
Each specialized robot writes a specific instruction and edits the photo perfectly for that job. They work together to create 12 million unique "Before and After" pairs.
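To make the foreman idea concrete, here is a minimal sketch of a task router in Python. It is only an illustration of the routing pattern described above, not the ScaleEditor code: the `EditRequest` type, the tag names, the agent functions, and the word-count rule for spotting a "complex" instruction are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set, Tuple


@dataclass
class EditRequest:
    """Hypothetical description of one image waiting to be edited."""
    image_id: str
    tags: Set[str] = field(default_factory=set)  # assumed content tags, e.g. {"building"}
    instruction: str = ""


# Assumed placeholder rule: instructions longer than this count as "complex".
COMPLEX_WORD_COUNT = 6


def route(request: EditRequest) -> str:
    """Play the foreman: pick the name of the specialist for this request."""
    if "building" in request.tags:
        return "architecture"
    if "text" in request.tags:
        return "text"
    if len(request.instruction.split()) > COMPLEX_WORD_COUNT:
        return "logic"
    return "general"


# Stand-in specialists. A real agent would write an instruction and edit the
# image; these only return a label so the sketch stays runnable.
def architecture_agent(request: EditRequest) -> str:
    return f"architecture edit for {request.image_id}"


def text_agent(request: EditRequest) -> str:
    return f"text edit for {request.image_id}"


def logic_agent(request: EditRequest) -> str:
    return f"reasoning edit for {request.image_id}"


def general_agent(request: EditRequest) -> str:
    return f"general edit for {request.image_id}"


AGENTS: Dict[str, Callable[[EditRequest], str]] = {
    "architecture": architecture_agent,
    "text": text_agent,
    "logic": logic_agent,
    "general": general_agent,
}


def synthesize(request: EditRequest) -> Tuple[str, str]:
    """Route the request, then hand it to the chosen specialist."""
    name = route(request)
    return name, AGENTS[name](request)
```

Calling `synthesize(EditRequest("img-1", {"building"}))` sends the image to the architecture specialist, while a tagless request with a long instruction goes to the logic specialist. Keeping the routing rule separate from the specialists means a new kind of worker can be added without touching the others.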
Step 3: The Inspectors (Quality Control)
Before the data is used, a team of "Inspectors" (another set of AI robots) checks every single photo. They ask:
- "Did the robot actually do what the instruction said?"
- "Does the photo look natural, or did it glitch?"
- "Is the lighting right?"
If a photo fails even one check, it gets thrown in the trash. Only the perfect 12 million make the cut.
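The "fail one check and you are out" rule can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's verifier: the `EditedPair` fields, the three scoring checks, and the 0.8 threshold are all hypothetical, and real inspectors would be AI models producing those scores.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EditedPair:
    """Hypothetical before/after pair with scores in [0, 1] from the inspectors."""
    pair_id: str
    instruction_score: float  # did the edit do what the instruction said?
    naturalness_score: float  # does the result look natural, without glitches?
    lighting_score: float     # is the lighting consistent?


# Assumed pass mark; the source does not state the real thresholds.
THRESHOLD = 0.8


def follows_instruction(pair: EditedPair) -> bool:
    return pair.instruction_score >= THRESHOLD


def looks_natural(pair: EditedPair) -> bool:
    return pair.naturalness_score >= THRESHOLD


def lighting_is_right(pair: EditedPair) -> bool:
    return pair.lighting_score >= THRESHOLD


CHECKS: List[Callable[[EditedPair], bool]] = [
    follows_instruction,
    looks_natural,
    lighting_is_right,
]


def passes_all(pair: EditedPair) -> bool:
    """A pair survives only if every inspector approves it."""
    return all(check(pair) for check in CHECKS)


def filter_pairs(pairs: List[EditedPair]) -> List[EditedPair]:
    """Keep the pairs that pass every check and discard the rest."""
    return [pair for pair in pairs if passes_all(pair)]
```

For example, a pair scoring 0.9 on all three checks is kept, while a pair scoring 0.9, 0.9, and 0.5 is discarded because the lighting check alone fails.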

3. The Result: The "Super-Student"

They took this massive, high-quality dataset (called ScaleEdit-12M) and used it to train two popular open-source AI models (UniWorld-V1 and Bagel).

The Analogy:
Imagine you have a student who is okay at art.

- Before: You gave them a few old, blurry magazines to study. They improved a little.
- After: You gave them a library of 12 million perfect, high-definition art books with step-by-step instructions.
- The Result: The student didn't just get a little better; they became a master.

4. Why This Matters

The paper shows that when they tested these new models:

- They beat almost every other open-source model.
- They performed just as well as models trained on commercial data, which is far more expensive to produce.
- They got really good at tricky tasks, like fixing text in a sign or changing the physical state of an object (e.g., making an egg look cracked).

The Bottom Line

This paper shows you don't need to spend a fortune on "Super Chefs" to train great AI. By building a smart, automated team of robot workers (a multi-agent framework), you can cook up a massive, high-quality dataset without paying commercial fees.

It's like turning a small, messy home kitchen into a world-class, automated restaurant chain that feeds the entire AI community, proving that open-source collaboration can rival expensive, closed-door secrets.
