CompBench: Benchmarking Complex Instruction-guided Image Editing

This paper introduces CompBench, a large-scale benchmark featuring fine-grained instructions and an MLLM-human collaborative framework to rigorously evaluate and expose the limitations of current models in complex, instruction-guided image editing tasks.

Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, Lei Bai, Wanli Ouyang, Lin Chen, Fei Zhao, Yao Hu, Zihan Wang, Yuan
Published 2026-03-24

Imagine you have a magical photo editor. You can tell it, "Make the dog wear a hat," and it does. That's the current state of AI image editing. But what if you asked it something trickier? Like, "Move the dog to the left of the tree, but make sure he's hiding behind the bush so only his ears show, and change his hat to a red one that matches the sunset?"

Current AI models usually crash and burn on requests like that. They get confused, mess up the background, or just ignore half your instructions.

This paper introduces CompBench, a new "final exam" designed to test just how good these AI editors really are at handling complex, real-world requests. Here is the breakdown using simple analogies:

1. The Problem: The "Baby Picture" Test

Think of existing benchmarks (the tests used to grade AI) as baby pictures. They are simple, clean, and contain very few objects.

  • The Issue: If you train a chef to only cook scrambled eggs on a white plate, they will look like a 5-star chef. But if you ask them to cook a complex banquet with 10 different dishes, intricate plating, and specific dietary restrictions, they might fail miserably.
  • The Reality: Current AI tests only ask for "scrambled eggs" (simple edits like "change the dog to a cat"). They don't test if the AI can handle a "banquet" (complex scenes with many objects, occlusions, and specific spatial relationships).

2. The Solution: CompBench (The "Culinary Olympics")

The authors created CompBench, which is like the Culinary Olympics for AI. Instead of simple eggs, they throw complex challenges at the models.

  • The Menu: They have 9 different types of difficult tasks (a schematic sample record follows this list), such as:
    • Multi-Object Editing: "Remove the two zebras in the back but keep the one in the front."
    • Action Editing: "Make the giraffe bend its neck down."
    • Viewpoint Editing: "Move the camera to the right to reveal a building hidden behind a tree."
    • Implicit Reasoning: "What would happen if the dog slipped on the snow?" (This requires the AI to imagine physics and consequences, not just swap pixels).
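
To make that menu concrete, here is a minimal sketch of how a single benchmark sample might be stored as a data record. The class, field names, task labels, and file paths below are assumptions for illustration only, not CompBench's released format.

```python
from dataclasses import dataclass

# Hypothetical record layout for one CompBench-style sample.
# Field names and task labels are illustrative, not the paper's schema.
@dataclass
class EditSample:
    task_type: str      # e.g., "multi_object", "action", "viewpoint", "implicit_reasoning"
    instruction: str    # the natural-language editing instruction
    source_image: str   # path to the original image
    target_image: str   # path to the human-verified ground-truth edit

sample = EditSample(
    task_type="multi_object",
    instruction="Remove the two zebras in the back but keep the one in the front.",
    source_image="images/zebras_src.png",
    target_image="images/zebras_tgt.png",
)
```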

3. How They Built It: The "Architect and the Builder"

Creating these tests was hard because you can't just ask an AI to make them; the AI would make mistakes.

  • The Strategy: They used a Collaborative Framework (a rough code sketch follows this list).
    • The Architect (AI): A smart AI looks at a complex photo and suggests an editing instruction.
    • The Builder (Human): A human expert checks the suggestion. If the instruction is vague or the result looks weird, they fix it.
  • The Result: They ended up with over 3,000 high-quality, "perfect" examples where the instruction matches the result exactly. This ensures the test is fair.
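
As a rough sketch, that architect-and-builder loop might look like the code below. The functions `propose_instruction` and `human_review` are hypothetical stand-ins for the MLLM proposal step and the expert check; the paper does not specify the pipeline at this level of detail.

```python
from typing import Callable, List, NamedTuple, Tuple

class Review(NamedTuple):
    accepted: bool
    instruction: str  # possibly revised by the human expert

def curate(
    images: List[str],
    propose_instruction: Callable[[str], str],   # stand-in for the MLLM "architect"
    human_review: Callable[[str, str], Review],  # stand-in for the human "builder"
    target_size: int = 3000,
) -> List[Tuple[str, str]]:
    """Collect verified (image, instruction) pairs via an AI-propose, human-check loop."""
    dataset: List[Tuple[str, str]] = []
    for image in images:
        candidate = propose_instruction(image)   # AI drafts a complex edit instruction
        review = human_review(image, candidate)  # expert accepts, fixes, or rejects it
        if review.accepted:
            dataset.append((image, review.instruction))
        if len(dataset) >= target_size:
            break
    return dataset
```

A real pipeline would likely also log rejections and re-prompt the model, but the accept-or-fix loop captures the division of labor the authors describe.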

4. The "Decoupling" Trick

One of their smartest ideas is Instruction Decoupling.

  • The Analogy: Imagine giving a builder a messy note: "Fix the room." The builder might fix the wrong thing.
  • The Fix: CompBench teaches the AI to break that messy note into four clear blueprints (see the sketch after this list):
    1. Location: Where is the object? (e.g., "Top left").
    2. Appearance: What does it look like? (e.g., "Red and spotted").
    3. Dynamics: What is it doing? (e.g., "Flying").
    4. Objects: Which object is it? (e.g., "The fish").

By separating these, the AI understands the instructions much better.
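
As a toy illustration, the decoupled form could be stored as a simple structured record. The four fields mirror the paper's categories; everything else here (the class name, the example instruction, the parsed values) is invented for this sketch.

```python
from dataclasses import dataclass

# The four fields follow the paper's decoupling categories; the class name
# and example values are hypothetical.
@dataclass
class DecoupledInstruction:
    location: str    # where the target is, e.g., "top left"
    appearance: str  # what it looks like, e.g., "red and spotted"
    dynamics: str    # what it is doing, e.g., "flying"
    objects: str     # which object it is, e.g., "the fish"

# One messy instruction, split into four clear "blueprints":
# "Make the red, spotted fish in the top left look like it is flying."
parsed = DecoupledInstruction(
    location="top left",
    appearance="red and spotted",
    dynamics="flying",
    objects="the fish",
)
```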

5. The Results: Who Passed the Test?

They tested 15 different AI models on this new exam.

  • The Findings: Most models failed the hard questions. They could handle simple edits but fell apart when asked to reason about space or complex interactions.
  • The Winners: A few models stood out, specifically Bagel, Qwen-Image-Edit, and FLUX.1 Kontext. These models are like the "Master Chefs" who can actually handle the banquet.
  • The Failure Mode: The biggest problem found was Hallucination. When asked to move an object, the AI often left the object in the wrong place or distorted the background (stretching a building like taffy).

6. Why This Matters

This paper is a wake-up call. It tells us that while AI image editors are getting better, they are still "babies" when it comes to complex reasoning.

  • The Future: To build the next generation of tools, we need to stop training AI on simple tasks and start teaching them to understand spatial logic, physics, and complex instructions.

In a nutshell: CompBench is a tough new gym for AI image editors. It proves that most are currently out of shape, but it also shows us exactly what exercises they need to do to become the super-intelligent editors of the future.
