Imagine you have a magical photo editor that can listen to your voice and change things in a picture. You say, "Make that tiny red bird blue," and poof! It happens. This is the dream of Instruction-based Image Editing Models (IIEMs).
But here's the catch: While these AI models are great at changing big things (like turning a whole sky blue or swapping a car for a truck), they often get completely lost when you ask them to fix something tiny, like a speck of dust, a small logo, or a single button on a shirt. They might change the wrong thing, or they might accidentally ruin the rest of the photo while trying to fix the small detail.
This paper introduces DLEBench (DeepLookEditBench), a new "stress test" designed specifically to see how good these AI models are at editing tiny objects.
Here is a breakdown of the paper using simple analogies:
1. The Problem: The "Needle in a Haystack" Issue
Think of current image editing benchmarks like a game where you have to find and change a giant elephant in a photo. Most AI models can do this easily.
But DLEBench is like asking the AI to find and change a single grain of rice in a bowl of soup.
- The Issue: When the object is small (occupying only 1% to 10% of the image), the AI often gets confused. It might change the wrong grain of rice, or it might spill the whole bowl of soup while trying to pick up one grain.
- The Goal: The authors built a dataset of 1,889 tiny-object challenges to see which AI models can actually handle this "needle in a haystack" task without messing up the rest of the picture.
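The "tiny" criterion above (an object occupying only 1% to 10% of the image) boils down to a simple area ratio. Here is a minimal sketch of such a filter; the function name and the bounding-box convention are hypothetical, not from the paper:

```python
def occupancy_ratio(box, image_size):
    """Fraction of the image covered by the object's bounding box.

    Hypothetical filter for the 'tiny object' criterion: the benchmark
    targets objects occupying roughly 1% to 10% of the image.
    box: (left, top, right, bottom), right/bottom exclusive.
    """
    left, top, right, bottom = box
    width, height = image_size
    return ((right - left) * (bottom - top)) / (width * height)

# A 100x100 box inside a 1000x1000 image covers exactly 1% of the area.
ratio = occupancy_ratio((0, 0, 100, 100), (1000, 1000))
is_tiny = 0.01 <= ratio <= 0.10
```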
2. Building the Test: The "Recipe" for Tiny Edits
Creating these test cases was hard because you can't ask humans to craft 1,889 tiny edits by hand; it would take forever.
- The Solution: The team built a semi-automated assembly line.
- Step 1: They took existing puzzles (visual reasoning questions like "What color is the scarf?") and turned them into editing instructions ("Change the scarf from red to green").
- Step 2: They used a special "crop-and-edit" trick. Instead of asking the AI to edit the whole huge photo, they cut out a small piece containing just the tiny object, edited that piece perfectly, and then put it back. This created a "Gold Standard" answer (the Reference Image) that humans could use to grade the AI.
- Step 3: Humans checked the work to make sure the "tiny object" was actually tiny and the instructions made sense.
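The crop-and-edit trick in Step 2 can be sketched in a few lines. This is a toy illustration, assuming an image represented as a grid of pixels and a stand-in `edit_fn` in place of a real editing model:

```python
def crop_edit_paste(image, box, edit_fn):
    """Crop the small region, edit only that patch, paste it back.

    A minimal sketch of the 'crop-and-edit' idea: the editing model
    works on a small crop (where the object is large relative to the
    frame), and the edited patch is reinserted at its original spot.
    image: list of rows; box: (left, top, right, bottom), exclusive.
    """
    left, top, right, bottom = box
    patch = [row[left:right] for row in image[top:bottom]]  # isolate the tiny object
    edited = edit_fn(patch)                                 # edit the patch in isolation
    result = [row[:] for row in image]                      # leave the original untouched
    for dy, row in enumerate(edited):
        result[top + dy][left:right] = row                  # reinsert at original coordinates
    return result

# Hypothetical example: recolor a 2x2 "object" inside a 10x10 image.
img = [["white"] * 10 for _ in range(10)]
out = crop_edit_paste(img, (4, 4, 6, 6),
                      lambda p: [["blue"] * len(r) for r in p])
```

Because only the patch is touched, the rest of the picture is preserved by construction, which is what makes the result usable as a "Gold Standard" reference.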
3. The Grading System: The "Detective" vs. The "Human"
How do you grade an AI's work? Usually, we use other AIs to judge the results (like a robot teacher grading a student). But the authors found that standard "Robot Teachers" (Large Multimodal Models) are blind to tiny details. They might say, "Looks good!" even if the AI changed the wrong object.
To fix this, they created a Dual-Mode Grading System:
- Mode A: The Tool-Driven Detective. Instead of just looking at the picture, the grading AI is given a magnifying glass, a highlighter, and a ruler (digital tools). It can zoom in, compare pixels, and highlight differences. This helps it see the tiny changes it would otherwise miss.
- Mode B: The Oracle-Guided Pro. In this mode, the grader is given the exact coordinates of the tiny object (like a treasure map). It doesn't have to search; it just looks at the specific spot to see if the edit was done right. This removes the confusion of "where is the object?"
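Mode B's idea, that grading gets easy once you know exactly where to look, can be sketched as a simple two-part score. The scoring scheme below is a hypothetical illustration (the paper's actual grader is an LMM, not a pixel comparison): inside the oracle box, the edit should match the reference; outside it, nothing should change.

```python
def grade_with_oracle_box(before, after, reference, box):
    """Toy sketch of oracle-guided grading (Mode B).

    Hypothetical scoring: pixels inside the given bounding box should
    match the reference edit; pixels outside it should be untouched.
    Images are lists of rows; box: (left, top, right, bottom), exclusive.
    """
    left, top, right, bottom = box
    inside_total = inside_match = 0
    outside_total = outside_kept = 0
    for y, row in enumerate(after):
        for x, px in enumerate(row):
            if left <= x < right and top <= y < bottom:
                inside_total += 1
                inside_match += (px == reference[y][x])  # was the edit done right?
            else:
                outside_total += 1
                outside_kept += (px == before[y][x])     # was the rest left alone?
    return {
        "edit_accuracy": inside_match / inside_total,
        "background_preservation": outside_kept / outside_total,
    }

# Hypothetical example: a perfect edit inside the box, plus one stray
# background pixel the model should not have touched.
before = [["w"] * 4 for _ in range(4)]
reference = [row[:] for row in before]
for y in range(1, 3):
    for x in range(1, 3):
        reference[y][x] = "b"          # the intended tiny edit
after = [row[:] for row in reference]  # model nailed the edit...
after[0][0] = "b"                      # ...but damaged the background
scores = grade_with_oracle_box(before, after, reference, (1, 1, 3, 3))
```

Splitting the score this way captures both failure modes the paper reports: localization failure (low edit accuracy) and over-modification (low background preservation).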
4. The Results: The "Big Fish" vs. The "Small Fry"
The authors tested 10 different AI models (both free and paid) on this new test.
- The Shock: The most famous, expensive, "closed-source" models (like the ones from Google or OpenAI) were not the best at this task. In fact, some open-source models actually performed better!
- The Main Finding: Almost all models struggled.
- Localization Failure: Many models couldn't even find the tiny object. They changed the background instead.
- Over-Modification: Some found the object but ruined its shape or texture while changing its color.
- The "Change Count" Nightmare: Asking an AI to change the number of tiny objects (e.g., "Add two more tiny bees") was the hardest task of all. Most models failed miserably.
5. Why This Matters
Imagine you are using an AI to fix a photo of your family. You want to remove a tiny speck of dust from your grandmother's glasses.
- Without DLEBench: The AI might remove the whole face, or change the color of the glasses to purple.
- With DLEBench: We now have a way to measure exactly how good an AI is at these delicate tasks. This forces developers to build better models that can handle the "fine print" of image editing, not just the big picture.
Summary
DLEBench is a new, difficult exam for AI image editors. It proves that while AI is getting smarter at big changes, it is still terrible at small, precise ones. The paper also gives us a better way to grade these AIs, ensuring that in the future, when we ask for a tiny fix, we actually get a tiny fix—and not a disaster.