Imagine you are asking a chef to cook a complex meal. You give them the raw ingredients (the starting image) and the finished dish (the final image).
Most current AI image editors are like chefs who can magically snap their fingers and turn the raw ingredients into the finished dish instantly. They are great at the "before" and the "after." But if you ask them, "Show me the step-by-step process of how you chopped the onions, sautéed the garlic, and simmered the sauce," they usually just stare blankly or give you a jumbled mess. They know the destination, but they don't understand the journey.
InEdit-Bench is a new "driving test" for AI image editors designed to fix this. It doesn't just ask, "Can you get from Point A to Point B?" It asks, "Can you show me the map of every turn, stop, and traffic light along the way?"
Here is a breakdown of the paper using simple analogies:
1. The Problem: The "Teleportation" Trap
Current AI models are like teleporters. They can take you from your living room to the moon instantly. But they can't explain the physics of the rocket, the fuel burn, or the orbit.
- The Issue: AI is great at single-step edits (e.g., "Make the sky blue"). But it fails at multi-step reasoning (e.g., "Show me how a caterpillar turns into a butterfly, step-by-step"). It often skips steps, reverses time, or creates impossible physics (like a building collapsing upwards).
2. The Solution: InEdit-Bench (The "Journey Map" Test)
The researchers created a new benchmark called InEdit-Bench. Think of it as a logic puzzle for AI.
- The Input: You give the AI a "Start" photo and an "End" photo.
- The Task: The AI must generate a comic strip (a grid of images) showing the logical steps in between.
- The Goal: The comic strip must make sense. If you are turning a lump of clay into a vase, the clay shouldn't suddenly turn into a bird in step 3 and then back to clay in step 4.
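The Start/End setup above can be sketched as a tiny data structure. This is a minimal illustration, assuming hypothetical field names (`start_image`, `num_steps`, etc.), not the benchmark's actual schema:

```python
from dataclasses import dataclass

# Hypothetical sketch of one InEdit-Bench test case; the class and
# field names here are illustrative, not the official format.
@dataclass
class InEditCase:
    start_image: str   # path to the "Start" photo
    end_image: str     # path to the "End" photo
    num_steps: int     # how many frames the comic strip must contain
    instruction: str   # natural-language description of the journey

    def expected_frames(self) -> int:
        # The model must return a grid with exactly this many images,
        # each one a logical step on the way to the "End" photo.
        return self.num_steps

case = InEditCase(
    start_image="clay_lump.png",
    end_image="clay_vase.png",
    num_steps=4,
    instruction="Shape the lump of clay into a vase, step by step.",
)
```

The point of the structure: the model is graded on all `num_steps` frames, so it cannot just paste the "End" photo into every slot.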
3. The Four Types of Challenges
The test covers four different "genres" of storytelling, just like a library:
- State Transition (The "Lego Builder"):
- Example: Scattered Lego bricks → a finished castle.
- The Test: Did the AI show the bricks snapping together in the right order? Or did it just paste the finished castle on top of the pile?
- Dynamic Process (The "Action Movie"):
- Example: A person running → jumping over a hurdle.
- The Test: Does the runner's leg lift before they jump? Or does the jump happen before the leg moves? The AI must understand physics and motion.
- Temporal Sequence (The "Time-Lapse"):
- Example: A flower bud → a full bloom.
- The Test: Does the flower open slowly and naturally over time? Or does it pop open instantly?
- Scientific Simulation (The "Science Class"):
- Example: Mixing vinegar and baking soda.
- The Test: Does the AI know that bubbles form because of the reaction? It checks if the AI actually understands science or is just guessing.
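The four genres above can be summarized in one small lookup table. The keys are my own paraphrases of the category names, for illustration only:

```python
# Illustrative map of the four InEdit-Bench challenge types to the
# analogies above; the snake_case keys are paraphrases, not official labels.
CHALLENGE_TYPES = {
    "state_transition": "Scattered Lego bricks -> a finished castle",
    "dynamic_process": "A person running -> jumping over a hurdle",
    "temporal_sequence": "A flower bud -> a full bloom",
    "scientific_simulation": "Vinegar + baking soda -> bubbles",
}

def describe(challenge: str) -> str:
    """Return a one-line summary of a challenge category."""
    return f"{challenge}: {CHALLENGE_TYPES[challenge]}"
```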
4. The Grading System (The "Judge")
How do they grade the AI? They don't just look at the pictures; they use a smart AI Judge (a Large Multimodal Model) to act like a strict teacher. The teacher checks six things:
- Appearance: Do the pictures look nice and clear?
- Logic: Does Step 2 actually follow Step 1? (No time travel allowed!)
- Science: Is the physics correct? (Does the water flow down, not up?)
- Consistency: Did the character's shirt change color randomly in the middle of the story?
- Process: Did the AI actually show the process, or did it just skip to the end?
- Path: If you asked for a specific way to do it (e.g., "paint from top to bottom"), did the AI follow that rule?
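To make the six-criteria rubric concrete, here is a minimal scoring sketch. The criterion names are paraphrased from the list above, and the "all six must pass to count as perfect" rule is an assumption for illustration, not the paper's exact aggregation formula:

```python
# Hypothetical sketch of combining the AI Judge's six verdicts into one
# grade. Criterion names and the all-or-nothing "perfect" rule are
# assumptions made for illustration.
CRITERIA = ["appearance", "logic", "science", "consistency", "process", "path"]

def grade(verdicts: dict[str, bool]) -> dict:
    missing = [c for c in CRITERIA if c not in verdicts]
    if missing:
        raise ValueError(f"judge must rule on every criterion: {missing}")
    passed = sum(verdicts[c] for c in CRITERIA)
    return {
        "score": passed / len(CRITERIA),     # fraction of criteria satisfied
        "perfect": passed == len(CRITERIA),  # flawless only if all six pass
    }

# A sequence that looks good but breaks the physics check:
result = grade({
    "appearance": True, "logic": True, "science": False,
    "consistency": True, "process": True, "path": True,
})
```

This mirrors the "strict teacher" idea: a pretty-but-impossible sequence fails the `science` check and therefore can never be graded as perfect.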
5. The Results: The AI is Still a "Toddler"
The researchers tested 14 different AI models (including big names like GPT-4 and various open-source models).
- The Score: Even the best AI only got about 16% of the questions "perfectly" right.
- The Reality Check: Most models struggled to create a logical sequence. They often produced images that looked pretty but made no sense logically (like a car driving on the ceiling).
- The Gap: "Proprietary" models (the big, expensive ones from tech giants) did better than "Open Source" models, but none of them are truly ready for complex, multi-step reasoning yet.
Why Does This Matter?
Imagine you want an AI to help you design a new building, edit a movie scene, or simulate a medical procedure. You can't just say "Make it happen." You need the AI to understand the steps to get there.
InEdit-Bench is a wake-up call. It tells the AI world: "You are great at painting the final picture, but you need to learn how to paint the story behind it." It sets a new standard to push AI from being a "magic trick" to becoming a true "reasoning partner."