Instruction-based Image Editing with Planning, Reasoning, and Generation

This paper proposes a framework for instruction-based image editing that bridges understanding and generation through a multimodal chain-of-thought: it separately handles planning, editing-region reasoning, and hint-guided generation, and achieves superior performance on complex real-world images compared to prior single-modality methods.

Liya Ji, Chenyang Qi, Qifeng Chen

Published 2026-02-27

Imagine you want to hire a professional painter to fix a photo for you. In the past, you might have had to say something vague like, "Make this look nicer," and hope the painter guessed what you meant. Or, you might have had to give a very long, complicated list of technical steps, hoping the painter didn't get confused.

This paper introduces "Multimodal Chain-of-Thought Editing," a new way to talk to an AI artist. Think of it as hiring a Project Manager and a Foreman to work alongside your painter, rather than just talking to the painter directly.

Here is how it works, broken down into simple steps:

1. The Problem: The "Vague Instruction" Trap

If you tell a standard AI, "Make the room warmer," it might just turn the whole picture orange. It doesn't know what to change (the walls? the furniture? the lighting?) or how to do it without ruining the rest of the photo. It's like telling a chef, "Make this soup tastier," without saying whether to add salt, pepper, or herbs.

2. The Solution: The "Three-Step Team"

The authors created a system that breaks your request down into three distinct roles, working together like a construction crew:

Step 1: The Project Manager (The Planner)

  • What it does: This is the "brain" of the operation. When you give a complex instruction (e.g., "Make it a dramatic stormy night"), the Project Manager doesn't just pass the message along. It stops and thinks.
  • The Analogy: Imagine you ask a contractor to "fix the kitchen." A good contractor doesn't just start hammering. They break it down: "First, we need to remove the old cabinets. Second, we need to install new lighting. Third, we need to repaint the walls."
  • In the paper: The AI uses a "Chain-of-Thought" to break your big, abstract idea into a list of small, specific, doable tasks. It translates "dramatic" into "add dark clouds, lightning, and choppy waves."
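To make the planner's role concrete, here is a minimal sketch of that decomposition step. The real system prompts a multimodal LLM with a chain-of-thought; here the model call is mocked with a canned example, and names like `plan_edits` and `PLANNER_PROMPT` are ours for illustration, not the paper's API.

```python
# Hypothetical planner stage: decompose an abstract instruction into a
# short list of concrete, local edits. The LLM call is mocked below.

PLANNER_PROMPT = (
    "You are an image-editing planner. Break the instruction into a short "
    "list of concrete, local edits.\nInstruction: {instruction}\nSteps:"
)

def plan_edits(instruction: str, llm=None) -> list[str]:
    """Return concrete sub-edits for an abstract instruction."""
    if llm is not None:
        # In a real system, `llm` would be a multimodal model given the
        # image plus the formatted prompt.
        return llm(PLANNER_PROMPT.format(instruction=instruction))
    # Canned output standing in for the model's chain-of-thought result.
    canned = {
        "make it a dramatic stormy night": [
            "darken the sky and add heavy clouds",
            "add lightning over the horizon",
            "add choppy waves to the water",
        ]
    }
    # Instructions that are already concrete pass through as a single task.
    return canned.get(instruction.lower(), [instruction])

tasks = plan_edits("Make it a dramatic stormy night")
```

The key idea is the interface, not the mock: one fuzzy request goes in, a checklist of specific, independently executable edits comes out.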

Step 2: The Foreman (The Reasoner)

  • What it does: Once the Project Manager has the list of tasks, the Foreman looks at the photo and points exactly where to work.
  • The Analogy: If the Project Manager says, "Paint the sky," the Foreman draws a line in the sand to show the painter exactly where the sky ends and the mountains begin. It prevents the painter from accidentally painting the mountains blue.
  • In the paper: A special AI model looks at the image and the specific task, then draws a "mask" (a digital stencil) showing exactly which pixels need to change. This ensures the AI doesn't accidentally erase the person standing in the photo while changing the background.
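A toy sketch of what the reasoner produces, under our own simplifying assumptions: real systems use a grounding or segmentation model to localize the task, whereas this stand-in just marks the top half of the image as "sky." The function name and heuristic are illustrative, not from the paper.

```python
# Hypothetical reasoning stage: map a sub-task to a binary mask over the
# image (1 = pixel the generator may change, 0 = leave untouched).

def region_mask(task: str, height: int, width: int) -> list[list[int]]:
    """Return a 0/1 editing mask for a sub-task (toy heuristic)."""
    mask = [[0] * width for _ in range(height)]
    if "sky" in task or "cloud" in task:
        # Crude stand-in for a segmentation model: assume the sky
        # occupies the top half of the frame.
        for y in range(height // 2):
            for x in range(width):
                mask[y][x] = 1
    return mask

m = region_mask("darken the sky and add heavy clouds", 4, 4)
```

Whatever model produces it, the mask is the "stencil": it is what guarantees the person in the foreground is never touched while the background changes.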

Step 3: The Painter (The Generator)

  • What it does: This is the actual AI that creates the new image. But now, it isn't guessing. It has a blueprint (the specific task) and a stencil (the exact area to paint).
  • The Analogy: The painter now knows exactly what to do and where to do it. They can focus all their energy on making the clouds look realistic because they aren't worried about messing up the rest of the picture.
  • In the paper: This is a diffusion model (the current state-of-the-art class of image generators), guided by the hints from the first two steps: the concrete task list and the editing mask.
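The key property of this final step can be sketched in a few lines. Real diffusion inpainting blends in latent space during denoising; this pixel-level composite is only an illustration of the guarantee it provides (function and variable names are ours, not the paper's).

```python
# Toy sketch of hint-guided generation's core guarantee: generator output
# is composited with the original so pixels outside the mask never change.

def apply_masked_edit(image, edited, mask):
    """Take edited pixels where mask == 1, keep the original elsewhere."""
    h, w = len(image), len(image[0])
    return [
        [edited[y][x] if mask[y][x] else image[y][x] for x in range(w)]
        for y in range(h)
    ]

original = [[10, 10], [10, 10]]   # pretend this is the source photo
edited   = [[99, 99], [99, 99]]   # pretend this is raw generator output
mask     = [[1, 1], [0, 0]]       # only the top row is editable
result = apply_masked_edit(original, edited, mask)  # top row edited, bottom kept
```

However creative the generator gets inside the mask, the composite makes "no accidental smudges" outside it a structural property rather than a hope.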

3. Why is this a Big Deal?

  • It handles "Abstract" ideas: Humans are good at understanding words like "cozy," "spooky," or "dramatic." Old AI models struggled with these. This system translates those fuzzy feelings into concrete actions (e.g., "Cozy" = "Add warm blankets and a soft lamp").
  • It doesn't break the photo: Because the "Foreman" tells the AI exactly where not to touch, the rest of the image stays perfect.
  • It's like a conversation: You can talk to it naturally, and it figures out the logic behind your request before doing the work.

Summary Analogy

Think of editing an image with the old way as shouting a command to a blindfolded artist and hoping they get it right.

This new method is like giving the artist a detailed blueprint, a laser-guided stencil, and a step-by-step checklist. The result is a photo that looks exactly like what you imagined, with no accidental smudges or missing details.

The authors tested this on thousands of images and found that it creates much better, more accurate edits than previous methods, especially for complex or creative requests.
