InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning

This paper presents InterCoG, a text-vision interleaved chain-of-grounding reasoning framework that improves fine-grained image editing in complex multi-entity scenes. Before performing visual grounding and specifying the edit outcome, the model explicitly deduces the target location through text-based spatial reasoning; a new dataset and auxiliary training modules further support spatial precision.

Yecong Wan, Fan Li, Chunwei Wang, Hao Wu, Mingwen Shao, Wangmeng Zuo

Published 2026-03-04

Imagine you are an art director giving instructions to a very talented, but slightly literal-minded, digital painter.

The Problem: The "Which One?" Confusion
You point to a crowded photo of a street scene and say, "Change the shirt of the woman standing between the two kids to green."

Older AI models often get confused. They might think, "Which woman? There are ten of them!" or "Which kids? Are they the ones on the left or right?" Because they struggle to understand the relationships between objects (like "between," "next to," or "holding hands"), they might paint the wrong person's shirt green, or accidentally change the color of the whole street. They are great at painting, but bad at figuring out exactly where to paint.

The Solution: InterCoG (The "Detective Painter")
The paper introduces a new framework called InterCoG. Think of InterCoG not just as a painter, but as a detective who is also a painter. Instead of jumping straight to painting, InterCoG follows a strict, step-by-step detective process before touching the canvas.

Here is how it works, using a simple analogy:

1. The Textual Detective (The "Mental Map")

First, InterCoG reads your instruction and looks at the photo, but it doesn't touch the image yet. It acts like a detective writing a report in its notebook.

  • It thinks: "Okay, I see a group of people. There is a woman in the middle. She is holding a child's hand. She is flanked by two other children. That must be the target."
  • Analogy: It's like reading a map and saying, "The treasure is under the big oak tree, next to the red bench," before you even start digging. It figures out the logic of the scene using words.

2. The Visual Detective (The "Highlighter")

Once the detective is sure about the location, it doesn't just guess. It pulls out a highlighter and draws a box and a mask over the specific person it identified in step 1.

  • It does: It highlights the woman in red on the screen.
  • Analogy: This is like putting a "Target Acquired" sticker on the exact person you want to edit. It proves to itself, "Yes, this is the one I was talking about."

3. The Rewriter (The "Clarifier")

Now that it knows exactly who it is editing, it rewrites your original instruction to be super precise.

  • It changes: "Change the shirt of the woman between the kids" becomes "Change the blue shirt of the specific woman highlighted in the red box to green."
  • Analogy: It's like a translator making sure the painter understands the exact nuance of the request, removing any ambiguity.

4. The Painter (The "Execution")

Finally, with the target locked in, the logic confirmed, and the instruction clarified, the painter goes to work. Because it knows exactly where to look, the result is precise: the woman's shirt turns green, and the kids next to her stay exactly the same.
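The four steps above can be sketched as a single pipeline. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, the bounding-box coordinates, and the stubbed editor are all hypothetical placeholders standing in for the model's learned components.

```python
from dataclasses import dataclass

@dataclass
class GroundingStep:
    reasoning: str   # step 1: textual spatial reasoning ("the mental map")
    box: tuple       # step 2: grounded region (x1, y1, x2, y2) ("the highlighter")
    rewritten: str   # step 3: disambiguated instruction ("the clarifier")

def interleaved_edit(image, instruction):
    """Run the interleaved chain-of-grounding before editing (illustrative stub)."""
    # Step 1: reason in text about spatial relations to pick the target.
    reasoning = ("Three people stand together; the middle one is a woman "
                 "flanked by two children, so she must be the target.")
    # Step 2: commit to an explicit region (coordinates here are made up).
    box = (120, 40, 260, 300)
    # Step 3: rewrite the instruction so it references the grounded region.
    rewritten = f"Change the shirt of the woman inside box {box} to green."
    step = GroundingStep(reasoning, box, rewritten)
    # Step 4: execute the edit only inside the grounded region.
    return apply_edit(image, step.box, step.rewritten), step

def apply_edit(image, box, instruction):
    # Placeholder for the actual image editor: just records what would be done.
    return {"image": image, "edited_region": box, "applied": instruction}
```

The point of the structure is that step 4 never runs until steps 1-3 have produced an explicit, checkable region, which is what keeps the edit from spilling onto neighboring people.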

Why is this special?

Most other AI models try to guess the location immediately. If the scene is complex (like a crowd, a busy market, or a forest), they get lost.

InterCoG is special because it interleaves (mixes) thinking and seeing. It doesn't just "think" then "do." It thinks, checks its work visually, thinks again, and then does. It's like a chef who tastes the soup, adds a pinch of salt, tastes it again, and then serves it, rather than just guessing the recipe once and hoping for the best.

The "Training School" (GroundEdit-45K)

To teach this AI to be such a good detective, the researchers built a massive training school called GroundEdit-45K.

  • They didn't just show the AI pictures and say "fix this."
  • They showed the AI pictures and forced it to write out the detective's report (the reasoning steps) before fixing the picture.
  • They also created a "test exam" (GroundEdit-Bench) where the AI has to find hidden targets in tricky situations.
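A training sample in such a dataset plausibly bundles the image pair with the full "detective's report." The field names and values below are assumptions for illustration, not the actual GroundEdit-45K schema:

```python
# Hypothetical shape of one GroundEdit-45K training record (all field
# names and values are illustrative assumptions, not the real schema).
sample = {
    "source_image": "street_scene.jpg",
    "instruction": "Change the shirt of the woman between the two kids to green.",
    "reasoning": "Three people stand together; the middle one is a woman ...",
    "target_box": [120, 40, 260, 300],  # grounded region in pixels (x1, y1, x2, y2)
    "rewritten_instruction": "Change the blue shirt of the highlighted woman to green.",
    "edited_image": "street_scene_edited.jpg",
}

def is_valid_record(rec):
    """Sanity check: every reasoning stage is present and the box is well-formed."""
    required = {"source_image", "instruction", "reasoning",
                "target_box", "rewritten_instruction", "edited_image"}
    if not required <= rec.keys():
        return False
    x1, y1, x2, y2 = rec["target_box"]
    return x1 < x2 and y1 < y2
```

Forcing every record to carry the intermediate reasoning and grounding fields, rather than just an image pair, is what teaches the model to write the report before it paints.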

The Result

The paper shows that InterCoG is much better at these tricky, "find the needle in the haystack" tasks than previous models. It doesn't just change pixels; it understands the story of the image and edits the right part of the story without messing up the rest.

In short: InterCoG is an AI that refuses to guess. It thinks, verifies, and then acts, ensuring that when you ask for a specific change in a crowded room, it changes exactly what you meant, and nothing else.