SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

The paper introduces SpatialReward, a reward model that leverages explicit spatial reasoning to overcome the "Attention Collapse" limitation in existing evaluators, thereby providing fine-grained, accurate signals that significantly enhance online reinforcement learning performance for image editing tasks.

Yancheng Long, Yankai Yang, Hongyang Wei, Wei Chen, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang

Published 2026-03-09

Imagine you are teaching a robot artist how to edit photos. You give it an instruction like, "Change the woman's scarf to a necklace, but keep her smile exactly the same."

The robot tries, but it accidentally changes her smile too, or maybe it makes the necklace look like it's floating in mid-air. You, the human teacher, need to tell the robot, "Good job on the necklace, but you messed up the smile. Try again."

This is the core problem the paper SpatialReward tries to solve. Here is the simple breakdown:

1. The Problem: The "Blind Spot" Robot

Current AI models that judge image edits (called "Reward Models") have a major flaw. The authors call this "Attention Collapse."

  • The Analogy: Imagine a student taking a test. The teacher asks them to compare Photo A (the original) and Photo B (the edited version) to find the differences.
  • The Mistake: Instead of looking at both photos side-by-side, the student gets distracted by Photo B alone. They look at the necklace and say, "Wow, that's a beautiful necklace!" and give it a perfect score. They completely ignore that the edit also accidentally erased the woman's ear or changed the background.
  • The Result: The AI thinks the edit is perfect because it's not paying attention to what didn't change. It fails to notice the "collateral damage."

2. The Solution: "Think with Boxes"

The authors built a new AI judge called SpatialReward. Instead of just staring at the whole picture, this new judge is forced to play a game of "Where's Waldo?" before it gives a grade.

  • The Analogy: Before writing its report card, the AI must first draw a box around the specific things that were supposed to change (e.g., a box around the scarf area).
  • The Magic: Once it draws the box, it is forced to look inside that box in both the original and new photo. It has to say, "Okay, inside this box, the scarf is gone, and the necklace is there. Good."
  • The Safety Net: Then, it looks at the rest of the picture (the parts outside the box) to make sure nothing else changed. "Wait, the ear is gone? That's bad!"

By forcing the AI to draw these "boxes" (spatial reasoning), it stops being blind to the original image. It can't just hallucinate that the edit is perfect; it has to prove it by pointing to the pixels.
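The inside/outside logic above can be sketched in a few lines of Python. This is a toy pixel-level stand-in, not the paper's method: the real SpatialReward is a vision-language model that reasons over images, and the function name, box format, and weighting below are all illustrative assumptions.

```python
import numpy as np

def spatial_reward(original, edited, box, change_weight=0.5):
    """Toy spatial reward: reward change INSIDE the box (the edit
    was applied) and penalize change OUTSIDE it (collateral damage).
    `box` is (y0, x0, y1, x1); images are HxWx3 uint8 arrays."""
    y0, x0, y1, x1 = box
    mask = np.zeros(original.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = True

    # Per-pixel difference between the original and edited images.
    diff = np.abs(original.astype(float) - edited.astype(float)).mean(axis=-1)

    inside_change = diff[mask].mean() / 255.0    # should be high
    outside_change = diff[~mask].mean() / 255.0  # should be low

    return change_weight * inside_change + (1 - change_weight) * (1.0 - outside_change)
```

An edit confined to the box scores higher than one that touches pixels outside it, which is exactly the "safety net" behavior described above.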

3. The Training: The "260,000 Page" Textbook

To teach this AI to be so good at drawing boxes and checking details, the researchers created a massive, custom textbook called SpatialReward-260K.

  • They didn't just show the AI pictures; they showed it pictures with the boxes already drawn and detailed notes explaining why an edit was good or bad.
  • They used a "Teacher" AI (like a super-smart GPT-5) to write these notes, and then trained their model to mimic that careful, box-by-box thinking.
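A single training sample in such a dataset might look like the record below. The schema and field names are my own guesses for illustration; the paper's actual SpatialReward-260K format may differ.

```python
from dataclasses import dataclass, field

@dataclass
class EditJudgment:
    """Hypothetical schema for one SpatialReward-260K-style sample:
    an edit instruction, the image pair, the boxes that should
    change, and a teacher-written rationale with a final score."""
    instruction: str                 # e.g. "Change the scarf to a necklace"
    source_image: str                # path to the original image
    edited_image: str                # path to the edited image
    target_boxes: list = field(default_factory=list)  # (x0, y0, x1, y1) regions expected to change
    rationale: str = ""              # box-by-box explanation written by the teacher model
    score: float = 0.0               # overall quality grade
```

The student model is then trained to reproduce the rationale and score, so the careful box-by-box reasoning becomes part of its own grading process.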

4. The Result: A Master Art Critic

When they tested this new AI judge:

  • On Benchmarks: It beat all other existing judges, including expensive, proprietary ones from big tech companies. It was much better at spotting tiny errors (like a missing ear or a weird shadow).
  • In the Real World: They used this AI judge to train an image-editing model (OmniGen2) with online reinforcement learning. Because the judge was so strict and accurate, the editing model learned much faster: its scores improved by roughly twice as much as when it was trained with the old, "blind" judges.
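How does a reward model actually steer training? A common recipe is a REINFORCE-style update: edits the judge scores above the batch average get reinforced, the rest get discouraged. The sketch below shows only that generic mechanism; it is not the paper's training code, and the function name and baseline choice are assumptions.

```python
def reinforce_weights(policy_logprobs, rewards):
    """REINFORCE-style per-sample loss terms. Each sampled edit gets
    an advantage (reward minus the batch-mean baseline); the loss
    -advantage * log p(edit) pushes the policy toward high-reward
    edits and away from low-reward ones."""
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    return [-a * lp for a, lp in zip(advantages, policy_logprobs)]
```

A sharper, more accurate judge means the advantages point in the right direction more often, which is why a better reward model translates directly into faster learning.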

Summary

Think of SpatialReward as a strict art teacher who refuses to grade a painting until the student points out exactly what they changed and proves they didn't ruin anything else. By forcing the AI to "think with boxes," they fixed the "blind spot" problem, making AI image editing much more reliable and precise.