SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

The paper introduces SpatialReward, a reward model that leverages explicit spatial reasoning to overcome the "Attention Collapse" limitation in existing evaluators, thereby providing fine-grained, accurate signals that significantly enhance online reinforcement learning performance for image editing tasks.

Yancheng Long, Yankai Yang, Hongyang Wei, Wei Chen, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang

Published 2026-03-09

Imagine you are teaching a robot artist how to edit photos. You give it an instruction like, "Change the woman's scarf to a necklace, but keep her smile exactly the same."

The robot tries, but it accidentally changes her smile too, or maybe it makes the necklace look like it's floating in mid-air. You, the human teacher, need to tell the robot, "Good job on the necklace, but you messed up the smile. Try again."

This is the core problem the paper SpatialReward tries to solve. Here is the simple breakdown:

1. The Problem: The "Blind Spot" Robot

Current AI models that judge image edits (called "Reward Models") have a major flaw. The authors call this "Attention Collapse."

  • The Analogy: Imagine a student taking a test. The teacher asks them to compare Photo A (the original) and Photo B (the edited version) to find the differences.
  • The Mistake: Instead of looking at both photos side-by-side, the student gets distracted by Photo B alone. They look at the necklace and say, "Wow, that's a beautiful necklace!" and give it a perfect score. They completely ignore that the edit also accidentally erased the woman's ear or changed the background.
  • The Result: The AI thinks the edit is perfect because it's not paying attention to what didn't change. It fails to notice the "collateral damage."

2. The Solution: "Think with Boxes"

The authors built a new AI judge called SpatialReward. Instead of just staring at the whole picture, this new judge is forced to play a game of "Where's Waldo?" before it gives a grade.

  • The Analogy: Before writing its report card, the AI must first draw a box around the specific things that were supposed to change (e.g., a box around the scarf area).
  • The Magic: Once it draws the box, it is forced to look inside that box in both the original and new photo. It has to say, "Okay, inside this box, the scarf is gone, and the necklace is there. Good."
  • The Safety Net: Then, it looks at the rest of the picture (the parts outside the box) to make sure nothing else changed. "Wait, the ear is gone? That's bad!"

By forcing the AI to draw these "boxes" (spatial reasoning), it stops being blind to the original image. It can't just hallucinate that the edit is perfect; it has to prove it by pointing to the pixels.
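The inside/outside logic above can be sketched in a few lines of Python. This is a toy pixel-level stand-in, not the paper's method: the real SpatialReward is a vision-language model that reasons over images, and the function name, box format, and weighting below are all illustrative assumptions.

```python
import numpy as np

def spatial_reward(original, edited, box, change_weight=0.5):
    """Toy spatial reward: reward change INSIDE the box (the edit
    was applied) and penalize change OUTSIDE it (collateral damage).
    `box` is (y0, x0, y1, x1); images are HxWx3 uint8 arrays."""
    y0, x0, y1, x1 = box
    mask = np.zeros(original.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = True

    # Per-pixel difference between the original and edited images.
    diff = np.abs(original.astype(float) - edited.astype(float)).mean(axis=-1)

    inside_change = diff[mask].mean() / 255.0    # should be high
    outside_change = diff[~mask].mean() / 255.0  # should be low

    return change_weight * inside_change + (1 - change_weight) * (1.0 - outside_change)
```

An edit confined to the box scores higher than one that touches pixels outside it, which is exactly the "safety net" behavior described above.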

3. The Training: The "260,000 Page" Textbook

To teach this AI to be so good at drawing boxes and checking details, the researchers created a massive, custom textbook called SpatialReward-260K.

  • They didn't just show the AI pictures; they showed it pictures with the boxes already drawn and detailed notes explaining why an edit was good or bad.
  • They used a "Teacher" AI (like a super-smart GPT-5) to write these notes, and then trained their model to mimic that careful, box-by-box thinking.
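A single training sample in such a dataset might look like the record below. The schema and field names are my own guesses for illustration; the paper's actual SpatialReward-260K format may differ.

```python
from dataclasses import dataclass, field

@dataclass
class EditJudgment:
    """Hypothetical schema for one SpatialReward-260K-style sample:
    an edit instruction, the image pair, the boxes that should
    change, and a teacher-written rationale with a final score."""
    instruction: str                 # e.g. "Change the scarf to a necklace"
    source_image: str                # path to the original image
    edited_image: str                # path to the edited image
    target_boxes: list = field(default_factory=list)  # (x0, y0, x1, y1) regions expected to change
    rationale: str = ""              # box-by-box explanation written by the teacher model
    score: float = 0.0               # overall quality grade
```

The student model is then trained to reproduce the rationale and score, so the careful box-by-box reasoning becomes part of its own grading process.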

4. The Result: A Master Art Critic

When they tested this new AI judge:

  • On Benchmarks: It beat all other existing judges, including expensive, proprietary ones from big tech companies. It was much better at spotting tiny errors (like a missing ear or a weird shadow).
  • In the Real World: They used this AI judge to train an image-editing model (OmniGen2) with online reinforcement learning. Because the judge was so strict and accurate, the editing model learned much faster: its scores improved by roughly twice as much as when it was trained with the old, "blind" judges.
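How does a reward model actually steer training? A common recipe is a REINFORCE-style update: edits the judge scores above the batch average get reinforced, the rest get discouraged. The sketch below shows only that generic mechanism; it is not the paper's training code, and the function name and baseline choice are assumptions.

```python
def reinforce_weights(policy_logprobs, rewards):
    """REINFORCE-style per-sample loss terms. Each sampled edit gets
    an advantage (reward minus the batch-mean baseline); the loss
    -advantage * log p(edit) pushes the policy toward high-reward
    edits and away from low-reward ones."""
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    return [-a * lp for a, lp in zip(advantages, policy_logprobs)]
```

A sharper, more accurate judge means the advantages point in the right direction more often, which is why a better reward model translates directly into faster learning.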

Summary

Think of SpatialReward as a strict art teacher who refuses to grade a painting until the student points out exactly what they changed and proves they didn't ruin anything else. By forcing the AI to "think with boxes," they fixed the "blind spot" problem, making AI image editing much more reliable and precise.