Enhancing Spatial Understanding in Image Generation via Reward Modeling

This paper introduces a novel approach to enhancing spatial understanding in text-to-image generation. The authors construct a large-scale preference dataset, develop a high-performance reward model called SpatialScore, and leverage it for online reinforcement learning, substantially improving the accuracy of complex spatial relationships in generated images.

Zhenyu Tang, Chaoran Feng, Yufan Deng, Jie Wu, Xiaojie Li, Rui Wang, Yunpeng Chen, Daquan Zhou

Published 2026-03-02

Imagine you are a master chef (the AI image generator) who can cook up beautiful, delicious-looking dishes (images) just by reading a recipe (the text prompt). For a long time, these chefs were amazing at making things look tasty and colorful. But they had a major blind spot: spatial logic.

If you asked them, "Put a red apple on the left, a blue cup on the right, and a banana under the cup," the chef might make a beautiful picture, but the banana might end up floating in the sky or the cup might be inside the apple. They understood the ingredients (objects) but not the table setting (where things go).

This paper introduces a solution to teach these AI chefs how to arrange their ingredients correctly. Here is the story of how they did it, broken down into three simple parts:

1. The Problem: The "Good Enough" Critic

Previously, when the AI made a picture, it was judged by a "critic" (a reward model). But these critics were like food critics who only cared about how pretty the plate looked.

  • The Flaw: If the AI put a cat on a dog's head, the critic might say, "Wow, great colors! 10/10!" even though a cat on a dog's head is physically impossible or weirdly placed.
  • The Result: The AI kept making mistakes because the critic wasn't punishing it for getting the positions wrong.

2. The Solution: Building a "Spatial Sense" Coach

The researchers decided to build a brand-new coach specifically trained to spot spatial errors. They did this in three steps:

Step A: The "Spot the Difference" Training Camp (The Dataset)

To train this new coach, they needed a massive library of examples. They created 80,000 pairs of images.

  • The Setup: They took a perfect recipe (e.g., "A lamp on the left, a plant on the right") and generated a perfect image.
  • The Trap: Then, they slightly messed up the recipe (e.g., "A lamp on the right, a plant on the left") and generated a "broken" image.
  • The Human Touch: Real humans looked at thousands of these pairs to make sure the "perfect" one was actually perfect and the "broken" one was actually broken. This created a high-quality "Spot the Difference" test for the AI.

Step B: The New Coach (SpatialScore)

They trained a new AI model, called SpatialScore, using this massive dataset. Think of this model as a strict geometry teacher.

  • What it does: Instead of just saying "Pretty picture," it asks, "Is the lamp actually on the left? Is the plant touching the floor?"
  • The Surprise: This new coach is so good at spotting spatial errors that it actually beats the most expensive, famous AI critics (like GPT-5) at this specific job. It's like hiring a math genius to grade a geometry test instead of an art teacher.
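The claim that SpatialScore "beats" generalist critics is measured on exactly this kind of preference data: a good spatial reward model should score the correct image above the broken one in each pair. A minimal sketch of that pairwise-accuracy check, where `score` stands in for any reward model and the data layout is an assumption:

```python
def pairwise_accuracy(pairs, score) -> float:
    """Fraction of (prompt, chosen, rejected) pairs in which the reward
    model ranks the spatially correct image higher than the broken one."""
    correct = sum(
        score(p["prompt"], p["chosen"]) > score(p["prompt"], p["rejected"])
        for p in pairs
    )
    return correct / len(pairs)
```

A model that only judges aesthetics scores near chance (0.5) on such pairs, since both images are equally "pretty"; a spatially aware model approaches 1.0.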

Step C: The "Top-K" Filter (The Smart Grading System)

Now, they used this new coach to teach the image generator (the chef) using a method called Reinforcement Learning.

  • The Problem: Sometimes the chef tries 24 different versions of a picture. If the recipe is easy, all 24 might be good; if it is hard, all 24 might be bad. Either way, the attempts earn nearly identical scores, so the relative comparison that drives learning carries almost no signal (like grading on a curve when every student gets the same mark: the curve tells you nothing).
  • The Fix: The researchers added a "Top-K Filter." They told the system: "Ignore the middle-of-the-road attempts. Only look at the best 6 and the worst 6 attempts."
  • Why it works: By focusing only on the extremes (the best and the worst), the AI learns much faster what exactly makes a picture good or bad, without getting confused by average attempts.
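The Top-K idea can be sketched as a filter on group-relative advantages, in the style of GRPO-like training. This is an illustrative sketch, not the paper's implementation: the group size of 24 and K of 6 match the numbers quoted above, but the function name and normalization details are assumptions.

```python
import statistics

def topk_advantages(rewards: list[float], k: int = 6) -> dict[int, float]:
    """Return normalized advantages for only the k highest- and k lowest-
    rewarded samples in a group; middle-of-the-road attempts are dropped
    from the policy-gradient update entirely."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    keep = order[:k] + order[-k:]          # worst-k and best-k indices
    kept = [rewards[i] for i in keep]
    mean = statistics.mean(kept)
    std = statistics.pstdev(kept) or 1.0   # guard against zero spread
    return {i: (rewards[i] - mean) / std for i in keep}
```

With 24 attempts and k=6, only 12 samples contribute gradients: the clearly good ones get positive advantages, the clearly bad ones negative, and the ambiguous middle is ignored.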

3. The Result: A Master Arranger

After this training, the AI became a master of arrangement.

  • Before: If you asked for a complex scene with 5 objects in specific spots, the AI would often miss one or put them in the wrong place.
  • After: The AI now understands that "left" means left, "behind" means behind, and "between" means between. It can follow long, complicated instructions without leaving furniture floating in mid-air.

The Big Picture Analogy

Imagine you are teaching a robot to set a dinner table.

  • Old Way: You tell the robot, "Make a nice table." It puts the forks in the soup and the napkins on the ceiling. You say, "Nice colors!" and it keeps doing it.
  • New Way: You hire a strict butler (SpatialScore) who only cares about the rules of the table setting. You show the robot thousands of examples of "Right" vs. "Wrong" table settings. You then tell the robot to only learn from its best and worst attempts.
  • Outcome: The robot now sets the table perfectly, every time, no matter how many guests (objects) you invite.

In short: This paper built a specialized "spatial sense" teacher for AI, taught it using 80,000 carefully curated examples, and used a smart filtering trick to make the learning process super efficient. The result is AI that can finally understand where things belong in a picture.