Enhancing Spatial Understanding in Image Generation via Reward Modeling

This paper introduces a novel approach to enhancing spatial understanding in text-to-image generation. The authors construct a large-scale preference dataset, develop a high-performance reward model called SpatialScore, and leverage it for online reinforcement learning, substantially improving the accuracy of complex spatial relationships in generated images.

Zhenyu Tang, Chaoran Feng, Yufan Deng, Jie Wu, Xiaojie Li, Rui Wang, Yunpeng Chen, Daquan Zhou

Published 2026-03-02

Imagine you are a master chef (the AI image generator) who can cook up beautiful, delicious-looking dishes (images) just by reading a recipe (the text prompt). For a long time, these chefs were amazing at making things look tasty and colorful. But they had a major blind spot: spatial logic.

If you asked them, "Put a red apple on the left, a blue cup on the right, and a banana under the cup," the chef might make a beautiful picture, but the banana might end up floating in the sky or the cup might be inside the apple. They understood the ingredients (objects) but not the table setting (where things go).

This paper introduces a solution to teach these AI chefs how to arrange their ingredients correctly. Here is the story of how they did it, broken down into three simple parts:

1. The Problem: The "Good Enough" Critic

Previously, when the AI made a picture, it was judged by a "critic" (a reward model). But these critics were like food critics who only cared about how pretty the plate looked.

  • The Flaw: If the AI put a cat on a dog's head, the critic might say, "Wow, great colors! 10/10!" even though a cat on a dog's head is physically impossible or weirdly placed.
  • The Result: The AI kept making mistakes because the critic wasn't punishing it for getting the positions wrong.

2. The Solution: Building a "Spatial Sense" Coach

The researchers decided to build a brand-new coach specifically trained to spot spatial errors. They did this in three steps:

Step A: The "Spot the Difference" Training Camp (The Dataset)

To train this new coach, they needed a massive library of examples. They created 80,000 pairs of images.

  • The Setup: They took a perfect recipe (e.g., "A lamp on the left, a plant on the right") and generated a perfect image.
  • The Trap: Then, they slightly messed up the recipe (e.g., "A lamp on the right, a plant on the left") and generated a "broken" image.
  • The Human Touch: Real humans looked at thousands of these pairs to make sure the "perfect" one was actually perfect and the "broken" one was actually broken. This created a high-quality "Spot the Difference" test for the AI.

Step B: The New Coach (SpatialScore)

They trained a new AI model, called SpatialScore, using this massive dataset. Think of this model as a strict geometry teacher.

  • What it does: Instead of just saying "Pretty picture," it asks, "Is the lamp actually on the left? Is the plant touching the floor?"
  • The Surprise: This new coach is so good at spotting spatial errors that it actually beats the most expensive, famous AI critics (like GPT-5) at this specific job. It's like hiring a math genius to grade a geometry test instead of an art teacher.
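The claim that SpatialScore "beats" generalist critics is measured on exactly this kind of preference data: a good spatial reward model should score the correct image above the broken one in each pair. A minimal sketch of that pairwise-accuracy check, where `score` stands in for any reward model and the data layout is an assumption:

```python
def pairwise_accuracy(pairs, score) -> float:
    """Fraction of (prompt, chosen, rejected) pairs in which the reward
    model ranks the spatially correct image higher than the broken one."""
    correct = sum(
        score(p["prompt"], p["chosen"]) > score(p["prompt"], p["rejected"])
        for p in pairs
    )
    return correct / len(pairs)
```

A model that only judges aesthetics scores near chance (0.5) on such pairs, since both images are equally "pretty"; a spatially aware model approaches 1.0.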

Step C: The "Top-K" Filter (The Smart Grading System)

Now, they used this new coach to teach the image generator (the chef) using a method called Reinforcement Learning.

  • The Problem: Sometimes the chef tries 24 different versions of a picture. If the recipe is easy, all 24 might be good; if it is hard, all 24 might be bad. Either way, the attempts earn nearly identical scores, so the relative comparison that drives learning carries almost no signal (like grading on a curve when every student gets the same mark: the curve tells you nothing).
  • The Fix: The researchers added a "Top-K Filter." They told the system: "Ignore the middle-of-the-road attempts. Only look at the best 6 and the worst 6 attempts."
  • Why it works: By focusing only on the extremes (the best and the worst), the AI learns much faster what exactly makes a picture good or bad, without getting confused by average attempts.
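The Top-K idea can be sketched as a filter on group-relative advantages, in the style of GRPO-like training. This is an illustrative sketch, not the paper's implementation: the group size of 24 and K of 6 match the numbers quoted above, but the function name and normalization details are assumptions.

```python
import statistics

def topk_advantages(rewards: list[float], k: int = 6) -> dict[int, float]:
    """Return normalized advantages for only the k highest- and k lowest-
    rewarded samples in a group; middle-of-the-road attempts are dropped
    from the policy-gradient update entirely."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    keep = order[:k] + order[-k:]          # worst-k and best-k indices
    kept = [rewards[i] for i in keep]
    mean = statistics.mean(kept)
    std = statistics.pstdev(kept) or 1.0   # guard against zero spread
    return {i: (rewards[i] - mean) / std for i in keep}
```

With 24 attempts and k=6, only 12 samples contribute gradients: the clearly good ones get positive advantages, the clearly bad ones negative, and the ambiguous middle is ignored.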

3. The Result: A Master Arranger

After this training, the AI became a master of arrangement.

  • Before: If you asked for a complex scene with 5 objects in specific spots, the AI would often miss one or put them in the wrong place.
  • After: The AI now understands that "left" means left, "behind" means behind, and "between" means between. It can follow long, complicated instructions without leaving furniture floating in mid-air.

The Big Picture Analogy

Imagine you are teaching a robot to set a dinner table.

  • Old Way: You tell the robot, "Make a nice table." It puts the forks in the soup and the napkins on the ceiling. You say, "Nice colors!" and it keeps doing it.
  • New Way: You hire a strict butler (SpatialScore) who only cares about the rules of the table setting. You show the robot thousands of examples of "Right" vs. "Wrong" table settings. You then tell the robot to only learn from its best and worst attempts.
  • Outcome: The robot now sets the table perfectly, every time, no matter how many guests (objects) you invite.

In short: This paper built a specialized "spatial sense" teacher for AI, taught it using 80,000 carefully curated examples, and used a smart filtering trick to make the learning process super efficient. The result is AI that can finally understand where things belong in a picture.