Order Is Not Layout: Order-to-Space Bias in Image Generation

This paper identifies and quantifies "Order-to-Space Bias" (OTS), a systematic flaw in modern image generation models in which the textual order of entities incorrectly dictates their spatial layout. The authors show that this data-driven issue can be effectively mitigated through targeted fine-tuning and early-stage interventions, without compromising generation quality.

Yongkang Zhang, Zonglin Zhao, Yuechen Zhang, Fei Ding, Pei Li, Wenxuan Wang

Published 2026-03-05

Imagine you have a very talented, super-fast artist who can draw anything you describe. You tell them, "Draw a cat and a dog," and they instantly create a beautiful picture. But here's the catch: this artist has a weird, invisible habit.

The Problem: The "First-Mentioned" Rule
No matter what you actually want, this artist always puts the first thing you mention on the left side of the picture and the second thing on the right.

It doesn't matter if you say, "Draw a clock with the number 3 on the right and 9 on the left." If you say "3 and 9," the artist will stubbornly put the 3 on the left and the 9 on the right, even though that's wrong for a real clock. They are ignoring the logic of the real world and just following the order of your words like a robot.

The paper calls this "Order-to-Space Bias." It's like the artist thinks, "You said 'A' first, so 'A' must go on the left. You said 'B' second, so 'B' must go on the right." They treat the order of your words as a strict map, even when the map is wrong.

Why is this happening?

Think of the artist's training data as a massive library of millions of old photos and their captions. In real life, when people describe photos, they often naturally say, "Here is a tree on the left and a house on the right." Because the artist read so many of these, they learned a shortcut: "First word = Left side."

They didn't learn to look at the world to figure out where things go; they learned to listen to the order of words. It's like a student who memorized that "Question 1 is always on the left page" instead of actually learning how to read the book.

The Experiment: The "Swap Test"

The researchers built a special test called OTS-BENCH to catch this behavior. It's like a magic trick:

  1. The Setup: They ask the artist to draw two things, like a "Left-Hand" and a "Right-Hand."
  2. The Trick: They ask the same question twice, but they swap the order of the words.
    • Prompt A: "Draw a Left-Hand and a Right-Hand."
    • Prompt B: "Draw a Right-Hand and a Left-Hand."
  3. The Result: If the artist is smart, both pictures should look the same (Left-Hand on the left, Right-Hand on the right). But because of the bias, the artist draws the Left-Hand on the left for both prompts. In the second prompt, they accidentally put the "Right-Hand" on the left side, which is confusing and wrong!
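The swap test above can be sketched in a few lines of Python. This is not the paper's actual benchmark code; the "generator" here is a deliberately biased stand-in that always places the first-mentioned entity on the left, so you can see how swapping the prompt order exposes the bias. All function names are illustrative.

```python
# A toy illustration of the OTS-BENCH-style "swap test".
# biased_generate is a stand-in for a real image model: it assigns the
# i-th mentioned entity to the i-th spatial slot (left to right),
# mimicking Order-to-Space Bias. Names are illustrative, not from the paper.

def biased_generate(prompt_entities):
    """Fake generator: layout is decided purely by mention order."""
    slots = ["left", "right"]
    return {entity: slots[i] for i, entity in enumerate(prompt_entities)}

def swap_test(entity_a, entity_b):
    """Generate with both word orders. A layout-consistent model should
    place each entity in the same region regardless of mention order."""
    layout_ab = biased_generate([entity_a, entity_b])
    layout_ba = biased_generate([entity_b, entity_a])
    return layout_ab == layout_ba

# The biased stand-in fails: swapping the words swaps the layout.
print(swap_test("left-hand", "right-hand"))  # False
```

A real evaluation would replace `biased_generate` with actual image generation plus an object detector to read off each entity's position, but the pass/fail logic is the same comparison.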

They found that almost all modern AI artists do this. They are so obsessed with the order of words that they break the rules of reality.

The Fix: Breaking the Habit

The researchers discovered that this "bad habit" happens very early in the drawing process. It's like the artist decides the layout of the room in the first few seconds of sketching, before they even start adding details.

They tried two ways to fix it:

  1. The "Mirror Trick" (Fine-Tuning): They showed the artist the same picture twice. Once normally, and once flipped like a mirror image, but with the same caption. This forced the artist to realize, "Hey, the words 'Cat and Dog' can mean a cat on the left OR a cat on the right!" This broke the rigid connection between word order and position.
  2. The "Blind Start" (Timing): They told the artist to start drawing the scene without naming the specific animals first. They said, "Draw two animals," let the artist decide where they go, and then said, "Okay, now make the left one a cat and the right one a dog." This stopped the artist from locking the positions too early based on word order.
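The "Mirror Trick" amounts to a simple data augmentation: pair each training image with a horizontally flipped copy under the same caption, so word order stops predicting which side an entity lands on. Below is a minimal, stdlib-only sketch of that idea, with images represented as rows of pixels; it assumes captions without explicit "left"/"right" words (those would need rewriting, not just flipping), and all names are illustrative.

```python
# Minimal sketch of mirror-flip augmentation for breaking Order-to-Space Bias.
# An "image" here is just a list of pixel rows; a real pipeline would use an
# image library or tensor ops, but the principle is identical.

def hflip(image_rows):
    """Mirror an image left-to-right by reversing each row of pixels."""
    return [list(reversed(row)) for row in image_rows]

def mirror_augment(image_rows, caption):
    """Yield the original and the mirrored image under the SAME caption,
    so 'a cat and a dog' is seen with the cat on either side."""
    return [(image_rows, caption), (hflip(image_rows), caption)]

# Usage: one original sample becomes two training pairs.
img = [["cat", "dog"]]  # symbolic 1x2 "image": cat on the left
pairs = mirror_augment(img, "a cat and a dog")
print(len(pairs))            # 2
print(pairs[1][0])           # [['dog', 'cat']] -- cat now on the right
```

The "Blind Start" fix is harder to sketch without the model internals: it would amount to withholding the entity-specific text conditioning during the earliest denoising steps, when (per the paper's finding) the layout gets locked in, and only injecting the full prompt afterwards.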

The Takeaway

This paper is a wake-up call. It shows that even our most advanced AI artists are taking shortcuts. They aren't truly understanding the world; they are just following a pattern they learned from the internet.

The Moral of the Story: Just because you say "A and B" doesn't mean A has to be on the left. We need to teach AI to look at the meaning of the sentence, not just the order of the words, so they can draw the world correctly, not just the way we happened to type it.