Order Is Not Layout: Order-to-Space Bias in Image Generation

This paper identifies and quantifies "Order-to-Space Bias" (OTS), a systematic flaw in modern image generation models in which the textual order of entities incorrectly dictates their spatial layout. The authors show that this data-driven issue can be effectively mitigated through targeted fine-tuning and early-stage interventions, without compromising generation quality.

Yongkang Zhang, Zonglin Zhao, Yuechen Zhang, Fei Ding, Pei Li, Wenxuan Wang

Published 2026-03-05

Imagine you have a very talented, super-fast artist who can draw anything you describe. You tell them, "Draw a cat and a dog," and they instantly create a beautiful picture. But here's the catch: this artist has a weird, invisible habit.

The Problem: The "First-Mentioned" Rule
No matter what you actually want, this artist always puts the first thing you mention on the left side of the picture and the second thing on the right.

It doesn't matter if you say, "Draw a clock with the number 3 on the right and 9 on the left." If you say "3 and 9," the artist will stubbornly put the 3 on the left and the 9 on the right, even though that's wrong for a real clock. They are ignoring the logic of the real world and just following the order of your words like a robot.

The paper calls this "Order-to-Space Bias." It's like the artist thinks, "You said 'A' first, so 'A' must go on the left. You said 'B' second, so 'B' must go on the right." They treat the order of your words as a strict map, even when the map is wrong.

Why is this happening?

Think of the artist's training data as a massive library of millions of old photos and their captions. In real life, when people describe photos, they often naturally say, "Here is a tree on the left and a house on the right." Because the artist read so many of these, they learned a shortcut: "First word = Left side."

They didn't learn to look at the world to figure out where things go; they learned to listen to the order of words. It's like a student who memorized that "Question 1 is always on the left page" instead of actually learning how to read the book.

The Experiment: The "Swap Test"

The researchers built a special test called OTS-BENCH to catch this behavior. It's like a magic trick:

  1. The Setup: They ask the artist to draw two things, like a "Left-Hand" and a "Right-Hand."
  2. The Trick: They ask the same question twice, but they swap the order of the words.
    • Prompt A: "Draw a Left-Hand and a Right-Hand."
    • Prompt B: "Draw a Right-Hand and a Left-Hand."
  3. The Result: If the artist is smart, both pictures should look the same (Left-Hand on the left, Right-Hand on the right). But because of the bias, the artist draws the Left-Hand on the left for both prompts. In the second prompt, they accidentally put the "Right-Hand" on the left side, which is confusing and wrong!
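The swap test above can be sketched in a few lines of Python. This is not the paper's actual benchmark code; the "generator" here is a deliberately biased stand-in that always places the first-mentioned entity on the left, so you can see how swapping the prompt order exposes the bias. All function names are illustrative.

```python
# A toy illustration of the OTS-BENCH-style "swap test".
# biased_generate is a stand-in for a real image model: it assigns the
# i-th mentioned entity to the i-th spatial slot (left to right),
# mimicking Order-to-Space Bias. Names are illustrative, not from the paper.

def biased_generate(prompt_entities):
    """Fake generator: layout is decided purely by mention order."""
    slots = ["left", "right"]
    return {entity: slots[i] for i, entity in enumerate(prompt_entities)}

def swap_test(entity_a, entity_b):
    """Generate with both word orders. A layout-consistent model should
    place each entity in the same region regardless of mention order."""
    layout_ab = biased_generate([entity_a, entity_b])
    layout_ba = biased_generate([entity_b, entity_a])
    return layout_ab == layout_ba

# The biased stand-in fails: swapping the words swaps the layout.
print(swap_test("left-hand", "right-hand"))  # False
```

A real evaluation would replace `biased_generate` with actual image generation plus an object detector to read off each entity's position, but the pass/fail logic is the same comparison.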

They found that almost all modern AI artists do this. They are so obsessed with the order of words that they break the rules of reality.

The Fix: Breaking the Habit

The researchers discovered that this "bad habit" happens very early in the drawing process. It's like the artist decides the layout of the room in the first few seconds of sketching, before they even start adding details.

They tried two ways to fix it:

  1. The "Mirror Trick" (Fine-Tuning): They showed the artist the same picture twice. Once normally, and once flipped like a mirror image, but with the same caption. This forced the artist to realize, "Hey, the words 'Cat and Dog' can mean a cat on the left OR a cat on the right!" This broke the rigid connection between word order and position.
  2. The "Blind Start" (Timing): They told the artist to start drawing the scene without naming the specific animals first. They said, "Draw two animals," let the artist decide where they go, and then said, "Okay, now make the left one a cat and the right one a dog." This stopped the artist from locking the positions too early based on word order.
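The "Mirror Trick" amounts to a simple data augmentation: pair each training image with a horizontally flipped copy under the same caption, so word order stops predicting which side an entity lands on. Below is a minimal, stdlib-only sketch of that idea, with images represented as rows of pixels; it assumes captions without explicit "left"/"right" words (those would need rewriting, not just flipping), and all names are illustrative.

```python
# Minimal sketch of mirror-flip augmentation for breaking Order-to-Space Bias.
# An "image" here is just a list of pixel rows; a real pipeline would use an
# image library or tensor ops, but the principle is identical.

def hflip(image_rows):
    """Mirror an image left-to-right by reversing each row of pixels."""
    return [list(reversed(row)) for row in image_rows]

def mirror_augment(image_rows, caption):
    """Yield the original and the mirrored image under the SAME caption,
    so 'a cat and a dog' is seen with the cat on either side."""
    return [(image_rows, caption), (hflip(image_rows), caption)]

# Usage: one original sample becomes two training pairs.
img = [["cat", "dog"]]  # symbolic 1x2 "image": cat on the left
pairs = mirror_augment(img, "a cat and a dog")
print(len(pairs))            # 2
print(pairs[1][0])           # [['dog', 'cat']] -- cat now on the right
```

The "Blind Start" fix is harder to sketch without the model internals: it would amount to withholding the entity-specific text conditioning during the earliest denoising steps, when (per the paper's finding) the layout gets locked in, and only injecting the full prompt afterwards.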

The Takeaway

This paper is a wake-up call. It shows that even our most advanced AI artists are taking shortcuts. They aren't truly understanding the world; they are just following a pattern they learned from the internet.

The Moral of the Story: Just because you say "A and B" doesn't mean A has to be on the left. We need to teach AI to look at the meaning of the sentence, not just the order of the words, so they can draw the world correctly, not just the way we happened to type it.