From "What" to "How": Constrained Reasoning for Autoregressive Image Generation

The paper proposes CoR-Painter, a framework for autoregressive image generation built on a "How-to-What" paradigm: constrained reasoning first derives explicit spatial and compositional rules, and only then does the model generate detailed descriptions. The authors report state-of-the-art performance in spatial accuracy and coherence.

Ruxue Yan, Xubo Liu, Wenya Guo, Zhengkun Zhang, Ying Zhang, Xiaojie Yuan

Published 2026-03-04

Imagine you are asking a talented but slightly scatterbrained artist to paint a picture based on a simple sentence: "Put a blue water bottle on a red backpack."

The Old Way (The "What" Problem):
In the past, AI models acted like artists who only heard the what. They would think, "Okay, I need a blue bottle. I need a red backpack. I need to make them look realistic." They would start painting details immediately.

But because they never planned the how, things went wrong. The artist might paint the bottle floating in mid-air or, worse, paint two bottles and two backpacks squashed together in a messy pile, because nobody worked out what "on" actually requires. The result? A picture that looks fine up close but makes no sense when you step back. It's like building a house by laying bricks without a blueprint: the walls might be nice, but the roof might end up on the floor.

The New Way (CoR-Painter: The "How-to-What" Solution):
This paper introduces a new method called CoR-Painter. Instead of rushing to paint, the AI now acts like a master architect who follows a strict two-step process:

1. The "How" Phase (The Architect's Blueprint)

Before drawing a single line, the AI pauses to create a blueprint of constraints. It asks itself:

  • How should these objects relate? (The bottle must be strictly on top, not touching the sides).
  • How should the scene be lit? (Sunny day, outdoor).
  • How should the textures look? (Smooth bottle, rough fabric).

Think of this as the artist sketching a light pencil outline and writing notes like "Bottle goes here, Backpack goes there, no overlapping." This step forces the AI to understand the structure of the image before worrying about the details.
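
To make the blueprint idea concrete, here is a minimal sketch of what such a set of constraints could look like as data. The class names (SpatialRelation, Blueprint) and the field layout are illustrative assumptions, not the paper's actual format; in CoR-Painter the constraints are produced as reasoning text by the model itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SpatialRelation:
    # Hypothetical structure: "subject relation target"
    subject: str
    relation: str
    target: str


@dataclass
class Blueprint:
    # Hypothetical container for the three kinds of "How" notes listed above
    relations: List[SpatialRelation] = field(default_factory=list)
    lighting: str = ""
    textures: Dict[str, str] = field(default_factory=dict)

    def to_text(self) -> str:
        """Render the constraints as plain-text notes, one per line."""
        lines = [f"{r.subject} {r.relation} {r.target}" for r in self.relations]
        if self.lighting:
            lines.append(f"lighting: {self.lighting}")
        for obj, texture in self.textures.items():
            lines.append(f"{obj} texture: {texture}")
        return "\n".join(lines)


blueprint = Blueprint(
    relations=[SpatialRelation("blue water bottle", "on top of", "red backpack")],
    lighting="sunny, outdoor",
    textures={"blue water bottle": "smooth", "red backpack": "rough fabric"},
)
print(blueprint.to_text())
```

The point of the structure is only that the relation between objects is written down explicitly before any visual detail is described.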

2. The "What" Phase (The Painter's Brushstrokes)

Once the blueprint is solid, the AI moves to the What. Now it knows exactly where to put the blue bottle and the red backpack. It fills in the details: "The bottle is translucent," "The backpack has a zipper," etc. Because the "How" was already solved, the "What" has a structure to slot into instead of being improvised.
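
The ordering of the two phases can be sketched in a few lines. Everything here is a toy stand-in: how_to_what, toy_plan_how, and toy_describe_what are hypothetical names, and the hard-coded strings replace what would be model-generated text in the real system. The sketch only shows that the description step receives the constraints as input.

```python
from typing import Callable, List, Tuple


def how_to_what(
    prompt: str,
    plan_how: Callable[[str], List[str]],
    describe_what: Callable[[str, List[str]], str],
) -> Tuple[List[str], str]:
    """Run the 'How' step first, then condition the 'What' step on its output."""
    constraints = plan_how(prompt)
    description = describe_what(prompt, constraints)
    return constraints, description


def toy_plan_how(prompt: str) -> List[str]:
    # Stand-in for the model's constraint reasoning (hard-coded for illustration)
    return ["bottle on top of backpack", "lighting: sunny, outdoor"]


def toy_describe_what(prompt: str, constraints: List[str]) -> str:
    # Stand-in for the detailed description, written after the layout is fixed
    layout = "; ".join(constraints)
    return f"Layout: {layout}. Details: translucent bottle, backpack with a zipper."


prompt = "Put a blue water bottle on a red backpack."
constraints, description = how_to_what(prompt, toy_plan_how, toy_describe_what)
print(description)
```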

The Secret Sauce: The "Tough Coach" (Dual-Objective GRPO)

How do we teach an AI to do this? The authors use a training method called Dual-Objective GRPO. Imagine an athlete training under two coaches at once:

  • Coach A (The Logic Coach): Watches the AI's "blueprint" (the text reasoning). If the AI forgets to say the bottle is on the backpack, Coach A gives a penalty. This ensures the logic is sound.
  • Coach B (The Art Coach): Looks at the final painting. If the painting doesn't match the blueprint, Coach B gives a penalty. This ensures the picture actually follows the plan.

By training the AI with both coaches simultaneously, the model learns that good logic leads to good art.
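
For readers who want to see the mechanics, the sketch below shows the core scoring step that GRPO-style training relies on: the model produces a group of attempts for the same prompt, and each attempt is scored relative to the group average. The two reward lists are made-up numbers, and keeping one advantage stream per coach is an assumption made for illustration; this summary does not specify how the paper defines or combines its two objectives.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each sample relative to its group: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four attempts at the same prompt
text_rewards = [1.0, 0.0, 1.0, 0.0]    # "Coach A": did the reasoning state the relation?
image_rewards = [0.9, 0.2, 0.4, 0.1]   # "Coach B": does the image match the blueprint?

# Assumption: each coach's feedback is normalized separately within the group
text_adv = group_relative_advantages(text_rewards)
image_adv = group_relative_advantages(image_rewards)

for i, (t, v) in enumerate(zip(text_adv, image_adv)):
    print(f"sample {i}: text advantage {t:+.2f}, image advantage {v:+.2f}")
```

Attempts that beat the group average get a positive advantage and are reinforced; attempts below average get a negative one.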

Why This Matters

The paper reports that this method achieves state-of-the-art spatial accuracy and coherence.

  • Old AI: "Here is a bottle and a backpack... oh wait, they are melting into each other."
  • CoR-Painter: "Here is a bottle sitting neatly on a backpack, exactly as requested."

In short, CoR-Painter stops AI from just guessing what to draw and starts teaching it how to organize a scene. It's the difference between a chaotic scribble and a well-composed masterpiece.