From "What" to "How": Constrained Reasoning for Autoregressive Image Generation

Imagine you are asking a talented but slightly scatterbrained artist to paint a picture based on a simple sentence: "Put a blue water bottle on a red backpack."

The Old Way (The "What" Problem):
In the past, AI models acted like artists who only heard the what. They would think, "Okay, I need a blue bottle. I need a red backpack. I need to make them look realistic." They would start painting details immediately.

But because they didn't plan the how, things went wrong. The artist might paint the bottle floating in mid-air or, worse, paint two bottles and two backpacks squashed together in a messy pile, because the instruction "on" was never turned into a concrete layout. The result? A picture that looks okay up close but makes no sense when you step back. It's like building a house by laying bricks randomly without a blueprint; the walls might be nice, but the roof might end up on the floor.

The New Way (CoR-Painter: The "How-to-What" Solution):
This paper introduces a new method called CoR-Painter. Instead of rushing to paint, the AI now acts like a master architect who follows a strict two-step process:

1. The "How" Phase (The Architect's Blueprint)

Before drawing a single line, the AI pauses to create a blueprint of constraints. It asks itself:

How should these objects relate? (The bottle must be strictly on top, not touching the sides).
How should the scene be lit? (Sunny day, outdoor).
How should the textures look? (Smooth bottle, rough fabric).

Think of this as the artist sketching a light pencil outline and writing notes like "Bottle goes here, Backpack goes there, no overlapping." This step forces the AI to understand the structure of the image before worrying about the details.

2. The "What" Phase (The Painter's Brushstrokes)

Once the blueprint is solid, the AI moves to the What. Now it knows exactly where to put the blue bottle and the red backpack. It fills in the details: "The bottle is translucent," "The backpack has a zipper," etc. Because the "How" was already solved, the "What" fits together perfectly.

The Secret Sauce: The "Tough Coach" (Dual-Objective GRPO)

How do we teach an AI to do this? The authors use a training method called Dual-Objective GRPO. Imagine two coaches training the same athlete at once:

Coach A (The Logic Coach): Watches the AI's "blueprint" (the text reasoning). If the AI forgets to say the bottle is on the backpack, Coach A gives a penalty. This ensures the logic is sound.
Coach B (The Art Coach): Looks at the final painting. If the painting doesn't match the blueprint, Coach B gives a penalty. This ensures the picture actually follows the plan.

By training the AI with both coaches simultaneously, the model learns that good logic leads to good art.

Why This Matters

The paper reports that this method markedly improves spatial reasoning, meaning how well objects end up where the prompt says they should be.

Old AI: "Here is a bottle and a backpack... oh wait, they are melting into each other."
CoR-Painter: "Here is a bottle sitting neatly on a backpack, exactly as requested."

In short, CoR-Painter stops AI from just guessing what to draw and starts teaching it how to organize a scene. It's the difference between a chaotic scribble and a well-composed masterpiece.

1. Problem Statement

Current autoregressive (AR) text-to-image generation methods, particularly those utilizing Chain-of-Thought (CoT) reasoning and Reinforcement Learning (RL), suffer from a fundamental limitation: they focus exclusively on "What" to depict (expanding the input prompt into detailed descriptions) but fail to reason about "How" to structure the image globally.

The Core Issue: Existing methods (e.g., T2I-R1) paraphrase prompts into detailed descriptions but lack a coherent generative blueprint for spatial organization.
Consequences: This leads to spatial ambiguity, resulting in unrealistic object overlaps, incorrect object counts, and logical inconsistencies in complex scenes. For example, a prompt might describe objects "on" each other, but without a structural constraint, the model generates redundant or overlapping objects that violate physical logic.
Gap: There is a lack of explicit reasoning regarding composition, layout, and spatial relationships before the generation of specific visual details.

2. Methodology: CoR-Painter

The authors propose CoR-Painter, a novel framework that shifts the paradigm from "What-to-What" to "How-to-What". The method is built on two main pillars: a constrained reasoning pipeline and a dual-objective optimization strategy.

A. The "How-to-What" Generation Pipeline

Inspired by human artistic processes (establishing composition before details), the model performs a two-stage textual reasoning process before generating the image:

Stage 1: "How to Draw" (Constraint Derivation):
- The model first analyzes the input prompt to deduce visual constraints.
- It explicitly identifies objects, attributes, and crucially, spatial relationships and compositional rules.
- Output: A set of structured instructions (e.g., "Object A must be placed neatly on top of Object B," "Background must be an outdoor setting").
Stage 2: "What to Draw" (Detailed Description):
- Guided by the constraints from Stage 1, the model generates a detailed visual description.
- This description incorporates the specific attributes and spatial arrangements defined in the previous step, ensuring global coherence.
- Output: A structured text sequence that serves as a direct mapping for image token generation.
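
The two-stage flow above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the template wording, the `Reasoning` container, and the `generate_text` callable (a stand-in for the model's text decoding) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt templates; the paper's actual wording is not given in this summary.
HOW_TEMPLATE = (
    "Prompt: {prompt}\n"
    "List the visual constraints: objects, attributes, spatial relations, composition."
)
WHAT_TEMPLATE = (
    "Prompt: {prompt}\n"
    "Constraints:\n{constraints}\n"
    "Write a detailed visual description that satisfies every constraint."
)


@dataclass
class Reasoning:
    constraints: str  # Stage 1 output ("How to Draw")
    description: str  # Stage 2 output ("What to Draw")


def how_to_what(prompt: str, generate_text: Callable[[str], str]) -> Reasoning:
    """Run Stage 1, then feed its output into Stage 2."""
    # Stage 1: derive constraints from the raw prompt.
    constraints = generate_text(HOW_TEMPLATE.format(prompt=prompt))
    # Stage 2: the description is conditioned on both the prompt and the constraints.
    description = generate_text(
        WHAT_TEMPLATE.format(prompt=prompt, constraints=constraints)
    )
    return Reasoning(constraints=constraints, description=description)
```

The point of the sketch is the data dependency: the Stage 2 call cannot be issued until Stage 1 has produced its constraints, so the description is always written against an explicit layout plan. In the real model both stages would presumably be produced in one autoregressive pass before the image tokens, rather than as two separate calls.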

B. Dual-Objective GRPO (DO-GRPO)

To optimize this two-stage process, the authors introduce Dual-Objective Group Relative Policy Optimization (DO-GRPO). Unlike standard GRPO, which optimizes a single reward, DO-GRPO treats the textual reasoning and the visual projection as two distinct but coupled objectives.
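
For context, the single-reward baseline can be written out. The following is the commonly used group-relative advantage of GRPO, not a formula taken from this summary; the symbols $G$ (number of outputs sampled per prompt), $r_i$ (scalar reward of the $i$-th output), and $\hat{A}_i$ (its advantage) are notation assumed here:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}, \qquad i = 1, \dots, G
$$

In this outcome-reward form, all tokens of output $i$ share the same $\hat{A}_i$, so a single scalar gives no direct signal about whether the reasoning text or the image tokens caused a low score. That is plausibly the motivation for giving each phase its own reward.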

Objective 1: Textual Reasoning Optimization:
- Goal: Ensure the "How" and "What" text is semantically aligned with the prompt and logically coherent.
- Reward: Semantic Anchoring Reward ($R_{SA}$). This checks whether the model correctly extracts objects and attributes and whether the reasoning chain follows the prompt's constraints.
Objective 2: Visual Projection Optimization:
- Goal: Ensure the generated image faithfully reflects the detailed description and spatial layout.
- Reward: Semantic Projection Reward ($R_{SP}$). This evaluates the alignment between the generated text description and the final image.
Global Supervision:
- Holistic Alignment Reward ($R_{HA}$): A global reward computed by evaluating the final image against the original prompt (using VQA, object detection, and human preference models) to ensure overall scene semantics and aesthetics.
Optimization Mechanism: The policy is updated using a clipped PPO-style objective where advantages are calculated separately for the text generation phase and the image generation phase, allowing for targeted improvements in both reasoning and synthesis.
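
The mechanism described above can be sketched as follows. This is a rough illustration, not the paper's exact formulation: the function names are hypothetical, and adding $R_{HA}$ to each phase's own reward before normalization is an assumption made here, since the summary does not state how the holistic reward is combined with $R_{SA}$ and $R_{SP}$.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group (standard GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dual_objective_advantages(
    r_sa: Sequence[float],  # per-sample text rewards (Semantic Anchoring)
    r_sp: Sequence[float],  # per-sample image rewards (Semantic Projection)
    r_ha: Sequence[float],  # per-sample global rewards (Holistic Alignment)
) -> Tuple[List[float], List[float]]:
    """Compute separate advantages for the text phase and the image phase."""
    # ASSUMPTION: the holistic reward is simply added to each phase's reward.
    text_rewards = [a + h for a, h in zip(r_sa, r_ha)]
    image_rewards = [p + h for p, h in zip(r_sp, r_ha)]
    return (
        group_relative_advantages(text_rewards),
        group_relative_advantages(image_rewards),
    )


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped term for one token: min(ratio * A, clip(ratio) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Under this sketch, text tokens of sample `i` would be weighted by `adv_text[i]` and image tokens by `adv_image[i]` inside `clipped_surrogate`, so a sample with sound reasoning but a poor rendering can be pushed up in one phase and down in the other.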

3. Key Contributions

Paradigm Shift: Introduction of the "How-to-What" framework, which prioritizes structural constraints and spatial reasoning before detailing visual attributes, effectively bridging the gap between prompt understanding and coherent image synthesis.
Constrained Reasoning Framework: The design of a specific inference pipeline that forces the model to output explicit spatial and compositional constraints (Thought) before generating the descriptive text (Description).
Dual-Objective GRPO: A novel RL strategy that decouples and simultaneously optimizes textual reasoning accuracy and visual generation fidelity using modality-specific rewards ($R_{SA}$, $R_{SP}$) and a holistic reward ($R_{HA}$).
State-of-the-Art Performance: Significant improvements in spatial reasoning and compositional tasks compared to existing CoT-based and AR-based methods.

4. Experimental Results

The method was evaluated on three major benchmarks: T2I-CompBench, GenEval, and WISE.

T2I-CompBench:
- CoR-Painter achieved State-of-the-Art (SOTA) performance across nearly all categories.
- Spatial Relationships: Achieved a +5.41% improvement over the previous best method (T2I-R1), demonstrating superior capability in handling object positioning and avoiding overlaps.
- Overall Score: 76.93 (vs. 72.43 for T2I-R1).
GenEval:
- Outperformed previous SOTA methods in spatial positioning tasks (surpassing Janus-FocusDiff by ~5%).
- Showed strong performance in attribute binding and multi-object generation, though counting metrics showed slight fluctuation due to the focus on global layout over strict quantity control.
WISE (World Knowledge):
- Demonstrated superior performance in knowledge-driven generation (e.g., cultural common sense, scientific reasoning), outperforming T2I-R1 and other baselines. This highlights the model's ability to infer implicit objects and scenarios through constraint-guided reasoning.
Ablation Studies:
- Removing the "Thought" (How) process significantly degraded spatial accuracy and semantic consistency.
- Removing specific rewards ($R_{SA}$ or $R_{SP}$) led to drops in attribute binding and text-image alignment, confirming the necessity of the dual-objective design.

5. Significance

Solving Spatial Ambiguity: The paper addresses a critical flaw in current generative AI: the inability to logically structure complex scenes. By explicitly reasoning about "How" to arrange elements, CoR-Painter reduces common artifacts such as object merging and illogical overlaps.
Enhanced Reasoning in Vision: It demonstrates that integrating structured, constraint-based reasoning into autoregressive models significantly boosts their capability to handle complex, multi-object, and knowledge-intensive prompts.
Efficient Optimization: The DO-GRPO strategy provides a robust framework for training multimodal models where text reasoning and visual generation have distinct but interdependent requirements, offering a blueprint for future RL-based image generation research.

In summary, CoR-Painter represents a shift from simple prompt expansion to structured, constraint-driven generation, significantly improving the logical coherence and spatial accuracy of autoregressive image synthesis.
