Agentic Planning with Reasoning for Image Styling via Offline RL

Imagine you want to hire an artist to paint a picture of your living room, but instead of giving them a simple sketch, you have to describe the entire process of how to get there.

If you just say, "Make it look like a cozy, rainy afternoon with a fireplace," a standard AI might get confused. It might make the rain look like snow, put the fireplace on the ceiling, or forget the cozy part entirely. It's like giving a complex order to a chef who only hears vague whispers.

This paper introduces a new way to teach AI how to edit images. Instead of acting on a command right away, the AI learns to think, plan, and reason step-by-step before it touches the picture.

Here is the breakdown using simple analogies:

1. The Problem: The "Vague Chef"

Current AI image editors are like Chefs who guess. You give them a vague order ("Make this a winter wonderland"), and they try to guess what you mean. Sometimes they get it right, but often they mess up the details (like making the snow look like white paint instead of fluffy flakes, or changing the house into a castle). They lack a clear plan.

2. The Solution: The "Architect"

The authors built an AI that acts like an Architect before it acts like a painter.

Step 1: The Blueprint (Planning): Before changing a single pixel, the AI breaks your big request down into small, logical steps.
- Bad Request: "Make it look like a rainy cyberpunk city."
- The Architect's Plan:
  1. Change the weather to "heavy rain."
  2. Change the lighting to "neon blue and pink."
  3. Change the buildings to "futuristic metal."
  4. Add "wet pavement reflections."
Step 2: The Reasoning (The "Why"): Crucially, the AI doesn't just list steps; it explains why it's doing them. "I am changing the lighting to neon because cyberpunk cities are dark and artificial." This "Chain of Thought" helps the AI stay on track.

3. The Training: "Learning from the Best"

How did they teach this AI to be such a good Architect? They used a method called Offline Reinforcement Learning.

Imagine a cooking school where students don't just practice cooking; they watch a Master Chef cook thousands of meals.

The Teacher: A very smart (but expensive) AI generates thousands of "recipes" (plans) for editing images.
The Grading: A judge (another AI) tastes every dish and gives it a score from 0 to 5 stars.
The Student's Lesson: The student AI (the one we actually use) doesn't just copy every recipe.
- It ignores the burnt meals (bad plans).
- It studies the 4-star meals carefully.
- It obsesses over the 5-star meals, learning exactly what made them perfect.

The paper introduces two special ways to study these 5-star meals:

RW (Reward Weighted): Like a student who skips any recipe rated 3 stars or below, spends 1 hour studying a 4-star recipe, and spends 2 hours studying a 5-star recipe. The better the recipe, the more the student learns from it.
SW (Standardized Reward Weighted): This is like a student who realizes, "Hey, this class is really hard, so a 4-star meal is actually amazing!" It adjusts its learning based on how difficult the task was, ensuring it learns the right lessons even when the "perfect" recipes are rare.

4. The Result: Small but Mighty

The most surprising part? They trained a small AI (4 billion or 8 billion parameters, its "brain cells") using this method.

The Old Way: You needed a giant, expensive, closed-source AI (like GPT-4o) to get good results.
The New Way: Their small, open-source AI, because it learned how to plan, actually beat the giant AI in most tests.

The Analogy: It's like teaching a small, smart apprentice to be a master carpenter by showing them the blueprints of the best houses ever built. The apprentice doesn't need to be a giant to build a great house; they just need to know the plan.

Why This Matters

Control: You can finally get complex edits right (e.g., "Keep the dog, but make the background a desert sunset").
Transparency: You can see the AI's "thought process" (the plan) before it edits, so you know why it made a change.
Efficiency: You don't need a supercomputer to get professional results; a smaller, cheaper AI can do the job if it's taught to think first.

In short: This paper teaches AI to stop guessing and start planning. By breaking big, messy creative requests into small, reasoned steps and learning from the best examples, a small AI can now paint better pictures than the giants of the past.

1. Problem Statement

Current state-of-the-art image styling relies heavily on direct prompt-based editing, where users provide natural language instructions to foundation models (e.g., Stable Diffusion, DALL-E 3) to generate or modify images. While effective for simple tasks, this approach fails on complex, multi-dimensional transformations (e.g., "Transform to a golden-hour winter wonderland with magical snowfall while preserving architectural details").

Key Limitations of Direct Editing:

Ambiguity: Natural language prompts are often vague and subjective, failing to explicitly specify which visual dimensions to change, in what order, or how to balance competing constraints.
Instruction Adherence: Models often struggle to coordinate changes across multiple attributes (lighting, season, weather, architecture) simultaneously, leading to inconsistent results, structural artifacts, or failure to follow specific preservation constraints.
Lack of Reasoning: Direct editing lacks an intermediate reasoning step, making it difficult to decompose complex goals into actionable, atomic steps.

2. Methodology

The authors propose a Tool-Based Agentic RL Post-Training Framework that shifts the paradigm from direct prompt-to-image generation to structured planning with explicit reasoning. The core idea is to train a "planner" model to decompose high-level aesthetic intents into sequences of tool calls, which are then synthesized into precise instructions for a frozen image editor.

A. Four-Stage Structured Editing Pipeline

The framework operates through four distinct stages:

Structured Context Extraction: A vision-language model extracts a structured text representation ( $c_i$ ) of the input image across 10 orthogonal dimensions (e.g., location, architecture, time of day, season, weather, mood lighting, color grading, artistic medium, atmospheric effects). This grounds the planning process in explicit visual attributes rather than implicit understanding.
Action Planning with Reasoning: The agent generates a sequence of actions ( $a_{i,j}$ ) and corresponding Chain-of-Thought (CoT) reasoning ( $z_{i,j}$ ) for each step. The reasoning explains why a specific tool is chosen and how it contributes to the overall goal (e.g., "Setting golden-hour lighting creates warm tones...").
Instruction Synthesis: The sequence of actions and reasoning is synthesized into a precise, natural language editing instruction ( $\hat{e}_i$ ).
Image Rendering: A frozen black-box image editor (Qwen-Image-Edit) executes the synthesized instruction to produce the final image. The planner is trained to optimize the planning quality, not the pixel generation itself.
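
The four stages above can be sketched as a small pipeline. This is only an illustrative outline, not the authors' implementation: `PlanStep`, `synthesize_instruction`, and `run_pipeline` are hypothetical names, the three model calls are passed in as plain functions, and a fixed text template stands in for the model-generated instruction synthesis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PlanStep:
    dimension: str   # context dimension the tool edits, e.g. "weather"
    value: str       # target value for that dimension
    reasoning: str   # per-step chain-of-thought (z_{i,j} in the text)


def synthesize_instruction(context: Dict[str, str], plan: List[PlanStep]) -> str:
    """Stage 3: fold the planned actions into one editing instruction.

    Assumption: a template join replaces the model-written synthesis.
    Dimensions the plan does not touch are listed as preserved.
    """
    edits = "; ".join(f"set {s.dimension} to {s.value}" for s in plan)
    touched = {s.dimension for s in plan}
    preserved = ", ".join(d for d in context if d not in touched)
    return f"Apply these edits: {edits}. Preserve: {preserved}."


def run_pipeline(
    image: object,
    goal: str,
    extract_context: Callable[[object], Dict[str, str]],
    plan_actions: Callable[[Dict[str, str], str], List[PlanStep]],
    render: Callable[[object, str], object],
):
    """Run stages 1-4; the renderer is treated as a frozen black box."""
    context = extract_context(image)                     # stage 1
    plan = plan_actions(context, goal)                   # stage 2
    instruction = synthesize_instruction(context, plan)  # stage 3
    return render(image, instruction), plan, instruction  # stage 4


# Toy example with hand-written context and plan (no models involved).
example_context = {"location": "living room", "weather": "clear", "time_of_day": "noon"}
example_plan = [
    PlanStep("weather", "heavy rain", "Rain outside establishes the cozy mood."),
    PlanStep("time_of_day", "late afternoon", "Low light suits a rainy afternoon."),
]
example_instruction = synthesize_instruction(example_context, example_plan)
```

Because only the planner is trained, everything up to `instruction` is where learning happens in this sketch; `render` is called once and never updated.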

B. Synthetic Data Generation

Since no existing datasets provide explicit tool-based styling supervision with reasoning chains, the authors developed a teacher-student pipeline to generate three large-scale synthetic datasets (approx. 10k trajectories each):

Simple: 1-2 atomic actions.
Regular: 3-5 compositional actions across 10 interior design themes.
Complex: 3-5 actions across 83 diverse themes with strict preservation constraints.

Teacher Model: Qwen3-VL-8B-Instruct generates the plans, reasoning, and synthesized instructions.
Reward Evaluation: The teacher evaluates the final trajectory quality on a 0-5 scale across 17 dimensions (including goal alignment, aesthetic quality, and spatial consistency).
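
To make the reward signal concrete, here is a minimal sketch of turning per-dimension scores into one scalar reward per trajectory. The summary does not say how the 17 dimension scores are combined, so the unweighted mean used here is an assumption, and `aggregate_reward` is a hypothetical helper name.

```python
from statistics import mean
from typing import Dict


def aggregate_reward(scores: Dict[str, float]) -> float:
    """Collapse per-dimension scores (each on a 0-5 scale) into one reward.

    Assumption: an unweighted mean; the actual aggregation rule is not
    stated in the summary.
    """
    for name, score in scores.items():
        if not 0.0 <= score <= 5.0:
            raise ValueError(f"score for {name!r} is outside the 0-5 scale: {score}")
    return mean(scores.values())


# Three of the named dimensions, with made-up scores for illustration.
example_reward = aggregate_reward(
    {"goal_alignment": 5.0, "aesthetic_quality": 4.0, "spatial_consistency": 3.0}
)
```

The resulting scalar is what the training methods in the next subsection filter or weight on.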

C. Offline RL Training Algorithms

The paper introduces and compares several offline reinforcement learning methods to train student planners (Qwen3-VL 4B/8B) on the synthetic data:

Standard Supervised Learning (S): Treats all trajectories equally, ignoring reward signals.
Reward-Filtered (R): Discards trajectories whose reward falls below the threshold ( $r < 4.0$ ) and trains only on the remaining high-quality data.
Reward-Weighted (RW): Uses all trajectories but weights the gradient contribution of each sample by its quality score ( $w(r) = \max\{r - 3.0, 0\}$ ). This preserves data diversity while emphasizing high-quality examples.
Standardized Reward-Weighted (SW): An extension of RW that normalizes rewards via z-score standardization before weighting. This reduces gradient variance and adapts to different reward distributions across inputs.
Direct Preference Optimization (DPO): Learns from pairwise comparisons of "chosen" (high reward) vs. "rejected" (low reward) trajectories without an explicit reward model.
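
The differences between these methods come down to how each trajectory's reward is turned into a training weight (or a preference pair). The sketch below illustrates that under stated assumptions: the RW formula and the R threshold come from the text, while clipping the SW z-scores at zero, normalizing the loss by the total weight, and pairing the best plan with the worst for DPO are assumptions made for illustration. All function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple


def weights_s(rewards: Sequence[float]) -> List[float]:
    """S: every trajectory counts equally; rewards are ignored."""
    return [1.0 for _ in rewards]


def weights_r(rewards: Sequence[float], threshold: float = 4.0) -> List[float]:
    """R: keep trajectories with r >= threshold, drop the rest."""
    return [1.0 if r >= threshold else 0.0 for r in rewards]


def weights_rw(rewards: Sequence[float], baseline: float = 3.0) -> List[float]:
    """RW: w(r) = max(r - 3.0, 0), as given in the text."""
    return [max(r - baseline, 0.0) for r in rewards]


def weights_sw(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """SW: z-score the rewards, then clip at zero.

    Assumptions: statistics are taken over the list passed in, and
    negative z-scores are clipped to zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [max((r - mu) / (sigma + eps), 0.0) for r in rewards]


def weighted_loss(nll: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-trajectory negative log-likelihoods."""
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * loss for w, loss in zip(weights, nll)) / total


def preference_pair(
    candidates: Sequence[Tuple[str, float]]
) -> Optional[Tuple[str, str]]:
    """DPO data: (chosen, rejected) plans for one input.

    Assumption: pair the highest-reward plan with the lowest-reward one,
    and skip inputs whose candidates all share the same reward.
    """
    best = max(candidates, key=lambda c: c[1])
    worst = min(candidates, key=lambda c: c[1])
    if best[1] == worst[1]:
        return None
    return best[0], worst[0]
```

For rewards `[3.0, 4.0, 5.0]`, `weights_rw` returns `[0.0, 1.0, 2.0]`: the 3.0 trajectory contributes nothing, and the 5.0 trajectory counts twice as much as the 4.0 one. `weights_sw` instead measures each reward against the mean and spread of the list it is given, so the same raw score can receive a different weight when it appears among different rewards.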

3. Key Contributions

Tool-Based Agentic Framework: A novel methodology combining a compositional library of 10-30 orthogonal primitive tools, structured context representation, and per-step CoT reasoning to decompose complex styling tasks.
Large-Scale Synthetic Datasets: The creation and public release of three datasets (Simple, Regular, Complex) containing ~30,000 trajectories with structured context, multi-step plans, reasoning chains, and quality scores.
Reward-Aware Training Methods: The proposal and empirical validation of RW and SW training methods. The paper demonstrates that weighting trajectories by quality (especially with standardization) is crucial for learning effective compositional planning, outperforming standard SFT and simple filtering.
Comprehensive Evaluation: Extensive experiments on 4B and 8B parameter models showing that compact, open-source planners trained with offline RL can outperform the much larger, closed-source GPT-4o zero-shot baseline in image quality and instruction following.

4. Experimental Results

The authors evaluated their methods across 12 configurations (3 datasets $\times$ 2 model sizes $\times$ 2 modalities: text-only and vision-language).

Performance vs. Baselines:
- Edit-Only (Direct Prompting): Consistently underperformed, confirming that structured planning is essential for complex tasks.
- GPT-4o Zero-Shot: The trained 4B/8B models outperformed GPT-4o in 10 out of 11 configurations regarding image quality.
Method Effectiveness by Task:
- SW (Standardized Reward-Weighted): Achieved the highest scores on compositional text tasks (Regular Text-4B: 78.77, Text-8B: 77.86), excelling in semantic accuracy and instruction following.
- RW (Reward-Weighted): Dominated on simple vision tasks (Simple Vision-4B: 79.33), leveraging visual grounding effectively.
- DPO: Performed best on Complex Vision-8B with diverse themes (85.41), benefiting from broad distribution coverage and fine-grained preference learning.
Reasoning Quality: Models trained with reward-aware methods (RW, SW, DPO) generated significantly higher-quality reasoning chains and action plans compared to standard supervised learning, with a strong correlation between reasoning quality and final image quality.

5. Significance and Impact

Bridging the Gap: The work demonstrates that structured planning with explicit reasoning is a superior paradigm for complex image editing compared to direct prompt-to-image generation. It allows smaller models to outperform massive general-purpose models by specializing in the planning logic.
Data Efficiency: The use of offline RL with reward weighting (RW/SW) allows models to learn effectively from synthetic data without the computational cost of online RL or the need for massive human-labeled datasets.
Interpretability: By decomposing edits into tool calls with reasoning, the system provides transparency into the editing process, making it easier to debug, control, and verify (e.g., identifying exactly which step failed to preserve a constraint).
Open Science: The release of the datasets, code, and training pipelines provides a blueprint for building agentic systems in creative domains, moving beyond "black-box" generation toward controllable, reasoning-driven agents.

In conclusion, the paper establishes that reward-aware offline RL training of agentic planners is a highly effective strategy for complex image styling, enabling compact models to achieve state-of-the-art results that surpass larger, zero-shot baselines.