VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model

The paper introduces VisionCreator-R1, a native visual-generation agent with explicit reflection mechanisms, trained via a Reflection-Plan Co-Optimization (RPCO) methodology that tackles the credit-assignment challenges of multi-step generation. The resulting model outperforms state-of-the-art systems on both single- and multi-image generation benchmarks.

Jinxiang Lai, Wenzhe Zhao, Zexin Lu, Hualei Zhang, Qinyu Yang, Rongwei Quan, Zhimin Li, Shuai Shao, Song Guo, Qinglin Lu

Published Wed, 11 Ma

Imagine you are an art director hiring a team of AI artists to create a complex comic book. You don't just want one picture; you need a whole story with consistent characters, specific scenes, and a clear plot.

For a long time, AI artists were like brilliant but impulsive painters. They could paint a single, stunning portrait instantly. But if you asked them to paint a 10-page story, they would often get lost. They'd draw a hero in Chapter 1, then forget what the hero looked like in Chapter 5, or draw a background that didn't match the mood. They were "plan-driven," meaning they tried to follow a script, but if they made a small mistake early on, they couldn't stop to fix it. They just kept painting over the error, making the whole story worse.

This paper introduces VisionCreator-R1, a new kind of AI artist that has learned a superpower: Reflection.

The Problem: The "Blind Painter" vs. The "Self-Correcting Artist"

Think of the old AI agents as blind painters. They are given a list of instructions (a plan) and they execute it step-by-step.

  • Step 1: Paint a cat. (Done perfectly).
  • Step 2: Paint a dog next to it. (The dog looks like a blob).
  • Step 3: Paint a tree. (The tree is upside down).

The old AI doesn't notice the blob or the upside-down tree. It just keeps going because it's focused on "finishing the plan." By the end, the comic book is a mess.

The new VisionCreator-R1 is like a self-correcting artist. It paints a step, then pauses to look at its own work.

  • "Wait, this dog looks like a blob. I need to fix that before I paint the tree."
  • It erases the blob, paints a real dog, and then moves on.
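The paint-pause-fix behavior above can be sketched as a simple loop. This is a minimal illustration, not the paper's actual API: `generate_step`, `reflect`, and `fix` are hypothetical stand-ins for the model generating an image, critiquing its own output, and regenerating after a failed self-check.

```python
# Hypothetical sketch of the self-correcting loop: generate a step,
# reflect on it, and redo it before moving on if it fails the check.

def generate_step(instruction, history):
    """Stand-in for producing one image for a plan step."""
    # Pretend the "dog" step comes out as a blob (low quality).
    return {"instruction": instruction, "quality": 0.4 if "dog" in instruction else 0.9}

def reflect(image):
    """Stand-in for the model critiquing its own output."""
    return image["quality"] >= 0.7  # "does this actually look right?"

def fix(instruction, history):
    """Stand-in for regenerating after a failed reflection."""
    return {"instruction": instruction, "quality": 0.95}

def run_plan(plan, max_retries=2):
    history = []
    for instruction in plan:
        image = generate_step(instruction, history)
        retries = 0
        # The key difference from a "blind painter": pause after each
        # step and repair it instead of painting over the mistake.
        while not reflect(image) and retries < max_retries:
            image = fix(instruction, history)
            retries += 1
        history.append(image)
    return history

story = run_plan(["paint a cat", "paint a dog next to it", "paint a tree"])
```

A plan-driven agent would be the same loop without the `while` clause: every blob gets appended to `history` and carried into later steps.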

The Big Discovery: Why is this so hard?

The researchers found a tricky problem. Teaching an AI to "pause and reflect" is easy for single pictures (like painting one cat). But it's incredibly hard for long stories (multi-image workflows).

Why? They used a math analogy to explain this:

  • Planning is like giving directions to a GPS. "Turn left, then right." If the GPS says "Turn left," you know immediately if you did it right. The feedback is clear and instant.
  • Reflection in a long story is like trying to judge a single brushstroke in a painting that is still drying. The final result depends on everything that happened before, plus the random "noise" of the paint drying. If the final picture is bad, the AI doesn't know: Did I make a bad reflection decision? Or was the paint just messy?

This "noise" makes it hard for the AI to learn how to reflect in long tasks. It's like trying to learn to juggle while standing on a shaking boat; you can't tell if you dropped the ball because your hands slipped or because the boat shook.

The Solution: The "Decouple-Then-Fuse" Strategy

To fix this, the team invented a training method called RPCO (Reflection-Plan Co-Optimization). Think of it as a three-step apprenticeship:

  1. Stage 1: The Solo Practice (Single Images)
    First, they taught the AI to reflect only on single pictures. Since there's no complex story to mess things up, the AI learned to spot errors and fix them perfectly. It became a master of "Self-Correction" for simple tasks.

  2. Stage 2: The Master Planner (Multi-Image)
    Separately, they trained the AI to be a great "Planner" for long stories. This version learned how to break down complex tasks into logical steps, but it didn't have the self-correction skill yet.

  3. Stage 3: The Perfect Marriage (Co-Optimization)
    Finally, they combined the two. They took the "Self-Correction" skills from Stage 1 and the "Planning" skills from Stage 2 and fused them together.

    • Because the AI already knew how to reflect (from Stage 1), it didn't get confused by the "shaking boat" of the long story.
    • Because it had a strong plan (from Stage 2), it knew when to stop and reflect.
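The three-stage schedule can be summarized structurally. This is a deliberately skeletal sketch: the `train_*` functions are hypothetical stand-ins for real training loops, and the dictionary just tracks which skill each stage is responsible for.

```python
# Structural sketch of the "decouple-then-fuse" RPCO schedule.
# All functions are hypothetical stand-ins for actual training runs.

def train_reflection_single_image(model):
    # Stage 1: learn self-correction where feedback is clean
    # (one image, no long trajectory adding noise to the reward).
    model["reflection"] = True
    return model

def train_planning_multi_image(model):
    # Stage 2: separately, learn to decompose a long story
    # into ordered, logical steps.
    model["planning"] = True
    return model

def co_optimize(model):
    # Stage 3: fine-tune both skills together. Each was already
    # learned in isolation, so the noisy multi-image reward no
    # longer has to teach reflection from scratch.
    model["fused"] = model.get("reflection", False) and model.get("planning", False)
    return model

model = co_optimize(train_planning_multi_image(train_reflection_single_image({})))
```

The ordering is the point: running stage 3 first (joint training from scratch) is exactly the noisy setting the decoupling is designed to avoid.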

The Result: A Super-Artist

The result, VisionCreator-R1, is an AI that handles both simple tasks and complex, multi-step stories better than current state-of-the-art models, even surpassing Gemini 2.5 Pro on the reported benchmarks.

  • Without Reflection: The AI paints a messy comic book and calls it a day.
  • With VisionCreator-R1: The AI paints a step, checks it, fixes the mistakes, plans the next scene, and delivers a perfect comic book where the characters look the same in every frame and the story makes sense.

In a Nutshell

This paper teaches us that to make AI truly creative and reliable, we can't just tell it to "follow the plan." We have to teach it to look back, admit mistakes, and fix them along the way. And to do that, we have to teach it to fix small things first before asking it to fix big, complicated stories.