Visual Planning: Let's Think Only with Images

This paper proposes "Visual Planning," a paradigm in which step-by-step inference for spatial and geometric tasks is carried out purely in visual representations, with no text in the reasoning loop. It also introduces VPRL, a reinforcement learning framework that trains such a planner and outperforms text-only reasoning on visual navigation benchmarks.

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić

Published 2026-02-23

The Big Idea: Stop Talking, Start Sketching

Imagine you are trying to navigate a complex maze.

  • The Old Way (Text-Based): You describe the maze to a friend in words: "Okay, I'm at the start. There's a wall to my left, so I turn right. Then I see a hole, so I go up..." You keep talking, describing every single step, until you finally say, "I'm at the exit!"
  • The Problem: Sometimes, words are clumsy. Describing a 3D turn or a complex spatial relationship in sentences is slow, prone to errors, and can get confusing. It's like trying to explain how to tie your shoes using only a dictionary definition.
  • The New Way (Visual Planning): Instead of talking, you just draw the path. You sketch the start, then sketch the next step, then the next, until you reach the finish line. You don't say a word; you just show the journey.

This paper argues that for tasks involving space, movement, and geometry (like navigating a maze or moving a robot), thinking in pictures is often better than thinking in words.


The Main Characters

  1. The "Talker" (Current AI): Most modern AI models (like the ones you chat with) are great at language. Even when they look at a picture, they immediately turn that picture into words in their "mind" before they solve the problem. They are like a tour guide who refuses to point; they only describe the view.
  2. The "Sketcher" (The New Model): The researchers built a special AI that doesn't speak. It only "sees" and "draws." It takes an image of a problem and generates a sequence of new images that represent the solution. It's like an artist who solves a puzzle by painting the solution frame-by-frame.
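The Sketcher's core loop can be sketched in a few lines. This is a toy illustration, not the paper's model: the hypothetical `next_image` function stands in for a large image-generation model, and here it just slides an agent (the `1` cell) one step to the right on a tiny grid.

```python
import numpy as np

def next_image(state):
    """Stub for a visual planner: predict the next state *image*.
    A real system would use a learned image generator; this toy
    version moves the agent (value 1) one cell to the right."""
    nxt = state.copy()
    r, c = np.argwhere(state == 1)[0]
    nxt[r, c] = 0
    nxt[r, min(c + 1, state.shape[1] - 1)] = 1
    return nxt

def plan_visually(start, goal, max_steps=10):
    """Generate a sequence of images until the goal image appears."""
    state, trajectory = start, [start]
    for _ in range(max_steps):
        state = next_image(state)
        trajectory.append(state)
        if np.array_equal(state, goal):
            break
    return trajectory

start = np.zeros((3, 3), dtype=int); start[1, 0] = 1
goal = np.zeros((3, 3), dtype=int); goal[1, 2] = 1
frames = plan_visually(start, goal)
print(len(frames))  # 3: start frame, one intermediate, goal frame
```

The point of the loop is that the "plan" is nothing but the list of frames itself; no step is ever verbalized.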

The Analogy: The Maze Runner

Think of the AI as a runner trying to get through a giant, foggy maze.

  • The Text-Based Runner: This runner has a megaphone. Every time they take a step, they shout, "I am moving left! Now I am moving forward! Watch out for the pit!"
    • The Flaw: If the runner gets confused, they might shout the wrong direction. Also, shouting takes time. If the maze is huge, they might run out of breath (or computing power) before they finish.
  • The Visual Runner (Visual Planning): This runner doesn't speak. They just look at the map, take a step, look at the new view, take another step, and so on. They are essentially simulating the run in their head by generating a movie of the journey.
    • The Benefit: They don't get stuck translating "left turn" into words. They just see the turn. This makes them faster and more accurate at navigating complex spaces.

How They Taught the AI (The Training Camp)

You can't just tell an AI to "draw a path" and expect it to work immediately. The researchers used a clever two-stage training method called VPRL (Visual Planning via Reinforcement Learning).

Think of this like training a dog to fetch a ball, but the dog is a robot and the ball is a correct path.

  1. Stage 1: The "Playground" (Exploration):
    First, they let the AI wander around the maze randomly. It doesn't know the rules yet. It just learns how to move its "feet" (generate images) without falling off the cliff. It's like letting a child run around a playground to learn how to walk before teaching them a specific game.
  2. Stage 2: The "Coach" (Reinforcement Learning):
    Now, the AI tries to solve the maze.
    • If it draws a path that hits a wall? Bad! The coach gives a "thumbs down" (a penalty).
    • If it draws a path that gets closer to the goal? Good! The coach gives a "thumbs up" (a reward).
    • Over thousands of tries, the AI learns: "Oh, I shouldn't draw that wall collision. I should draw a path that goes around it."

The magic here is that the AI learns by seeing the consequences of its actions, not by reading a textbook about them.
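The "coach" stage above can be made concrete with a small sketch of a progress-style reward. This is an illustration of the idea, not the paper's exact reward (which scores generated images after parsing them back into states, with values that are our assumption here): collisions get a penalty, steps that shrink the distance to the goal get a reward.

```python
def manhattan(a, b):
    """Grid distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def reward(prev_pos, new_pos, goal, walls):
    """Hedged sketch of a progress reward, in the spirit of VPRL:
    penalize invalid moves, reward moves that get closer to the goal."""
    if new_pos in walls:
        return -1.0              # "thumbs down": drew a wall collision
    d_prev = manhattan(prev_pos, goal)
    d_new = manhattan(new_pos, goal)
    if d_new < d_prev:
        return 1.0               # "thumbs up": closer to the goal
    return 0.0 if d_new == d_prev else -0.5  # drifting away costs a bit

goal, walls = (0, 2), {(1, 1)}
print(reward((1, 0), (0, 0), goal, walls))  # 1.0: moved closer
print(reward((1, 0), (1, 1), goal, walls))  # -1.0: hit a wall
```

Over many rollouts, a policy trained against such a signal learns to stop generating collision frames, exactly the "seeing the consequences" behaviour described above.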

Why This Matters (The Results)

The researchers tested this on three types of puzzles:

  1. Frozen Lake: A slippery grid where you must avoid holes.
  2. Maze: A classic labyrinth.
  3. Mini-Behavior: A robot that must pick up an object and put it on a table.
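To see what a "visual state" in such puzzles might look like, here is a toy Frozen Lake-style board encoded as a small RGB array. The layout and colour choices are illustrative assumptions, not the paper's actual rendering; the point is that the whole board becomes one image the planner can consume and produce.

```python
import numpy as np

# Tiny Frozen Lake-style grid: S=start, F=frozen, H=hole, G=goal.
GRID = ["SFF",
        "FHF",
        "FFG"]
COLOURS = {"S": (0, 0, 255),     # start: blue
           "F": (200, 200, 255), # frozen ice: pale blue
           "H": (0, 0, 0),       # hole: black
           "G": (0, 255, 0)}     # goal: green

def to_image(grid):
    """Map each cell type to a colour, yielding one RGB image."""
    h, w = len(grid), len(grid[0])
    img = np.zeros((h, w, 3), dtype=np.uint8)
    for r in range(h):
        for c in range(w):
            img[r, c] = COLOURS[grid[r][c]]
    return img

img = to_image(GRID)
print(img.shape)  # (3, 3, 3): a 3x3 board with 3 colour channels
```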

The Results:

  • The "Talker" models (even the very smart ones like Gemini) often got confused. They would describe the maze wrong or get lost in their own sentences.
  • The "Sketcher" model (Visual Planning) was significantly better. It solved the puzzles more often and handled harder, bigger mazes much more gracefully.
  • Even when the maze got huge (making it very hard for humans to visualize), the Visual Planner kept its cool, while the text-based models fell apart.

The Takeaway

We often assume that "thinking" means "talking." We think that to solve a problem, we need to explain it in words.

This paper suggests that for spatial problems (navigation, robotics, geometry), our brains (and now our AI) have a secret superpower: imagination. We can solve problems by visualizing the future steps, just like a chess player visualizes the board or a driver visualizes a parking maneuver.

By letting AI "think" in images instead of words, we unlock a more natural, intuitive, and powerful way to solve complex physical problems. It's not about replacing language; it's about adding a new tool to the toolbox: The ability to plan with pictures.
