Visual Planning: Let's Think Only with Images

This paper proposes "Visual Planning," a paradigm in which step-by-step inference for spatial and geometric tasks is carried out purely in visual representations, with no text in the reasoning loop. It also introduces VPRL, a reinforcement learning framework that trains such a planner and outperforms text-only reasoning on visual navigation benchmarks.

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić

Published 2026-02-23

The Big Idea: Stop Talking, Start Sketching

Imagine you are trying to navigate a complex maze.

  • The Old Way (Text-Based): You describe the maze to a friend in words: "Okay, I'm at the start. There's a wall to my left, so I turn right. Then I see a hole, so I go up..." You keep talking, describing every single step, until you finally say, "I'm at the exit!"
  • The Problem: Sometimes, words are clumsy. Describing a 3D turn or a complex spatial relationship in sentences is slow, prone to errors, and can get confusing. It's like trying to explain how to tie your shoes using only a dictionary definition.
  • The New Way (Visual Planning): Instead of talking, you just draw the path. You sketch the start, then sketch the next step, then the next, until you reach the finish line. You don't say a word; you just show the journey.

This paper argues that for tasks involving space, movement, and geometry (like navigating a maze or moving a robot), thinking in pictures is often better than thinking in words.


The Main Characters

  1. The "Talker" (Current AI): Most modern AI models (like the ones you chat with) are great at language. Even when they look at a picture, they immediately turn that picture into words in their "mind" before they solve the problem. They are like a tour guide who refuses to point; they only describe the view.
  2. The "Sketcher" (The New Model): The researchers built a special AI that doesn't speak. It only "sees" and "draws." It takes an image of a problem and generates a sequence of new images that represent the solution. It's like an artist who solves a puzzle by painting the solution frame-by-frame.
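The Sketcher's core loop can be sketched in a few lines. This is a toy illustration, not the paper's model: the hypothetical `next_image` function stands in for a large image-generation model, and here it just slides an agent (the `1` cell) one step to the right on a tiny grid.

```python
import numpy as np

def next_image(state):
    """Stub for a visual planner: predict the next state *image*.
    A real system would use a learned image generator; this toy
    version moves the agent (value 1) one cell to the right."""
    nxt = state.copy()
    r, c = np.argwhere(state == 1)[0]
    nxt[r, c] = 0
    nxt[r, min(c + 1, state.shape[1] - 1)] = 1
    return nxt

def plan_visually(start, goal, max_steps=10):
    """Generate a sequence of images until the goal image appears."""
    state, trajectory = start, [start]
    for _ in range(max_steps):
        state = next_image(state)
        trajectory.append(state)
        if np.array_equal(state, goal):
            break
    return trajectory

start = np.zeros((3, 3), dtype=int); start[1, 0] = 1
goal = np.zeros((3, 3), dtype=int); goal[1, 2] = 1
frames = plan_visually(start, goal)
print(len(frames))  # 3: start frame, one intermediate, goal frame
```

The point of the loop is that the "plan" is nothing but the list of frames itself; no step is ever verbalized.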

The Analogy: The Maze Runner

Think of the AI as a runner trying to get through a giant, foggy maze.

  • The Text-Based Runner: This runner has a megaphone. Every time they take a step, they shout, "I am moving left! Now I am moving forward! Watch out for the pit!"
    • The Flaw: If the runner gets confused, they might shout the wrong direction. Also, shouting takes time. If the maze is huge, they might run out of breath (or computing power) before they finish.
  • The Visual Runner (Visual Planning): This runner doesn't speak. They just look at the map, take a step, look at the new view, take another step, and so on. They are essentially simulating the run in their head by generating a movie of the journey.
    • The Benefit: They don't get stuck translating "left turn" into words. They just see the turn. This makes them faster and more accurate at navigating complex spaces.

How They Taught the AI (The Training Camp)

You can't just tell an AI to "draw a path" and expect it to work immediately. The researchers used a clever two-stage training method called VPRL (Visual Planning via Reinforcement Learning).

Think of this like training a dog to fetch a ball, but the dog is a robot and the ball is a correct path.

  1. Stage 1: The "Playground" (Exploration):
    First, they let the AI wander around the maze randomly. It doesn't know the rules yet. It just learns how to move its "feet" (generate images) without falling off the cliff. It's like letting a child run around a playground to learn how to walk before teaching them a specific game.
  2. Stage 2: The "Coach" (Reinforcement Learning):
    Now, the AI tries to solve the maze.
    • If it draws a path that hits a wall? Bad! The coach gives a "thumbs down" (a penalty).
    • If it draws a path that gets closer to the goal? Good! The coach gives a "thumbs up" (a reward).
    • Over thousands of tries, the AI learns: "Oh, I shouldn't draw that wall collision. I should draw a path that goes around it."

The magic here is that the AI learns by seeing the consequences of its actions, not by reading a textbook about them.
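The "coach" stage above can be made concrete with a small sketch of a progress-style reward. This is an illustration of the idea, not the paper's exact reward (which scores generated images after parsing them back into states, with values that are our assumption here): collisions get a penalty, steps that shrink the distance to the goal get a reward.

```python
def manhattan(a, b):
    """Grid distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def reward(prev_pos, new_pos, goal, walls):
    """Hedged sketch of a progress reward, in the spirit of VPRL:
    penalize invalid moves, reward moves that get closer to the goal."""
    if new_pos in walls:
        return -1.0              # "thumbs down": drew a wall collision
    d_prev = manhattan(prev_pos, goal)
    d_new = manhattan(new_pos, goal)
    if d_new < d_prev:
        return 1.0               # "thumbs up": closer to the goal
    return 0.0 if d_new == d_prev else -0.5  # drifting away costs a bit

goal, walls = (0, 2), {(1, 1)}
print(reward((1, 0), (0, 0), goal, walls))  # 1.0: moved closer
print(reward((1, 0), (1, 1), goal, walls))  # -1.0: hit a wall
```

Over many rollouts, a policy trained against such a signal learns to stop generating collision frames, exactly the "seeing the consequences" behaviour described above.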

Why This Matters (The Results)

The researchers tested this on three types of puzzles:

  1. Frozen Lake: A slippery grid where you must avoid holes.
  2. Maze: A classic labyrinth.
  3. Mini-Behavior: A robot that must pick up an object and put it on a table.
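To see what a "visual state" in such puzzles might look like, here is a toy Frozen Lake-style board encoded as a small RGB array. The layout and colour choices are illustrative assumptions, not the paper's actual rendering; the point is that the whole board becomes one image the planner can consume and produce.

```python
import numpy as np

# Tiny Frozen Lake-style grid: S=start, F=frozen, H=hole, G=goal.
GRID = ["SFF",
        "FHF",
        "FFG"]
COLOURS = {"S": (0, 0, 255),     # start: blue
           "F": (200, 200, 255), # frozen ice: pale blue
           "H": (0, 0, 0),       # hole: black
           "G": (0, 255, 0)}     # goal: green

def to_image(grid):
    """Map each cell type to a colour, yielding one RGB image."""
    h, w = len(grid), len(grid[0])
    img = np.zeros((h, w, 3), dtype=np.uint8)
    for r in range(h):
        for c in range(w):
            img[r, c] = COLOURS[grid[r][c]]
    return img

img = to_image(GRID)
print(img.shape)  # (3, 3, 3): a 3x3 board with 3 colour channels
```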

The Results:

  • The "Talker" models (even the very smart ones like Gemini) often got confused. They would describe the maze wrong or get lost in their own sentences.
  • The "Sketcher" model (Visual Planning) was significantly better. It solved the puzzles more often and handled harder, bigger mazes much more gracefully.
  • Even when the maze got huge (making it very hard for humans to visualize), the Visual Planner kept its cool, while the text-based models fell apart.

The Takeaway

We often assume that "thinking" means "talking." We think that to solve a problem, we need to explain it in words.

This paper suggests that for spatial problems (navigation, robotics, geometry), our brains (and now our AI) have a secret superpower: imagination. We can solve problems by visualizing the future steps, just like a chess player visualizes the board or a driver visualizes a parking maneuver.

By letting AI "think" in images instead of words, we unlock a more natural, intuitive, and powerful way to solve complex physical problems. It's not about replacing language; it's about adding a new tool to the toolbox: The ability to plan with pictures.
