DreamToNav: Generalizable Navigation for Robots via Generative Video Planning

DreamToNav is a novel autonomous robot framework that leverages generative video models to translate natural language prompts into executable motion plans, enabling robots to "dream" and physically execute complex navigation tasks with high accuracy across different locomotion platforms.

Valerii Serpiva, Jeffrin Sam, Chidera Simon, Hajira Amjad, Iana Zhura, Artem Lykov, Dzmitry Tsetserukou

Published 2026-03-09

Imagine you have a robot, but instead of being a rigid machine that only understands complex math coordinates, it has a vivid imagination.

This paper introduces DreamToNav, a new way to tell robots where to go. Instead of giving them a GPS-style list of "turn left at 5 meters, then go straight 10 meters," you can just talk to them naturally, like you would to a human friend.

Here is how it works, broken down into a simple story with some creative analogies:

1. The Problem: The Robot is Too Literal

Traditionally, if you want a robot to "follow that person politely," you have to write thousands of lines of code to define what "polite" means (e.g., "stay 2 meters away," "don't cut them off"). It's like trying to teach a dog to do ballet by writing a spreadsheet of muscle movements. It's rigid, boring, and fails when things get messy.

2. The Solution: "Dreaming" the Future

DreamToNav changes the game. It treats video generation as a planning tool. Think of it like this:

  • The Old Way: You give the robot a map and a set of rules. It calculates the path.
  • The DreamToNav Way: You tell the robot, "Go get that blue cup without bumping the chair." The robot then closes its eyes (metaphorically) and dreams a short movie of itself successfully doing exactly that.

3. The Three-Step Magic Trick

The system uses three AI "brains" to make this happen:

Step A: The Translator (Qwen 2.5-VL)

You might say something vague like, "Go over there." The robot needs more detail.

  • Analogy: Imagine you are an architect. You tell your assistant, "Build a house." The assistant asks, "What style? How many rooms?"
  • What it does: This AI takes your vague sentence and your current camera view, then rewrites it into a super-detailed script. It turns "Go over there" into: "Move forward at a slow pace, curve gently to the left to avoid the chair, and stop 1 meter from the blue cup."
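The expansion step above can be sketched in a few lines. The real system queries a vision-language model (Qwen 2.5-VL) with the live camera image; the toy stand-in below just fills a template from a scene description, to show the input/output shape of the step. The function name, the `scene` dictionary, and its fields are illustrative assumptions, not the paper's API.

```python
# Toy stand-in for the instruction-expansion step. The real system asks a
# vision-language model (Qwen 2.5-VL) to do this from the camera view; here a
# simple template fills in scene details instead. Everything below is an
# illustrative sketch, not the paper's actual interface.

def expand_instruction(vague_command: str, scene: dict) -> str:
    """Rewrite a vague command into a detailed navigation script."""
    target = scene.get("target", "the goal")
    obstacles = scene.get("obstacles", [])
    avoid = (
        f"curve gently around the {obstacles[0]}" if obstacles
        else "move straight ahead"
    )
    return (
        f"Move forward at a slow pace, {avoid}, "
        f"and stop {scene.get('stop_distance_m', 1.0)} m from {target}."
    )

scene = {"target": "the blue cup", "obstacles": ["chair"], "stop_distance_m": 1.0}
print(expand_instruction("Go over there", scene))
```

The point is the shape of the transformation: vague words plus scene context in, a concrete, executable script out.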

Step B: The Dreamer (NVIDIA Cosmos 2.5)

Now that the robot has a clear script, it needs to visualize the action.

  • Analogy: This is like a Hollywood special effects studio. The robot doesn't just calculate numbers; it generates a video of itself moving through the room, avoiding obstacles, and reaching the goal.
  • The Magic: Because this AI was trained on real-world physics, the video it creates isn't just a cartoon; it respects gravity, obstacles, and how robots actually move. It's a "what-if" simulation.
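The key idea of "dreaming" as planning can be shown with a toy rollout: imagine future positions step by step, and refuse any imagined step that violates the scene. The real system uses a learned video model (NVIDIA Cosmos) rather than a grid search; this sketch only illustrates that the imagined future must respect obstacles, not free-form fantasy. All of the code below is an assumption-laden toy, not the paper's method.

```python
# A minimal sketch of "dreaming" as planning: roll out imagined robot positions
# toward a goal on a grid, never entering an obstacle cell. The real system
# generates video with NVIDIA Cosmos; this toy only illustrates the principle
# that the imagined rollout has to obey the scene's constraints.

def dream_path(start, goal, obstacles, max_steps=20):
    """Greedily imagine a step-by-step path that avoids obstacle cells."""
    path = [start]
    x, y = start
    for _ in range(max_steps):
        if (x, y) == goal:
            break
        # Candidate moves, ranked by how much closer they get to the goal.
        moves = sorted(
            [(x + 1, y), (x, y + 1), (x, y - 1), (x - 1, y)],
            key=lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1]),
        )
        for nxt in moves:
            if nxt not in obstacles and nxt not in path:
                x, y = nxt
                path.append(nxt)
                break
    return path

# An obstacle sits directly on the straight-line route, so the imagined
# path has to detour around it.
path = dream_path(start=(0, 0), goal=(2, 0), obstacles={(1, 0)})
print(path)
```

A real world model does something far richer (it predicts pixels, lighting, and dynamics), but the planning role is the same: propose a future, keep it only if it is physically plausible.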

Step C: The Detective (Trajectory Extraction)

Now the robot has a video of itself doing the task. How does it turn that video into actual movement?

  • Analogy: Imagine watching a movie of a car driving down a street. A detective watches the movie frame-by-frame, draws a line under the car's tires, and says, "Okay, the car went here, then here, then here."
  • What it does: The system looks at the generated video, finds the robot in every frame, and traces its path. It turns the "movie" back into a set of coordinates (a path) that the real robot can follow.
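The detective step reduces to "find the robot in every frame, then string the detections together." The real system tracks the robot in generated video frames; in the sketch below each "frame" is a tiny ASCII grid, the robot is the `R` character, and the goal is `G`. The representation and function names are illustrative assumptions.

```python
# A minimal sketch of trajectory extraction: locate the robot in each generated
# frame and collect the detections into an ordered waypoint path. The real
# system does this on Cosmos-generated video; here each "frame" is a small
# ASCII grid with the robot drawn as 'R'. Purely illustrative.

def locate_robot(frame):
    """Return the (row, col) of the robot marker in one frame."""
    for r, row in enumerate(frame):
        c = row.find("R")
        if c != -1:
            return (r, c)
    raise ValueError("robot not found in frame")

def extract_trajectory(frames):
    """Turn a sequence of frames into an ordered list of waypoints."""
    return [locate_robot(f) for f in frames]

frames = [
    ["R....", ".....", "....G"],
    [".R...", ".....", "....G"],
    ["..R..", ".....", "....G"],
    [".....", "...R.", "....G"],
]
print(extract_trajectory(frames))  # one (row, col) waypoint per frame
```

In practice the per-frame detection is a vision problem (segmentation or tracking) and the pixel coordinates must be mapped into the robot's real-world frame, but the output is the same kind of object: a list of waypoints the controller can follow.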

4. The Result: "I Dreamed It, So I Can Do It"

The researchers tested this on two very different robots:

  1. A wheeled robot (like a Roomba on steroids).
  2. A quadruped robot (a robot dog).

They gave them natural language commands like "Follow that person" or "Go to the red object."

  • The Outcome: The robots successfully "dreamed" the path, extracted the instructions, and executed them in the real world.
  • The Score: They succeeded about 77% of the time. When they succeeded, they were remarkably accurate, stopping within a few inches of the target and closely tracking the planned path.
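For readers who want the metrics made concrete, here is how a success rate and a final-position error are typically tallied from trial logs. The trial data below is made up purely for illustration; only the roughly 77% success rate comes from the paper.

```python
# Toy metric computation for navigation trials: overall success rate and mean
# final distance-to-goal on the successful runs. The trial log below is
# hypothetical; the paper reports roughly a 77% success rate.

def summarize(trials):
    """trials: list of (succeeded: bool, final_error_m: float)."""
    successes = [err for ok, err in trials if ok]
    rate = len(successes) / len(trials)
    mean_err = sum(successes) / len(successes) if successes else float("nan")
    return rate, mean_err

# Hypothetical log of 13 runs: 10 successes out of 13 is about 77%.
trials = [(True, 0.08)] * 10 + [(False, 1.5)] * 3
rate, mean_err = summarize(trials)
print(f"success rate: {rate:.0%}, mean final error on successes: {mean_err:.2f} m")
```

Note that the error is averaged only over successful runs, which is how "accurate when they succeed" is usually reported.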

Why This Matters

This is a huge leap forward because it removes the need for engineers to program specific rules for every new situation.

  • Before: To make a robot navigate a crowded party, you'd need to code rules for every possible human movement.
  • Now: You just say, "Navigate the party carefully," and the robot imagines the best way to do it, then does it.

In a nutshell: DreamToNav lets robots use their imagination to solve problems. They visualize the solution in a movie first, and if the movie looks good, they go out and act it out in real life. It's the difference between a robot that follows a script and a robot that understands the story.