Imagine you have a robot, but instead of being a rigid machine that only understands raw numeric coordinates, it has a vivid imagination.
This paper introduces DreamToNav, a new way to tell robots where to go. Instead of giving them a GPS-style list of "turn left at 5 meters, then go straight 10 meters," you can just talk to them naturally, like you would to a human friend.
Here is how it works, broken down into a simple story with some creative analogies:
1. The Problem: The Robot is Too Literal
Traditionally, if you want a robot to "follow that person politely," you have to write thousands of lines of code to define what "polite" means (e.g., "stay 2 meters away," "don't cut them off"). It's like trying to teach a dog to do ballet by writing a spreadsheet of muscle movements. It's rigid, boring, and fails when things get messy.
2. The Solution: "Dreaming" the Future
DreamToNav changes the game. It treats video generation as a planning tool. Think of it like this:
- The Old Way: You give the robot a map and a set of rules. It calculates the path.
- The DreamToNav Way: You tell the robot, "Go get that blue cup without bumping the chair." The robot then closes its eyes (metaphorically) and dreams a short movie of itself successfully doing exactly that.
3. The Three-Step Magic Trick
The system uses three AI "brains" to make this happen:
Step A: The Translator (Qwen 2.5-VL)
You might say something vague like, "Go over there." The robot needs more detail.
- Analogy: Imagine you are an architect. You tell your assistant, "Build a house." The assistant asks, "What style? How many rooms?"
- What it does: This AI takes your vague sentence and your current camera view, then rewrites it into a super-detailed script. It turns "Go over there" into: "Move forward at a slow pace, curve gently to the left to avoid the chair, and stop 1 meter from the blue cup."
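To make the translator step concrete, here is a minimal sketch of how a vague command and a scene description could be packed into a rewriting prompt. The function names and prompt wording are illustrative assumptions, and the actual call to Qwen 2.5-VL is replaced by a stub, since this is not the paper's real API.

```python
# Sketch of the "translator" step: a vision-language model rewrites a vague
# command into a detailed navigation script. The model call itself is a
# hypothetical stub here; only the structure of the step is shown.

def build_rewrite_prompt(instruction: str, scene_description: str) -> str:
    """Combine the user's command with the current camera view into a
    prompt asking the VLM for a step-by-step navigation script."""
    return (
        "You are a robot navigation planner.\n"
        f"Camera view: {scene_description}\n"
        f"User command: {instruction}\n"
        "Rewrite the command as a detailed script: speed, direction, "
        "obstacles to avoid, and a precise stopping condition."
    )

def rewrite_instruction(instruction: str, scene_description: str) -> str:
    """Stand-in for the VLM call (hypothetical, not the paper's API)."""
    prompt = build_rewrite_prompt(instruction, scene_description)
    # In the real pipeline this prompt plus the camera image would be sent
    # to the VLM; here we just return the prompt so it can be inspected.
    return prompt

script = rewrite_instruction(
    "Go over there", "a blue cup on a table, a chair in the way"
)
```

The key design point is that the model sees both the language and the current view, so the rewritten script can reference actual objects in the scene ("the blue cup", "the chair") rather than abstract directions.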
Step B: The Dreamer (NVIDIA Cosmos 2.5)
Now that the robot has a clear script, it needs to visualize the action.
- Analogy: This is like a Hollywood special effects studio. The robot doesn't just calculate numbers; it generates a video of itself moving through the room, avoiding obstacles, and reaching the goal.
- The Magic: Because this AI was trained on real-world physics, the video it creates isn't just a cartoon; it respects gravity, obstacles, and how robots actually move. It's a "what-if" simulation.
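A toy stand-in for the dreamer step: the real system asks a physics-aware video model (NVIDIA Cosmos) to generate a short video of the robot executing the script. The sketch below fakes that with a straight-line rollout of robot positions, so the shape of the output (a sequence of frames, each containing the robot somewhere) is clear. All names and shapes here are assumptions for illustration, not the model's interface.

```python
# Placeholder for the "dream": a sequence of robot poses standing in for the
# frames of a generated video. The real model produces images; this stub
# produces the positions those images would depict.

def dream_rollout(start, goal, n_frames=5):
    """Return a list of (x, y) robot positions, linearly interpolated
    from start to goal."""
    frames = []
    for t in range(n_frames):
        a = t / (n_frames - 1)
        x = start[0] + a * (goal[0] - start[0])
        y = start[1] + a * (goal[1] - start[1])
        frames.append((round(x, 2), round(y, 2)))
    return frames

video_path = dream_rollout(start=(0.0, 0.0), goal=(2.0, 1.0))
# First frame shows the start pose, last frame shows the goal pose.
```

A real dreamed video would of course curve around obstacles rather than move in a straight line; that is exactly the physics knowledge the video model contributes.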
Step C: The Detective (Trajectory Extraction)
Now the robot has a video of itself doing the task. How does it turn that video into actual movement?
- Analogy: Imagine watching a movie of a car driving down a street. A detective watches the movie frame-by-frame, draws a line under the car's tires, and says, "Okay, the car went here, then here, then here."
- What it does: The system looks at the generated video, finds the robot in every frame, and traces its path. It turns the "movie" back into a set of coordinates (a path) that the real robot can follow.
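The detective step can be sketched in a few lines: scan each generated frame, locate the robot, and string the per-frame positions into a path. Below, frames are tiny character grids with `R` marking the robot, a stand-in for running a detector or tracker on real video frames; the grid-to-meters scale is a made-up parameter for illustration.

```python
# Sketch of trajectory extraction: find the robot in every frame and
# convert its per-frame grid positions into real-world coordinates.

def locate_robot(frame):
    """Return the (row, col) of the robot marker 'R' in one frame."""
    for r, row in enumerate(frame):
        for c, cell in enumerate(row):
            if cell == "R":
                return (r, c)
    raise ValueError("robot not found in frame")

def extract_trajectory(frames, meters_per_cell=0.5):
    """Trace the robot across frames and convert grid cells to meters."""
    return [
        (r * meters_per_cell, c * meters_per_cell)
        for r, c in (locate_robot(f) for f in frames)
    ]

frames = [
    ["R..", "...", "..."],
    [".R.", "...", "..."],
    ["...", ".R.", "..."],
]
path = extract_trajectory(frames)
# path: [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5)]
```

The resulting list of waypoints is exactly what a standard robot controller expects, which is why the "movie" can be handed off to very different robot bodies.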
4. The Result: "I Dreamed It, So I Can Do It"
The researchers tested this on two very different robots:
- A wheeled robot (like a Roomba on steroids).
- A quadruped robot (a robot dog).
They gave them natural language commands like "Follow that person" or "Go to the red object."
- The Outcome: The robots successfully "dreamed" the path, extracted the trajectory from the video, and executed it in the real world.
- The Score: They succeeded about 77% of the time. When they did succeed, they were highly accurate, stopping within a few inches of their target and closely tracking the planned path.
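Accuracy claims like these are typically quantified with two numbers: the final-position error (how far from the target the robot stopped) and the average deviation between the dreamed path and the executed one. A minimal sketch, with made-up example trajectories rather than the paper's data:

```python
import math

def final_error(planned, executed):
    """Distance between the planned and actual final positions."""
    (px, py), (ex, ey) = planned[-1], executed[-1]
    return math.hypot(px - ex, py - ey)

def mean_deviation(planned, executed):
    """Average point-to-point distance (assumes time-aligned paths)."""
    return sum(
        math.hypot(px - ex, py - ey)
        for (px, py), (ex, ey) in zip(planned, executed)
    ) / len(planned)

planned = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
executed = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.05)]
# final_error ≈ 0.05 m; mean_deviation ≈ 0.05 m for these example paths
```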
Why This Matters
This is a huge leap forward because it removes the need for engineers to program specific rules for every new situation.
- Before: To make a robot navigate a crowded party, you'd need to code rules for every possible human movement.
- Now: You just say, "Navigate the party carefully," and the robot imagines the best way to do it, then does it.
In a nutshell: DreamToNav lets robots use their imagination to solve problems. They visualize the solution in a movie first, and if the movie looks good, they go out and act it out in real life. It's the difference between a robot that follows a script and a robot that understands the story.