ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving

Imagine you are teaching a robot to drive a car. You have two very different teachers to help you:

The "Smart Talker" (The VLM): This teacher is like a brilliant, well-read philosopher. It can look at a scene, read the signs, understand the context, and say, "I see a red light and a pedestrian; I should stop." It's great at logic and language, but it sometimes struggles to visualize exactly what will happen next second-by-second.
The "Daydreamer" (The DWM): This teacher is like a movie director with a crystal ball. It can take the current scene and instantly generate a short movie of what the next few seconds will look like. It's amazing at predicting visual details (like a car swerving), but it sometimes lacks the high-level logic to know why it's doing what it's doing.

The Problem:
Until now, most approaches have relied on either the Smart Talker or the Daydreamer, but not both.

If you only use the Talker, the car might understand the rules but fail to react fast enough to a sudden visual change.
If you only use the Daydreamer, the car might generate a cool movie of the future but make a silly driving decision because it doesn't "think" deeply enough.

The Solution: ImagiDrive
The paper introduces ImagiDrive, which is like hiring both teachers and forcing them to work together in a continuous loop. Think of it as a "Rehearsal and Refine" cycle.

Here is how the loop works, step-by-step:

1. The Initial Guess (The Plan)

The car (the Smart Talker) looks at the current road and says, "Okay, I think I should turn right here." It makes a quick plan.

2. The Daydream (The Imagination)

The car hands this plan to the Daydreamer. The Daydreamer says, "Hold on, let me simulate what happens if you turn right right now." It instantly generates a short video of the next few seconds.

The Twist: In the simulation, the Daydreamer sees a problem! "Oh no, if you turn right now, you'll clip that oncoming car!"

3. The Correction (The Refinement)

The car looks at the Daydreamer's simulation, sees the danger, and says, "You're right! I need to slow down and wait." It changes its plan.

4. The Loop

The car takes the new plan, asks the Daydreamer to simulate it again, and checks whether it is safe. It keeps cycling through Plan, Imagine, Check, and Refine until the plan stops changing or a stopping rule ends the rehearsal.

The "Smart Shortcuts"

Doing this loop over and over can be slow and computationally expensive (like rehearsing a play 100 times before the show). To fix this, the authors added two clever tricks:

The "Stop When It's Good Enough" Button (Early Stopping): The system compares each refined plan with the one before it. If the two are almost identical, it stops rehearsing. It realizes, "We aren't learning anything new; let's just drive." This saves time and computation.
The "Best Path" Picker (Trajectory Selection): The loop usually produces a few slightly different paths. The system keeps all of them, works out their average direction like a compass reading, and picks the path that agrees with it best, ignoring the weird, wobbly ones.

Why is this a big deal?

In benchmark tests (on datasets such as nuScenes and NAVSIM), this "Imagination-and-Planning" team outperformed the earlier solo acts it was compared against.

Safety: The car avoided collisions much better because it could "see" the future before it happened.
Logic: It made smarter decisions because it could "talk" through the scenario.
Efficiency: Even with the extra thinking, the "Stop When Good Enough" trick kept the number of rehearsal rounds low.

In a nutshell:
ImagiDrive is a self-driving car that doesn't just react to the road; it rehearses the future in its mind before making a move. It combines the brainpower of a philosopher with the visual imagination of a movie director to drive safer and smarter.

1. Problem Statement

Autonomous driving requires systems that can not only perceive the current environment but also reason about future scenarios to navigate safely. Current approaches face distinct limitations:

Vision-Language Models (VLMs): While excellent at scene understanding, logical reasoning, and zero-shot generalization, they often lack the ability to generate detailed, pixel-level future scenarios necessary for proactive planning.
Driving World Models (DWMs): These excel at generating realistic future visual sequences based on past observations and actions. However, they often struggle with precise action prediction and lack the high-level causal reasoning capabilities of VLMs.
Integration Challenge: Combining VLMs and DWMs is intuitive but difficult. Key challenges include effectively connecting high-level action decisions with low-level pixel predictions, maintaining computational efficiency (as both components are heavy), and ensuring the generated future scenarios accurately reflect the agent's planned trajectory.

2. Methodology: ImagiDrive

The authors propose ImagiDrive, an end-to-end framework that unifies a VLM-based Driving Agent and a DWM-based Scene Imaginer into a recurrent "Imagination-and-Planning" loop.

Core Components

Driving Agent (VLM-based):
- Built upon large VLMs (e.g., LLaVA-1.6, InternVL2.5).
- Inputs: Multi-modal data including visual frames (current or future), ego-state (speed), textual prompts, and learnable trajectory queries.
- Outputs: Structured natural language descriptions of intent and a sequence of 2D waypoints (trajectory).
- Robustness: Trained with distortion augmentations (blur, noise) to handle artifacts often present in generated future frames.
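
The distortion augmentations mentioned above can be illustrated with a small sketch. It is only an assumption of what such augmentations might look like: the source names blur and noise but gives no parameters, so the 3x3 box blur, the noise level `sigma`, the 50% application probability, and the function names `box_blur`, `add_noise`, and `distort` are all hypothetical. Images are plain nested lists of grayscale values in [0, 1] to keep the example dependency-free.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image, values in [0, 1]


def box_blur(img: Image) -> Image:
    """Replace each pixel with the mean of its 3x3 neighbourhood (clipped at borders)."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            neighbours = [
                img[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            ]
            row.append(sum(neighbours) / len(neighbours))
        out.append(row)
    return out


def add_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise and clamp the result back into [0, 1]."""
    return [
        [min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
        for row in img
    ]


def distort(img: Image, rng: random.Random, sigma: float = 0.05) -> Image:
    """Apply blur and noise independently, each with probability 0.5 (assumed)."""
    if rng.random() < 0.5:
        img = box_blur(img)
    if rng.random() < 0.5:
        img = add_noise(img, sigma, rng)
    return img


# A 3x3 test image with a single bright pixel in the centre.
image = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
blurred = box_blur(image)
noisy = add_noise(image, 0.05, random.Random(0))
augmented = distort(image, random.Random(0))
```

Training on frames degraded this way is meant to make the agent tolerant of the softer, noisier frames a generative model tends to produce.
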
Scene Imaginer (DWM-based):
- Acts as a future scene synthesizer (e.g., Vista, Epona).
- Function: Takes the current observation and a predicted trajectory as conditions to generate a short-term video sequence of future frames.
- Role: Provides the "imagination" of what the world looks like if the agent executes a specific plan.
The Imagination-and-Planning Loop:
- Step 1 (Plan): The Driving Agent predicts an initial trajectory ($Y_0$) based on the current frame.
- Step 2 (Imagine): The Scene Imaginer generates future frames conditioned on $Y_0$.
- Step 3 (Refine): Selected future frames (e.g., at $t = 0.5\,\text{s}$ and $t = 1.0\,\text{s}$) are fed back into the Driving Agent. The agent re-evaluates the situation and refines the trajectory ($Y_1, Y_2, \dots$).
- Iteration: This process repeats until a stopping criterion is met.
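
The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_loop`, `toy_plan`, `toy_imagine`, and `unchanged` are hypothetical names, frames are stand-in strings rather than images, and the iteration cap of 5 is an assumed default.

```python
from typing import Callable, List, Sequence, Tuple

Waypoint = Tuple[float, float]
Trajectory = List[Waypoint]
Frame = str  # stand-in for an image; a real system would pass tensors


def run_loop(
    plan: Callable[[Sequence[Frame]], Trajectory],
    imagine: Callable[[Frame, Trajectory], List[Frame]],
    should_stop: Callable[[Trajectory, Trajectory], bool],
    current_frame: Frame,
    max_iters: int = 5,  # assumed cap, not taken from the source
) -> List[Trajectory]:
    """Run plan -> imagine -> refine and return every trajectory, oldest first."""
    trajectories = [plan([current_frame])]  # Step 1: initial plan Y_0
    for _ in range(max_iters):
        future = imagine(current_frame, trajectories[-1])  # Step 2: imagine
        refined = plan([current_frame, *future])  # Step 3: refine
        trajectories.append(refined)
        if should_stop(trajectories[-2], refined):
            break
    return trajectories


def toy_plan(frames: Sequence[Frame]) -> Trajectory:
    # Hypothetical agent: drive 10 m ahead, or only 4 m if an obstacle is visible.
    reach = 4.0 if any("obstacle" in f for f in frames) else 10.0
    return [(0.0, reach * k / 4) for k in range(1, 5)]


def toy_imagine(frame: Frame, trajectory: Trajectory) -> List[Frame]:
    # Hypothetical imaginer: an obstacle comes into view once the plan
    # travels more than 3 m forward.
    label = "obstacle" if trajectory[-1][1] > 3.0 else "clear"
    return [f"{label}@0.5s", f"{label}@1.0s"]


def unchanged(prev: Trajectory, new: Trajectory) -> bool:
    # Simplest possible stopping rule: stop when the plan no longer changes.
    return prev == new


history = run_loop(toy_plan, toy_imagine, unchanged, "now")
print(len(history))     # 3 trajectories: Y_0, Y_1, Y_2
print(history[-1][-1])  # (0.0, 4.0): the shortened, cautious plan
```

In this toy run the first plan drives 10 m, the imagined frames reveal an obstacle, the refined plan shortens to 4 m, and the third pass confirms it, so the loop stops after two refinements.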

Efficiency and Safety Mechanisms

To address the computational cost and safety risks of iterative generation, two key strategies are introduced:

Early Stopping Strategy (ESS): Uses a Trajectory Convergence Ratio (TCR) metric. If the change between consecutive trajectory predictions falls below a threshold, the loop terminates early to save computation.
Trajectory Selection Strategy (TSS): Instead of simply taking the last iteration's output, the system maintains a buffer of all generated trajectories. It calculates the mean direction vector of all candidates and selects the trajectory that is most directionally consistent with this average. This ensures robustness against erratic predictions in individual iterations.
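
Both strategies can be sketched as follows. The exact TCR formula is not given in this summary, so `convergence_ratio` assumes one plausible form (mean waypoint shift divided by the previous path length), and the 0.05 threshold is a placeholder. The selection step likewise assumes that a trajectory's direction is the unit vector from the ego position (taken as the origin) to its final waypoint. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Trajectory = List[Tuple[float, float]]


def convergence_ratio(prev: Trajectory, new: Trajectory) -> float:
    """Assumed TCR: mean waypoint shift relative to the previous path length."""
    shift = sum(math.dist(p, q) for p, q in zip(prev, new)) / len(prev)
    points = [(0.0, 0.0), *prev]  # ego assumed to sit at the origin
    length = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    return shift / max(length, 1e-6)


def has_converged(prev: Trajectory, new: Trajectory, threshold: float = 0.05) -> bool:
    """Early stopping: True when consecutive plans differ by less than the threshold."""
    return convergence_ratio(prev, new) < threshold


def unit_direction(traj: Trajectory) -> Tuple[float, float]:
    """Unit vector from the origin to the final waypoint (assumed direction measure)."""
    x, y = traj[-1]
    norm = math.hypot(x, y) or 1.0
    return (x / norm, y / norm)


def select_trajectory(candidates: List[Trajectory]) -> Trajectory:
    """Pick the candidate whose direction best matches the mean direction."""
    dirs = [unit_direction(t) for t in candidates]
    mean_x = sum(d[0] for d in dirs) / len(dirs)
    mean_y = sum(d[1] for d in dirs) / len(dirs)
    # Every direction is a unit vector, so the dot product with the mean
    # ranks candidates exactly as cosine similarity would.
    scores = [d[0] * mean_x + d[1] * mean_y for d in dirs]
    return candidates[scores.index(max(scores))]


straight = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
slight = [(0.1, 1.0), (0.2, 2.0), (0.3, 3.0)]
wobbly = [(1.0, 1.0), (2.0, 1.5), (3.0, 1.0)]

print(has_converged(straight, straight))  # True: nothing changed
print(has_converged(straight, wobbly))    # False: the plan moved a lot
best = select_trajectory([straight, slight, wobbly])
print(best is slight)                     # True: closest to the mean heading
```

Selecting against the mean direction rather than taking the last iteration means a single erratic prediction, such as `wobbly` here, cannot become the final plan on its own.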

3. Key Contributions

Unified Framework: According to the authors, ImagiDrive is the first framework to tightly couple a VLM agent with a DWM scene imaginer in a recurrent loop, leveraging "imagination" to refine "planning."
Novel Architecture: Introduces a plug-and-play VLM driving agent capable of multi-modal input and structured trajectory output, alongside a training-free scene imaginer integration.
Efficiency Strategies: Proposes the TCR-based Early Stopping and Directional Consistency-based Trajectory Selection, balancing inference speed with safety.
Comprehensive Evaluation: Extensive validation on the nuScenes, NAVSIM, and NeuroNCAP benchmarks under both open-loop and closed-loop conditions.

4. Experimental Results

The paper evaluates ImagiDrive against state-of-the-art (SOTA) end-to-end methods (e.g., UniAD, VAD, SparseDrive) and world-model-based approaches.

Closed-Loop Performance (NeuroNCAP):
- ImagiDrive-S (with the loop) significantly outperforms ImagiDrive-A (agent only) and all baselines.
- Using InternVL2.5, ImagiDrive-S achieved a NeuroNCAP score of 3.49 (vs. 1.60 for BridgeAD) and reduced the collision rate to 26.80% (vs. 79.60% for UniAD).
- The loop effectively helps the agent anticipate hazards (e.g., oncoming cars) that the agent might miss in a single-pass prediction.
Open-Loop Performance (Turning-nuScenes):
- In challenging turning scenarios, ImagiDrive-S reduced the average collision rate to 0.22% (compared to 0.32% for MomAD and 0.40% for SparseDrive).
- It demonstrated superior trajectory deviation (L2 error) compared to previous VLM-based methods.
NAVSIM Benchmark:
- ImagiDrive-S achieved the highest Predictive Driver Model Score (PDMS) of 87.4, outperforming strong end-to-end planners like VADv2 and TransFuser.
- It excelled in the "Ego Progress" (EP) and "No At-Fault Collision" (NC) metrics, indicating that it handles complex dynamic interactions well.
Ablation Studies:
- Combining ESS and TSS reduced the average number of iterations from 5 to 2.3 while lowering the collision rate from 0.39% to 0.21%, which indicates that the two strategies cut computation without sacrificing safety.
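
To put the ablation figures in relative terms, the short calculation below derives percentage reductions from the four numbers quoted above. The percentages are computed here for illustration and are not reported in the source; `relative_reduction` is a hypothetical helper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


iteration_cut = relative_reduction(5, 2.3)      # average loop iterations
collision_cut = relative_reduction(0.39, 0.21)  # collision rate in percent

print(f"{iteration_cut:.1f}%")  # 54.0%
print(f"{collision_cut:.1f}%")  # 46.2%
```

On these numbers, the two strategies roughly halve both the average loop length and the collision rate.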

5. Significance

ImagiDrive shifts autonomous driving planning from static, single-pass prediction to a dynamic, iterative reasoning process.

Cognitive Simulation: It mimics human driving behavior where a driver "imagines" the outcome of a maneuver before committing to it.
Safety & Robustness: By validating plans against generated future scenarios, the system can detect and correct potential collisions before they occur, significantly improving safety in complex, dynamic environments.
Scalability: The modular design allows for easy integration of stronger VLMs or World Models as they evolve, making it a flexible foundation for future autonomous systems.

In conclusion, ImagiDrive demonstrates that integrating the reasoning power of VLMs with the generative foresight of World Models yields a planning framework that, on the reported benchmarks, is safer than the state-of-the-art methods it was compared against and, thanks to its modular design, more adaptable.