Imagine you are teaching a robot how to drive a car. You can't just let it practice on real roads immediately; that's too dangerous and expensive. Instead, you want to build a super-realistic video game simulator where the robot can imagine the future: "If I turn left here, what will the other cars do? What will the rain look like? Will a pedestrian suddenly step out?"
This is the goal of Generative World Models for autonomous driving. They are AI systems that can "dream" up future driving scenarios.
However, there's a problem: How do you know if the simulator is any good?
The Problem: The "Fake News" of Driving
Researchers have built many of these simulators, but they have been measuring them with the wrong ruler.
- The Old Way: They used general video metrics. It's like judging a driving simulator by asking, "Does this video look pretty?" A model could generate a beautiful, sunny beach drive, but if the physics are wrong (e.g., a car floats through a wall or a pedestrian vanishes into thin air), it's useless for teaching a real car.
- The Missing Pieces: Existing tests ignore:
- Safety: Does the video handle fog, night, or snow?
- Physics: Do the cars move like real cars, or do they jitter and teleport?
- Consistency: Do the other cars stay the same car, or do they morph into different vehicles from one frame to the next?
- Control: If you tell the simulator "turn left," does it actually turn left, or does it ignore you?
The Solution: Introducing "DrivingGen"
The authors of this paper created DrivingGen, the first comprehensive "report card" specifically designed to grade these driving simulators. Think of it as a driving school final exam for AI.
1. The Test Track (The Dataset)
Previous tests mostly used sunny, daytime city drives. DrivingGen is like a driving school that throws everything at you:
- Weather: Rain, snow, fog, sandstorms, and floods.
- Time: Dawn, day, night, and sunset.
- Locations: From busy streets in Tokyo to highways in the US and rural roads in Africa.
- Scenarios: Aggressive cut-ins, pedestrians waiting at crosswalks, and dense traffic jams.
They created a dataset of 400 diverse scenarios to ensure the AI isn't just memorizing one type of road.
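One way to picture the benchmark's coverage is as a small set of scenario axes that get combined into test cases. The names below are illustrative stand-ins, not the paper's exact category labels:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical axis names, drawn from the conditions listed above;
# the paper's actual taxonomy may use different labels.
class Weather(Enum):
    CLEAR = "clear"
    RAIN = "rain"
    SNOW = "snow"
    FOG = "fog"
    SANDSTORM = "sandstorm"
    FLOOD = "flood"

class TimeOfDay(Enum):
    DAWN = "dawn"
    DAY = "day"
    NIGHT = "night"
    SUNSET = "sunset"

@dataclass(frozen=True)
class Scenario:
    weather: Weather
    time_of_day: TimeOfDay
    location: str   # e.g. "Tokyo urban street", "US highway"
    event: str      # e.g. "aggressive cut-in", "dense traffic jam"

# One hard test case: a night-time snowstorm on a highway with a cut-in.
sample = Scenario(Weather.SNOW, TimeOfDay.NIGHT, "US highway", "aggressive cut-in")
print(sample.weather.value, sample.time_of_day.value)  # snow night
```

Crossing even these few axes yields hundreds of distinct combinations, which is why a 400-scenario suite can cover far more conditions than a sunny-daytime-only test set.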
2. The Grading Rubric (The Metrics)
DrivingGen doesn't just look at the video; it looks at the physics and the logic. Each model is graded across four main categories:
- Distribution (The "Vibe Check"): Does the generated world look like a real driving dataset, or does it look like a cartoon? They measure how close the "feel" of the video is to reality.
- Quality (The "Visuals"):
- Subjective: Does it look good to a human?
- Objective: Are there weird flickering lights (like from streetlamps) that would confuse a real car's sensors?
- Trajectory: Is the ride smooth, or does it jerk around like a drunk driver?
- Temporal Consistency (The "Memory Test"):
- Scene: Does the background stay stable?
- Agents: If a red car is there in frame 1, is it still a red car in frame 50?
- The "Ghost" Test: Did a car suddenly vanish for no reason? In the real world, a car only leaves the scene by driving out of view; the AI shouldn't make cars blink out of existence either.
- Trajectory Alignment (The "Obedience Test"): If you tell the AI, "Drive this specific path," does it actually follow it? Many models look pretty but ignore the instructions.
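The "obedience test" can be made concrete with a simple error measure. A common choice in trajectory work is the average displacement error (ADE): the mean distance between the path you asked for and the path the model actually drove. The sketch below is illustrative of that general idea, not the paper's exact metric:

```python
import math

def average_displacement_error(commanded, generated):
    """Mean Euclidean distance between a commanded trajectory and the
    trajectory recovered from the generated video. Both are lists of
    (x, y) points, assumed to be sampled at the same timestamps."""
    assert len(commanded) == len(generated), "trajectories must align in time"
    total = 0.0
    for (cx, cy), (gx, gy) in zip(commanded, generated):
        total += math.hypot(gx - cx, gy - cy)
    return total / len(commanded)

# A model told to curve left that instead drives straight scores badly:
commanded = [(0, 0), (1, 1), (2, 3), (3, 6)]   # curving left
generated = [(0, 0), (1, 0), (2, 0), (3, 0)]   # ignoring the command
print(average_displacement_error(commanded, generated))  # 2.5
```

A perfectly obedient model scores 0; the larger the ADE, the more the model "looked pretty but ignored the instructions."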
The Results: The "Pretty but Dumb" vs. "Ugly but Smart" Dilemma
The authors tested 14 different AI models (from big tech companies and open-source projects) using this new exam. Here is what they found:
- The "Hollywood" Models: Some general video models (the kind that make cool movie clips) produce stunning videos, but they break the laws of physics: cars might slide sideways, or pedestrians might teleport. They are great for art, but terrible for driving.
- The "Engineer" Models: Some models built specifically for driving are very good at physics. They follow the rules of the road and stay on the path. However, their videos often look blurry, weird, or low-quality.
- The Gap: No single model is perfect yet. The best models are either "pretty but dangerous" or "safe but ugly."
Why This Matters
DrivingGen is a massive step forward because it stops researchers from just making "pretty pictures." It forces them to build simulators that are safe, reliable, and controllable.
Think of it this way: Before DrivingGen, we were judging driving simulators by how well they could paint a sunset. Now, we are judging them by how well they can navigate a snowstorm without crashing. This benchmark will help engineers build the "perfect" driving simulator, which is the key to making self-driving cars safe enough for us all to use.