UniFuture: A 4D Driving World Model for Future Generation and Perception

UniFuture introduces a unified 4D driving world model that jointly generates future RGB images and depth maps through a dual-latent sharing scheme and multi-scale latent interaction, achieving superior performance in both dynamic scene forecasting and geometric perception compared to existing specialized models.

Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, Xiang Bai

Published 2026-02-27

Imagine you are driving a car. Right now, your eyes see the road, other cars, and buildings in 2D (flat pictures). But your brain knows that the world is actually 3D (it has depth, distance, and volume) and that it is constantly moving (changing over time).

Most current AI models for self-driving cars are like amazing painters or movie directors. They are great at predicting what the next frame of a video will look like. They can guess that a car will move forward or a light will turn red. However, they are terrible at understanding the physics of that movement. They might paint a car that looks real but is floating in the air, or they might make a building stretch like taffy as it moves. They are "hallucinating" a movie, not simulating reality.

Other models are like 3D scanners. They are very good at measuring how far away things are right now, but they are "frozen in time." They can't guess what the world will look like five seconds from now.

UniFuture is the new kid on the block that combines both superpowers. It doesn't just paint a movie; it simulates a living, breathing 4D world.

Here is how it works, using some simple analogies:

1. The "Double-Exposure" Camera (Dual-Latent Sharing)

Imagine taking a photo of a street scene. Now, imagine taking a second photo of the exact same scene, but this time, instead of colors, the photo shows only the distance to every object (a depth map).

Old AI models treated these two photos as completely separate tasks. They had one brain for colors and a different brain for distance.
UniFuture says, "Wait a minute! These are the same scene!"
It uses a Dual-Latent Sharing scheme. Think of this as a shared notebook. Instead of writing the story of the colors in one notebook and the story of the distances in another, UniFuture writes both stories in the same notebook. This forces the AI to understand that if a car is "red" (color), it must also be "50 meters away" (distance). They are locked together, just like they are in the real world.
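The "shared notebook" idea can be sketched in a few lines. This is a toy illustration, not the paper's architecture: the projection weights, dimensions, and the simple averaging step are all stand-in assumptions for what would be learned parameters and a learned fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: a 4x4 image patch with 3 RGB channels and 1 depth channel.
H, W = 4, 4
rgb = rng.random((3, H, W))    # the "color photo"
depth = rng.random((1, H, W))  # the "distance photo"

# Each modality is projected into the SAME latent space
# (random matrices here stand in for trained encoders).
latent_dim = 8
W_rgb = rng.random((latent_dim, 3))
W_depth = rng.random((latent_dim, 1))

z_rgb = np.einsum('dc,chw->dhw', W_rgb, rgb)
z_depth = np.einsum('dc,chw->dhw', W_depth, depth)

# Dual-latent sharing: both stories are written into one notebook, so a
# single predicted future latent can be decoded into BOTH the next RGB
# frame and the next depth map.
z_shared = 0.5 * (z_rgb + z_depth)

print(z_shared.shape)  # (8, 4, 4)
```

Because appearance and geometry live in one latent, the model cannot predict a "red car" without also committing to where that car sits in space.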

2. The "Tug-of-War" Team (Multi-scale Latent Interaction)

Now that the AI has this shared notebook, how does it make sure the story makes sense? It uses a mechanism called Multi-scale Latent Interaction.

Think of this as a construction crew building a house:

  • The Geometry Team (The Architects): They hold the blueprints. They say, "The wall must be straight. The car cannot pass through the wall." They act as a strict supervisor.
  • The Texture Team (The Painters): They want to make the wall look beautiful and the car look shiny.

In UniFuture, these two teams are in a constant, helpful tug-of-war:

  • The Architects tell the Painters, "You can't paint a car here because the blueprint says there's a wall." This stops the AI from making impossible hallucinations.
  • The Painters tell the Architects, "Hey, the shadow on the ground suggests that wall is actually curved, not straight." This helps the Architects refine their blueprints.

This back-and-forth ensures that the final result is a video that looks realistic and obeys the laws of physics.
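The tug-of-war above can be sketched as a bidirectional feature exchange repeated at several spatial resolutions. Again, this is a minimal illustration under assumed shapes; the fixed `alpha` stands in for whatever learned gating or attention the real model uses.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical features for the two "teams" at three resolutions:
# texture (the Painters) and geometry (the Architects).
scales = [(16, 16), (8, 8), (4, 4)]
tex = [rng.random((8, h, w)) for h, w in scales]
geo = [rng.random((8, h, w)) for h, w in scales]

def exchange(a, b, alpha=0.2):
    """Bidirectional interaction: each branch receives a small correction
    from the other (alpha is a stand-in for learned gating)."""
    a_new = a + alpha * b  # texture is constrained by geometry
    b_new = b + alpha * a  # geometry is refined by appearance cues
    return a_new, b_new

# Interact at EVERY scale, so coarse scene layout and fine detail
# stay mutually consistent.
for i in range(len(scales)):
    tex[i], geo[i] = exchange(tex[i], geo[i])

print([t.shape for t in tex])
```

The key design point is symmetry: neither branch merely supervises the other; both are updated, which is what keeps the "blueprints" and the "paint" in agreement.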

3. The Crystal Ball (Future Generation)

When you ask UniFuture to predict the future, it doesn't just guess random pixels. Because it understands the 3D structure and the physics of the scene, it can simulate what happens next.

  • Old AI: "I think the car will move forward... maybe it will turn into a bird? Who knows!" (It might look cool, but it's nonsense).
  • UniFuture: "The car is 10 meters away and pulling away from us at 3 meters per second. In 2 seconds, it will be 16 meters away. The road curves left, so the car will follow the curve."

It generates a 4D Point Cloud. Imagine taking a video of the future, but every pixel in that video also knows exactly how far away it is. You can take that video and spin it around in 3D space, and it will still look correct.
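Turning a predicted frame into that spinnable point cloud is standard pinhole-camera unprojection. The sketch below uses assumed toy intrinsics (`fx`, `fy`, `cx`, `cy`) and random stand-ins for the predicted RGB and depth; stacking such clouds over time gives the 4D (3D space + time) output described above.

```python
import numpy as np

# Toy pinhole intrinsics (assumed values, not from the paper).
H, W = 4, 6
fx = fy = 2.0
cx, cy = (W - 1) / 2, (H - 1) / 2

rng = np.random.default_rng(2)
depth = 1.0 + rng.random((H, W))  # predicted depth for one future frame
rgb = rng.random((H, W, 3))       # predicted RGB for the same frame

# Unproject every pixel (u, v, depth) into a 3D point: each pixel now
# "knows exactly how far away it is."
v, u = np.mgrid[0:H, 0:W]
X = (u - cx) * depth / fx
Y = (v - cy) * depth / fy
Z = depth
points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)  # (H*W, 3)
colors = rgb.reshape(-1, 3)                           # matching RGB per point

print(points.shape, colors.shape)  # (24, 3) (24, 3)
```

Because every point carries both a position and a color, the reconstructed frame can be re-rendered from a different viewpoint and still look geometrically correct.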

Why Does This Matter?

For self-driving cars, this is a game-changer.

  • Safety: If the car knows the true 3D distance of an obstacle, it won't crash into a "flat" hallucination.
  • Training: We can use UniFuture to create infinite, perfect training scenarios. We can tell it, "Simulate a rainy night where a child runs into the street," and it will generate a video that is not only visually real but physically accurate, helping real cars learn how to react.

In short: UniFuture is the first AI that doesn't just "watch" the future like a movie; it "lives" in the future like a simulation, understanding both the look and the physics of the world simultaneously.
