DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

DriveTok is an efficient 3D scene tokenizer that leverages vision-foundation features and 3D deformable cross-attention to produce a unified set of tokens for simultaneous multi-view reconstruction and understanding. It achieves strong performance on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction on the nuScenes dataset.

Dong Zhuo, Wenzhao Zheng, Sicheng Zuo, Siming Yan, Lu Hou, Jie Zhou, Jiwen Lu

Published 2026-03-20

Imagine you are trying to teach a robot to drive a car. To do this, you need to show the robot what the world looks like. Currently, most systems show the robot a pile of separate photos: one from the front, one from the left, one from the right, and so on.

The problem is that this is like giving a chef 100 separate photos of ingredients and asking them to cook a meal. The chef has to mentally stitch the photos together, figure out which apple is in front of which carrot, and guess how everything fits together in 3D space. It's messy, slow, and the chef might get confused about what's actually there.

DriveTok is a new invention that solves this by acting like a super-smart translator that turns those messy photos into a single, perfect 3D "Lego blueprint" of the world.

Here is how it works, broken down into simple concepts:

1. The Problem: Too Many Photos, Not Enough Understanding

Current self-driving cars take pictures from many cameras. Existing AI systems treat each picture as a separate 2D puzzle piece. They don't naturally understand that the "car" in the left camera photo is the same object as the "car" in the front camera photo. This makes it hard for the AI to build a true 3D map of the road.

2. The Solution: The "Universal Blueprint" (DriveTok)

DriveTok takes all those separate camera views and compresses them into a single, unified set of digital building blocks called "Scene Tokens."

Think of these tokens like a 3D Lego set of the entire street scene.

  • Unified: Instead of having 6 different piles of bricks (one for each camera), DriveTok builds one single model.
  • Resolution Agnostic: It doesn't matter if the camera is high-definition or low-definition; DriveTok still builds the same sturdy Lego model.
  • Rich Information: These Lego bricks don't just hold shape; they hold color (texture), meaning (is this a pedestrian or a tree?), and depth (how far away is it?).
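To make the "resolution agnostic" idea concrete, here is a minimal sketch in Python. The encoder is replaced by a stand-in, and the token budget (256 tokens of 128 dimensions) is an invented, illustrative number, not taken from the paper:

```python
import numpy as np

def tokenize(images):
    """Stand-in for a scene tokenizer: whatever the input resolution,
    the output is a fixed-size token set (hypothetical 256 x 128 budget)."""
    num_tokens, token_dim = 256, 128
    return np.zeros((num_tokens, token_dim))

# Six cameras at low resolution vs. six cameras at high resolution
low_res  = [np.zeros((225, 400, 3)) for _ in range(6)]
high_res = [np.zeros((900, 1600, 3)) for _ in range(6)]

# The scene representation has the same shape either way
print(tokenize(low_res).shape == tokenize(high_res).shape)  # True
```

The point is that downstream tasks always see the same compact interface, regardless of how the raw images were captured.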

3. How It Builds the Blueprint

The process happens in three main steps:

  • Step 1: The Smart Scanner (The Encoder)
    DriveTok first looks at the raw photos using a "foundation model" (a super-smart AI that already knows what cars, people, and roads look like). It then uses a special technique called "3D Deformable Attention."

    • Analogy: Imagine a spider web that can stretch and shrink. The web reaches out from a central 3D point and "snaps" onto the most important parts of the photos from all cameras simultaneously. It grabs the texture of the road and the shape of a car, merging them into one 3D point.
  • Step 2: The Truth Filter (Visibility-Guided Attention)
This is the secret sauce. In a real car, a left-facing camera simply cannot see things that are occluded behind objects on the other side of the scene.

    • Analogy: Imagine a security guard at a museum. If a visitor (a camera view) tries to look at an exhibit (a 3D point) that is blocked by a wall, the guard stops them. DriveTok uses a "visibility mask" to ensure the AI only connects camera views to 3D points that are actually visible. This prevents the AI from getting confused or hallucinating objects that aren't there.
  • Step 3: The Multi-Task Gym (Joint Training)
    To make sure the blueprint is perfect, DriveTok is trained to do four things at once:

    1. Reconstruct the Image: Can it redraw the original photo perfectly? (Tests texture).
    2. Predict Depth: Can it tell you exactly how far away a tree is? (Tests geometry).
    3. Identify Objects: Can it label the tree as a "tree" and the road as "road"? (Tests semantics).
    4. 3D Occupancy: Can it fill in the invisible 3D space around the car? (Tests spatial awareness).

    By doing all these tasks together, the "Lego bricks" (tokens) become incredibly smart. They know what things look like, what they are, and where they are in 3D space.
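Steps 1 and 2 can be sketched together in a toy numpy example. This is not the paper's implementation: the sampling locations, weights, and visibility mask are random stand-ins, and all shapes are illustrative. It only shows the core mechanic, namely that each 3D scene token gathers features from several cameras at a few sampled points, and that occluded camera views are zeroed out by the visibility mask before the attention weights are normalized:

```python
import numpy as np

rng = np.random.default_rng(0)

num_cams, H, W, C = 6, 4, 4, 8        # tiny per-camera feature maps (illustrative)
num_tokens, num_pts = 10, 4           # scene tokens, sampling points per camera

feats = rng.normal(size=(num_cams, H, W, C))                 # foundation-model features
sample_uv = rng.integers(0, 4, size=(num_tokens, num_cams, num_pts, 2))
visible = rng.random((num_tokens, num_cams)) > 0.5           # visibility mask (Step 2)
weights = rng.random((num_tokens, num_cams, num_pts))        # raw attention weights

tokens = np.zeros((num_tokens, C))
for t in range(num_tokens):
    w = weights[t] * visible[t][:, None]     # occluded cameras contribute nothing
    denom = w.sum()
    if denom == 0:
        continue                             # this token sees no camera at all
    w = w / denom                            # normalize over visible samples only
    for cam in range(num_cams):
        for p in range(num_pts):
            u, v = sample_uv[t, cam, p]
            tokens[t] += w[cam, p] * feats[cam, v, u]  # deformable sampling (Step 1)

print(tokens.shape)  # (10, 8): one fused feature vector per scene token
```

In Step 3, each fused token would then feed four task heads (reconstruction, depth, segmentation, occupancy), and the four losses are summed so that every head pulls the shared tokens toward a richer representation.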

4. Why This Matters

Once DriveTok creates this "Universal Blueprint," it can be used for many things:

  • Better Driving: The car understands the 3D world better, leading to safer decisions.
  • Future Prediction: Because the blueprint is so rich, the car can imagine "what if?" scenarios (e.g., "What if that pedestrian steps out?").
  • Efficiency: Instead of processing 6 huge photos, the AI only needs to process one compact set of tokens. It's like sending a compressed zip file instead of a giant hard drive.
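The "zip file" intuition can be checked with back-of-envelope arithmetic. The camera resolution below matches a nuScenes-style rig, but the token budget is an invented, illustrative number:

```python
# Raw multi-view pixels vs. a compact token set (token numbers are hypothetical)
cams, H, W = 6, 900, 1600            # six cameras at 900x1600 resolution
pixels = cams * H * W * 3            # raw RGB values per frame
num_tokens, token_dim = 256, 128     # hypothetical token budget
token_vals = num_tokens * token_dim  # values in the tokenized scene

print(pixels // token_vals)          # → 791: hundreds of times fewer values
```

Even if the real token budget differs, the gap between millions of raw pixel values and a few thousand token values is what makes downstream processing cheap.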

The Bottom Line

DriveTok is like a magic translator that turns a chaotic pile of 2D photos into a clean, organized, 3D mental map. It teaches the self-driving car to stop "looking at pictures" and start "understanding the world," making autonomous driving safer, smarter, and more efficient.
