DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

DriveTok is an efficient 3D scene tokenizer that leverages vision-foundation features and 3D deformable cross-attention to produce a unified set of tokens for simultaneous multi-view reconstruction and understanding. It achieves strong performance on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction on the nuScenes dataset.

Dong Zhuo, Wenzhao Zheng, Sicheng Zuo, Siming Yan, Lu Hou, Jie Zhou, Jiwen Lu

Published 2026-03-20

Imagine you are trying to teach a robot to drive a car. To do this, you need to show the robot what the world looks like. Currently, most systems show the robot a pile of separate photos: one from the front, one from the left, one from the right, and so on.

The problem is that this is like giving a chef 100 separate photos of ingredients and asking them to cook a meal. The chef has to mentally stitch the photos together, figure out which apple is in front of which carrot, and guess how everything fits together in 3D space. It's messy, slow, and the chef might get confused about what's actually there.

DriveTok is a new invention that solves this by acting like a super-smart translator that turns those messy photos into a single, perfect 3D "Lego blueprint" of the world.

Here is how it works, broken down into simple concepts:

1. The Problem: Too Many Photos, Not Enough Understanding

Current self-driving cars take pictures from many cameras. Existing AI systems treat each picture as a separate 2D puzzle piece. They don't naturally understand that the "car" in the left camera photo is the same object as the "car" in the front camera photo. This makes it hard for the AI to build a true 3D map of the road.

2. The Solution: The "Universal Blueprint" (DriveTok)

DriveTok takes all those separate camera views and compresses them into a single, unified set of digital building blocks called "Scene Tokens."

Think of these tokens like a 3D Lego set of the entire street scene.

  • Unified: Instead of having 6 different piles of bricks (one for each camera), DriveTok builds one single model.
  • Resolution Agnostic: It doesn't matter if the camera is high-definition or low-definition; DriveTok still builds the same sturdy Lego model.
  • Rich Information: These Lego bricks don't just hold shape; they hold color (texture), meaning (is this a pedestrian or a tree?), and depth (how far away is it?).
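To make the "resolution agnostic" idea concrete, here is a minimal sketch in Python. The encoder is replaced by a stand-in, and the token budget (256 tokens of 128 dimensions) is an invented, illustrative number, not taken from the paper:

```python
import numpy as np

def tokenize(images):
    """Stand-in for a scene tokenizer: whatever the input resolution,
    the output is a fixed-size token set (hypothetical 256 x 128 budget)."""
    num_tokens, token_dim = 256, 128
    return np.zeros((num_tokens, token_dim))

# Six cameras at low resolution vs. six cameras at high resolution
low_res  = [np.zeros((225, 400, 3)) for _ in range(6)]
high_res = [np.zeros((900, 1600, 3)) for _ in range(6)]

# The scene representation has the same shape either way
print(tokenize(low_res).shape == tokenize(high_res).shape)  # True
```

The point is that downstream tasks always see the same compact interface, regardless of how the raw images were captured.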

3. How It Builds the Blueprint

The process happens in three main steps:

  • Step 1: The Smart Scanner (The Encoder)
    DriveTok first looks at the raw photos using a "foundation model" (a super-smart AI that already knows what cars, people, and roads look like). It then uses a special technique called "3D Deformable Attention."

    • Analogy: Imagine a spider web that can stretch and shrink. The web reaches out from a central 3D point and "snaps" onto the most important parts of the photos from all cameras simultaneously. It grabs the texture of the road and the shape of a car, merging them into one 3D point.
  • Step 2: The Truth Filter (Visibility-Guided Attention)
This is the secret sauce. In a real car, a left-facing camera simply cannot see things that are occluded behind objects on the other side of the scene.

    • Analogy: Imagine a security guard at a museum. If a visitor (a camera view) tries to look at an exhibit (a 3D point) that is blocked by a wall, the guard stops them. DriveTok uses a "visibility mask" to ensure the AI only connects camera views to 3D points that are actually visible. This prevents the AI from getting confused or hallucinating objects that aren't there.
  • Step 3: The Multi-Task Gym (Joint Training)
    To make sure the blueprint is perfect, DriveTok is trained to do four things at once:

    1. Reconstruct the Image: Can it redraw the original photo perfectly? (Tests texture).
    2. Predict Depth: Can it tell you exactly how far away a tree is? (Tests geometry).
    3. Identify Objects: Can it label the tree as a "tree" and the road as "road"? (Tests semantics).
    4. 3D Occupancy: Can it fill in the invisible 3D space around the car? (Tests spatial awareness).

    By doing all these tasks together, the "Lego bricks" (tokens) become incredibly smart. They know what things look like, what they are, and where they are in 3D space.
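Steps 1 and 2 can be sketched together in a toy numpy example. This is not the paper's implementation: the sampling locations, weights, and visibility mask are random stand-ins, and all shapes are illustrative. It only shows the core mechanic, namely that each 3D scene token gathers features from several cameras at a few sampled points, and that occluded camera views are zeroed out by the visibility mask before the attention weights are normalized:

```python
import numpy as np

rng = np.random.default_rng(0)

num_cams, H, W, C = 6, 4, 4, 8        # tiny per-camera feature maps (illustrative)
num_tokens, num_pts = 10, 4           # scene tokens, sampling points per camera

feats = rng.normal(size=(num_cams, H, W, C))                 # foundation-model features
sample_uv = rng.integers(0, 4, size=(num_tokens, num_cams, num_pts, 2))
visible = rng.random((num_tokens, num_cams)) > 0.5           # visibility mask (Step 2)
weights = rng.random((num_tokens, num_cams, num_pts))        # raw attention weights

tokens = np.zeros((num_tokens, C))
for t in range(num_tokens):
    w = weights[t] * visible[t][:, None]     # occluded cameras contribute nothing
    denom = w.sum()
    if denom == 0:
        continue                             # this token sees no camera at all
    w = w / denom                            # normalize over visible samples only
    for cam in range(num_cams):
        for p in range(num_pts):
            u, v = sample_uv[t, cam, p]
            tokens[t] += w[cam, p] * feats[cam, v, u]  # deformable sampling (Step 1)

print(tokens.shape)  # (10, 8): one fused feature vector per scene token
```

In Step 3, each fused token would then feed four task heads (reconstruction, depth, segmentation, occupancy), and the four losses are summed so that every head pulls the shared tokens toward a richer representation.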

4. Why This Matters

Once DriveTok creates this "Universal Blueprint," it can be used for many things:

  • Better Driving: The car understands the 3D world better, leading to safer decisions.
  • Future Prediction: Because the blueprint is so rich, the car can imagine "what if?" scenarios (e.g., "What if that pedestrian steps out?").
  • Efficiency: Instead of processing 6 huge photos, the AI only needs to process one compact set of tokens. It's like sending a compressed zip file instead of a giant hard drive.
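The "zip file" intuition can be checked with back-of-envelope arithmetic. The camera resolution below matches a nuScenes-style rig, but the token budget is an invented, illustrative number:

```python
# Raw multi-view pixels vs. a compact token set (token numbers are hypothetical)
cams, H, W = 6, 900, 1600            # six cameras at 900x1600 resolution
pixels = cams * H * W * 3            # raw RGB values per frame
num_tokens, token_dim = 256, 128     # hypothetical token budget
token_vals = num_tokens * token_dim  # values in the tokenized scene

print(pixels // token_vals)          # → 791: hundreds of times fewer values
```

Even if the real token budget differs, the gap between millions of raw pixel values and a few thousand token values is what makes downstream processing cheap.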

The Bottom Line

DriveTok is like a magic translator that turns a chaotic pile of 2D photos into a clean, organized, 3D mental map. It teaches the self-driving car to stop "looking at pictures" and start "understanding the world," making autonomous driving safer, smarter, and more efficient.
