DAP: A Discrete-token Autoregressive Planner for Autonomous Driving

DAP is a compact, discrete-token autoregressive planner that jointly forecasts BEV semantics and ego trajectories with reinforcement learning fine-tuning, achieving state-of-the-art performance on autonomous driving benchmarks despite a limited 160M parameter budget.

Bowen Ye, Bin Zhang, Hang Zhao

Published 2026-03-06

Imagine you are teaching a robot to drive a car. For a long time, the best way to do this was to show the robot thousands of hours of human driving videos and say, "Copy exactly what the human did." This is called Imitation Learning.

But here's the problem: If the robot just copies the human, it doesn't really understand the road. If the human makes a tiny mistake, the robot copies it. If the human gets distracted, the robot gets distracted. It's like a student who memorizes the answers to a math test but doesn't understand the math. If the test changes slightly, the student fails.

The paper introduces DAP (Discrete-token Autoregressive Planner), a new way to teach self-driving cars that is smarter, more efficient, and more like how a human brain actually thinks.

Here is the breakdown of how DAP works, using simple analogies:

1. The "Discrete Token" Idea: Turning the World into Lego Bricks

Most self-driving AI tries to understand the world as a continuous, blurry stream of pixels (like a high-definition video). This is heavy and hard to process.

DAP's approach: It turns the entire driving scene into a set of Lego bricks (called "discrete tokens").

  • Instead of seeing a smooth curve of a road, it sees "Block A," "Block B," "Block C."
  • Instead of a smooth steering angle, it sees "Turn Left a little," "Go Straight," "Turn Right a little."

Why is this cool? Just as Large Language Models turn words into tokens to write stories, DAP turns the driving world into tokens to "write" a driving plan. This makes the math much simpler and allows the model to learn faster and scale up easily.
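To make the "Lego brick" idea concrete, here is a minimal sketch of discretizing a continuous steering angle into token IDs. The bin count, steering range, and function names are illustrative assumptions, not the paper's actual tokenizer:

```python
# Illustrative sketch: map a continuous steering angle to a discrete token.
# num_bins and max_angle are made-up values, not from the DAP paper.

def tokenize_steering(angle_deg: float, num_bins: int = 7, max_angle: float = 35.0) -> int:
    """Map a continuous steering angle to one of num_bins token IDs."""
    # Clamp to the physical steering range.
    angle = max(-max_angle, min(max_angle, angle_deg))
    # Normalize to [0, 1] and bucket into equal-width bins.
    frac = (angle + max_angle) / (2 * max_angle)
    return min(int(frac * num_bins), num_bins - 1)

def detokenize_steering(token: int, num_bins: int = 7, max_angle: float = 35.0) -> float:
    """Recover the bin-center angle for a token (a lossy inverse)."""
    width = 2 * max_angle / num_bins
    return -max_angle + (token + 0.5) * width
```

With 7 bins, straight ahead (0°) lands in the middle bin (token 3), and detokenizing returns the bin center, so a little precision is traded for a vocabulary the model can predict like words.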

2. The "Autoregressive" Part: Reading the Future One Word at a Time

There are two ways to predict the future:

  • The Old Way (Non-Autoregressive): The AI looks at the current scene and tries to spit out the entire next 5 seconds of driving in one giant burst. It's like trying to guess the ending of a movie before you've seen the middle.
  • The DAP Way (Autoregressive): The AI predicts the next split second, then uses that prediction to guess the next split second, and so on. It's like reading a book one word at a time. You read word 1, then use that to understand word 2, then word 3.

The Benefit: This creates a chain of logic. If the car predicts a pedestrian is about to step out in the next second, it can use that information to decide how to brake in the second after that. It builds a story of the future, step-by-step.
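The step-by-step chain above can be sketched as a toy autoregressive loop. Here `next_token` is a stand-in for the real transformer forward pass; its rule and the 10-token vocabulary are invented for illustration:

```python
# Toy autoregressive rollout: each predicted token is appended to the
# history and conditions the next prediction, one step at a time.

def next_token(history):
    """Placeholder for a transformer forward pass. Here: a trivial
    deterministic rule over a hypothetical 10-token vocabulary."""
    return (history[-1] + 1) % 10

def rollout(seed_token, steps):
    tokens = [seed_token]
    for _ in range(steps):
        # The model sees everything predicted so far, like reading a
        # book one word at a time.
        tokens.append(next_token(tokens))
    return tokens
```

The key property is the feedback: prediction t becomes part of the input for prediction t+1, which is exactly how the planner can react to its own forecast of a pedestrian before planning the braking step.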

3. The Secret Sauce: "World Modeling" (Predicting the Scene AND the Car)

This is the biggest innovation. Most planners only predict: "Where will my car go?"

DAP predicts two things simultaneously:

  1. Where the car will go.
  2. How the world around the car will change.

The Analogy: Imagine you are playing chess.

  • A bad player only thinks: "I will move my knight here."
  • A grandmaster thinks: "I will move my knight here, AND I predict my opponent will move their pawn there, AND then I will move my bishop."

DAP does this for driving. It predicts the future traffic, the pedestrians, and the road signs (the "World") at the same time as it plans its own moves. By forcing the AI to predict the environment, it learns why it needs to move the way it does. It's no longer just copying; it's understanding cause and effect.
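One simple way to let a single autoregressive model predict both the world and the ego car is to interleave scene tokens and action tokens into one stream. This is a hedged sketch of that idea; the actual sequence layout in DAP may differ:

```python
# Sketch: weave BEV scene tokens and ego action tokens into one sequence,
# so one autoregressive model forecasts both. Token values are made up.

def interleave(scene_tokens, action_tokens):
    """Build [ (scene, s0), (action, a0), (scene, s1), (action, a1), ... ]."""
    assert len(scene_tokens) == len(action_tokens)
    stream = []
    for s, a in zip(scene_tokens, action_tokens):
        stream.append(("scene", s))   # what the world looks like at step t
        stream.append(("action", a))  # what the ego car does at step t
    return stream
```

Because scene and action tokens share one sequence, predicting the next action is always conditioned on the freshly predicted scene, which is the "chess grandmaster" behavior the analogy describes.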

4. The "Coach" (Reinforcement Learning)

The paper describes a second training stage called SAC-BC (Soft Actor-Critic combined with a Behavior-Cloning term), which fine-tunes the imitation-trained planner with reinforcement learning.

  • Stage 1 (Imitation): The robot watches a human drive and learns the basics.
  • Stage 2 (The Coach): The robot drives in a simulator and gets a "score."
    • Did you hit a wall? Bad score.
    • Did you drive smoothly? Good score.
    • Did you stay in the lane? Good score.

Even if the human driver made a risky move, the "Coach" tells the robot, "Actually, that was dangerous. Don't do that." This teaches the robot to be safer than the human it was copying.
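The "score" the coach hands out can be pictured as a simple reward function. The terms and weights below are illustrative assumptions in the spirit of simulator-based fine-tuning, not the paper's actual reward:

```python
# Toy driving reward: punish collisions hard, and gently penalize
# jerky motion and drifting out of the lane center. All weights are
# invented for illustration.

def reward(collided: bool, jerk: float, lane_offset_m: float) -> float:
    r = 0.0
    if collided:
        r -= 10.0                      # safety: hard collision penalty
    r -= 0.5 * abs(jerk)               # comfort: abrupt accel changes
    r -= 1.0 * abs(lane_offset_m)      # lane-keeping: distance from center
    return r
```

A risky human maneuver that ends in a scrape scores badly here even though the imitation data contained it, which is how the RL stage can push the planner to be safer than its demonstrations.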

5. Why is this a Big Deal? (Efficiency)

Usually, to get a robot to be this smart, you need a massive computer brain (like a supercomputer with billions of parameters).

DAP is tiny.

  • It uses only 160 million parameters.
  • For context, many other state-of-the-art models use billions of parameters.
  • The Result: DAP is as smart as the giants but runs on a much smaller, cheaper, and faster computer. It's like fitting a Ferrari engine into a compact car.

Summary

DAP is a self-driving planner that:

  1. Simplifies the world into Lego-like blocks (Tokens).
  2. Reads the future step-by-step (Autoregressive).
  3. Predicts the environment along with its own moves (World Modeling), so it understands why it's driving.
  4. Gets coached by a reward system to be safer than the humans it learned from.
  5. Does all this with a tiny, efficient brain that doesn't need a supercomputer.

It's a shift from "blindly copying" to "understanding and predicting," making self-driving cars that are not just smart, but also safe and efficient.