DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving

Imagine you are teaching a brilliant, well-read robot how to drive a car. This robot is a Large Language Model (LLM): think of it as a super-smart librarian who has read every book in the world. It understands stories, knows traffic laws, and can describe a beautiful sunset perfectly.

However, there's a problem: The robot is terrible at math.

The Problem: The "Word-Counting" Robot

In a standard LLM, numbers are treated just like words. If the robot sees the number 3.14, it doesn't see a single quantity. Instead, it sees a string of separate "tokens" (like puzzle pieces), for example 3, ., 1, and 4.

To the robot, 3.14 is just a sequence of symbols, like the word "apple." It doesn't inherently understand that 3.14 is bigger than 3.05, or that 10.0 is exactly double 5.0. It's like asking a librarian to compare the weight of two books just by looking at their titles. They might guess, but they often get it wrong.

In autonomous driving, this is dangerous. If the robot misreads a speed of 3.14 m/s, or confuses which of two distances is larger, that small misunderstanding could lead to a crash. The robot needs to understand numbers as continuous quantities (like a smooth volume slider), not as broken-up text fragments.

The Solution: DriveCode

The paper introduces DriveCode, a new way to teach this robot to "feel" numbers.

Here is the analogy:

Old Way (Text Tokens): Imagine you are trying to tell the robot the speed of the car. You say, "The speed is three point one four." The robot has to piece these words together to guess the number. It's clunky and imprecise.
DriveCode Way (Continuous Embeddings): Instead of speaking in words, you hand the robot a special, smooth dial that is already set to exactly 3.14. You don't say the words; you just hand over the physical value.

How It Works (The "Translator" and the "Math Head")

The researchers built two special tools to make this happen:

The Number Projector (The Translator):
When the robot reads a prompt like "The car is going 50 mph," the system grabs the number 50 before it turns into a word. It runs it through a special translator (the projector) that turns the raw number into a "math language" the robot understands. This math language is then mixed in with the pictures and the text, so the robot sees the number as a real, physical value, not just a word.
The Number Head (The Math Head):
When the robot needs to answer, "What speed should I go?", it doesn't have to spell out the answer one digit at a time ("4", ".", "5"). Instead, it has a dedicated "Math Head" that can simply point to a number on a dial and say, "Go 4.5." It skips the step of breaking the number into pieces.

Why This Matters

Think of driving as a tightrope walk.

Without DriveCode: The robot is walking the tightrope while trying to count its steps by reading a book. It's slow, and it might trip because it miscounts a step.
With DriveCode: The robot has a built-in sense of balance. It feels the wind and the rope directly. It can make micro-adjustments instantly because it understands the exact value of its speed and steering angle.

The Results

The researchers tested this on three different driving datasets (like different driving schools).

Accuracy: The robot made fewer mistakes in predicting where the car should go and how fast it should drive.
Speed: Because the robot doesn't have to "spell out" numbers one digit at a time, it can make decisions faster. It's like the difference between writing a number by hand (slow) versus pressing a button that instantly displays the number (fast).

In a Nutshell

DriveCode is like giving a language genius a pair of glasses that lets them see numbers as real, physical objects rather than just words on a page. This allows AI cars to drive more safely, more precisely, and more like a human who intuitively understands speed and distance, rather than a computer that is just guessing based on spelling.

1. Problem Statement

While Large Language Models (LLMs) show promise for end-to-end autonomous driving by integrating perception, planning, and control, they suffer from a fundamental limitation: poor numerical reasoning.

Tokenization Issues: Standard LLMs treat numbers as discrete text tokens (e.g., "3.14" is split into "3", ".", "1", "4"). This approach fails to capture the continuous magnitude and positional significance of digits.
Consequences: This leads to errors in numerical comparisons (e.g., predicting $3.11 > 3.8$) and imprecise generation of control commands.
Safety Criticality: In autonomous driving, small numerical errors in speed, steering angles, or waypoints can propagate through the system, causing unstable trajectories or unsafe maneuvers. The mismatch between the LLM's token-level understanding and the physical world's continuous requirements creates a barrier to reliable deployment.
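
The mismatch between token order and numeric order can be shown without any model. The sketch below assumes a character-level tokenizer (real LLM tokenizers split numbers in other ways, and `char_tokenize` is a hypothetical helper); it only illustrates that comparing token sequences and comparing magnitudes can disagree.

```python
def char_tokenize(number_text: str) -> list[str]:
    """Hypothetical character-level tokenizer: one token per character."""
    return list(number_text)


a, b = "10.0", "9.5"

tokens_a = char_tokenize(a)  # ['1', '0', '.', '0']
tokens_b = char_tokenize(b)  # ['9', '.', '5']

# Comparing token sequences is lexicographic: '1' < '9', so "10.0" sorts first.
token_order_says_a_smaller = tokens_a < tokens_b

# Comparing magnitudes gives the opposite answer: 10.0 is larger than 9.5.
value_order_says_a_smaller = float(a) < float(b)

print(tokens_a, tokens_b)
print(token_order_says_a_smaller, value_order_says_a_smaller)
```

A model that only ever sees the token sequences has to learn the magnitude relation from data, which is the gap a continuous encoding is meant to close.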

2. Methodology: DriveCode

The authors propose DriveCode, a novel framework that treats numerical values as a dedicated modality rather than text. Instead of tokenizing numbers, DriveCode processes them as continuous vectors throughout the model's pipeline.

Core Architecture Components

Data Preprocessing:
- Raw text is scanned using regular expressions to identify meaningful physical quantities (speed, distance, time).
- These numbers are replaced with a unified special token <number_token>.
- The original numerical values are extracted into an ordered list that strictly aligns with the sequence of <number_token> placeholders.
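
The preprocessing step can be sketched roughly as follows. The regular expression and the helper name `extract_numbers` are assumptions made for illustration; the source does not give the actual pattern, and a real pipeline would need to restrict matches to meaningful physical quantities rather than every digit run.

```python
import re

NUMBER_TOKEN = "<number_token>"
# Assumed pattern: optional sign, digits, optional decimal part.
NUMBER_PATTERN = re.compile(r"[-+]?\d+(?:\.\d+)?")


def extract_numbers(text: str) -> tuple[str, list[float]]:
    """Replace each number with NUMBER_TOKEN and return the values in order."""
    values: list[float] = []

    def _swap(match: re.Match) -> str:
        # Matches are visited left to right, so the list stays aligned
        # with the order of placeholders in the masked text.
        values.append(float(match.group()))
        return NUMBER_TOKEN

    masked = NUMBER_PATTERN.sub(_swap, text)
    return masked, values


masked, values = extract_numbers("Speed is 3.14 m/s, turn -7.5 degrees in 2 s.")
print(masked)
print(values)
```

The k-th placeholder in `masked` corresponds to `values[k]`, which is the alignment property the later stages rely on.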
Number Projector (Input Side):
- A dedicated module (a two-layer MLP with GELU activation) maps each continuous number $x_k \in \mathbb{R}$ into the language model's hidden embedding space ($\mathbb{R}^d$).
- These numerical embeddings are inserted into the input sequence at the specific positions of the <number_token> placeholders, alongside visual features (from a Vision Transformer) and text embeddings.
- This allows the LLM to attend to numerical data as a continuous signal rather than a sequence of characters.
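
As a rough illustration of the input side, the following pure-Python sketch builds a two-layer MLP with GELU and splices its output into a token sequence. The layer sizes, the random initialization, and the helper `insert_number_embeddings` are assumptions; trained weights and the real embedding dimension would come from the model.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


class NumberProjector:
    """Two-layer MLP mapping one scalar to a d-dimensional embedding."""

    def __init__(self, hidden: int, d: int, seed: int = 0):
        rng = random.Random(seed)
        # Random weights stand in for trained parameters (assumption).
        self.w1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]  # 1 -> hidden
        self.b1 = [0.0] * hidden
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden)]
                   for _ in range(d)]                            # hidden -> d
        self.b2 = [0.0] * d

    def __call__(self, x: float) -> list[float]:
        h = [gelu(w * x + b) for w, b in zip(self.w1, self.b1)]
        return [sum(wi * hi for wi, hi in zip(row, h)) + b
                for row, b in zip(self.w2, self.b2)]


def insert_number_embeddings(tokens, token_embeddings, values, projector):
    """Use the projected value at each <number_token> slot, in order."""
    remaining = iter(values)
    return [projector(next(remaining)) if tok == "<number_token>" else emb
            for tok, emb in zip(tokens, token_embeddings)]


projector = NumberProjector(hidden=8, d=4)
tokens = ["speed", "is", "<number_token>"]
text_emb = [[0.0] * 4 for _ in tokens]  # placeholder text embeddings
seq = insert_number_embeddings(tokens, text_emb, [3.14], projector)
```

Because the projector is a continuous function of its input, nearby values such as 3.14 and 3.15 receive nearby embeddings, which is not guaranteed for unrelated token IDs.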
Number Head (Output Side):
- A parallel output head is introduced alongside the standard Language Model (LM) head.
- When the model predicts a <number_token>, the Number Head performs a regression task to output the continuous scalar value directly from the hidden state.
- This predicted number is then re-embedded via the Number Projector and fed back into the sequence for the next autoregressive step, so text and numbers are generated within a single decoding loop.
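
The output-side loop described above might look like the toy decoder below. Everything here is a stand-in: `step_fn` is a hypothetical model call, the `<eos>` token name is assumed, and the Number Head is reduced to a single linear layer for brevity.

```python
NUMBER_TOKEN = "<number_token>"


def number_head(hidden_state, weights, bias):
    """Linear regression head: hidden state -> one scalar."""
    return sum(w * h for w, h in zip(weights, hidden_state)) + bias


def decode(step_fn, embed_token, embed_number, head_w, head_b, max_steps=16):
    """Greedy decoding where numbers are regressed instead of spelled out.

    step_fn(inputs) is a hypothetical model call that returns
    (next_token, hidden_state) for the inputs generated so far.
    """
    inputs, output = [], []
    for _ in range(max_steps):
        token, hidden = step_fn(inputs)
        if token == "<eos>":  # assumed end-of-sequence marker
            break
        if token == NUMBER_TOKEN:
            value = number_head(hidden, head_w, head_b)
            output.append(value)
            inputs.append(embed_number(value))  # fed back via the projector
        else:
            output.append(token)
            inputs.append(embed_token(token))
    return output


# Toy model: emits a fixed script and a constant hidden state.
script = ["speed", NUMBER_TOKEN, "<eos>"]


def toy_step(inputs):
    return script[len(inputs)], [1.0, 2.0]


result = decode(toy_step,
                embed_token=lambda t: [0.0],
                embed_number=lambda v: [v],
                head_w=[0.5, 2.0], head_b=0.0)
print(result)
```

In this toy run the number 4.5 is produced in one step, where digit-by-digit generation would need several.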
Loss Function:
- The training objective combines standard Cross-Entropy loss for text tokens ($L_{text}$) and regression losses for numerical outputs ($L_{num}$).
- For scalar control signals (speed, angle), an $\ell_1$ loss is used.
- For trajectory waypoints, an $\ell_2$ distance loss is applied.
- Total Loss: $L = L_{text} + \lambda L_{num}$ (where $\lambda = 1$).
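
A minimal sketch of how these terms combine, assuming each loss is averaged over its elements (the source does not say whether the terms are summed or averaged) and that waypoints are 2D points:

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of the target index under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def l1_loss(pred, target):
    """Mean absolute error for scalar control signals."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_waypoint_loss(pred, target):
    """Mean Euclidean distance between predicted and target waypoints."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def total_loss(l_text, l_num, lam=1.0):
    """L = L_text + lambda * L_num, with lambda = 1 as in the summary."""
    return l_text + lam * l_num


# Illustrative values only, not taken from the paper.
l_text = cross_entropy([2.0, 0.0, 0.0], target=0)
l_speed = l1_loss([3.0, 1.0], [3.5, 1.0])
l_traj = l2_waypoint_loss([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
loss = total_loss(l_text, l_speed)
```

With $\lambda = 1$ the text and numeric terms contribute equally, so the relative scale of the regression targets directly affects how much each term drives training.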

3. Key Contributions

Dedicated Numerical Modality: The design of a Number Projector that maps continuous numbers into the LLM's hidden space, allowing them to be jointly processed with visual and textual features, bypassing discrete tokenization.
Direct Numerical Regression: The introduction of a Number Head that regresses numerical values directly from hidden states, enabling the model to output precise control signals in a single decoding step rather than generating digit-by-digit text.
Comprehensive Evaluation: Extensive experiments on three diverse autonomous driving datasets (OmniDrive, DriveGPT4, DriveGPT4-V2) demonstrating superior performance in both trajectory prediction and control signal generation compared to baselines.

4. Experimental Results

The authors evaluated DriveCode against baselines including ADAPT (action-aware captioning), DriveGPT4 (standard tokenization), and xVal (scaled embeddings).

Control Signal Prediction (DriveGPT4):
- DriveCode achieved the lowest RMSE for both Speed (1.08 m/s vs. 1.30 m/s for DriveGPT4) and Turning Angle (7.71° vs. 8.98°).
- It also improved threshold accuracy, particularly at moderate to large tolerances (e.g., $A_{5.0}$ reached 99.10% for speed).
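
For readers unfamiliar with the two metrics, here is a small sketch. It assumes $A_\tau$ means the fraction of samples whose absolute error is below the tolerance $\tau$, which the summary does not state explicitly; the sample values are made up for illustration.

```python
import math


def rmse(pred, target):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def threshold_accuracy(pred, target, tau):
    """Fraction of samples with absolute error below tau (assumed A_tau)."""
    hits = sum(1 for p, t in zip(pred, target) if abs(p - t) < tau)
    return hits / len(pred)


# Made-up speeds in m/s; absolute errors are 0.5, 0.0, 1.0 and 6.0.
pred = [10.0, 12.0, 8.0, 20.0]
target = [10.5, 12.0, 9.0, 14.0]

speed_rmse = rmse(pred, target)
a_1 = threshold_accuracy(pred, target, tau=1.0)
a_5 = threshold_accuracy(pred, target, tau=5.0)
print(round(speed_rmse, 3), a_1, a_5)
```

RMSE is dominated by the single large error, while the threshold accuracies show how many predictions fall inside each tolerance band, so the two metrics are complementary.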
Trajectory Prediction (OmniDrive & DriveGPT4-V2):
- On OmniDrive, DriveCode reduced the trajectory L2 error from 3.08m (Text baseline) to 2.83m.
- On DriveGPT4-V2, it achieved the lowest errors in Heading Angle, Point Distance, and Speed compared to xVal and Text baselines.
Efficiency:
- DriveCode reduces inference latency. By predicting a number in a single step rather than generating multiple tokens (e.g., "3", ".", "1", "4"), it lowers the total decoding steps.
- Average time per sample dropped from ~3.37 s (Text) to 3.18 s (DriveCode), a reduction of roughly 5.6%.
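
The step-count argument can be made concrete with a little arithmetic. The sketch assumes one token per character for text decoding, which real tokenizers only approximate, and uses illustrative values; only the two per-sample times come from the results above.

```python
def text_decoding_steps(values, decimals=2):
    """Steps needed if each character of each formatted number is one token."""
    return sum(len(f"{v:.{decimals}f}") for v in values)


def number_head_steps(values):
    """One <number_token> step per value."""
    return len(values)


values = [3.14, 12.50, 0.75]  # illustrative values, not from the paper
text_steps = text_decoding_steps(values)  # "3.14" + "12.50" + "0.75"
head_steps = number_head_steps(values)

# Relative latency reduction implied by the reported per-sample times.
reduction = (3.37 - 3.18) / 3.37
print(text_steps, head_steps, f"{reduction:.1%}")
```

The end-to-end gain is much smaller than the drop in numeric decoding steps, presumably because most of the per-sample time goes to work that does not depend on how numbers are decoded.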

5. Significance and Future Work

Significance: DriveCode addresses a critical gap in applying LLMs to safety-critical physical systems. By aligning the model's internal representation with the continuous nature of physical control signals, it enhances the reliability and precision of end-to-end autonomous driving systems. The results suggest that, on the evaluated benchmarks, treating numbers as a distinct modality works better than treating them as text.
Limitations: The method relies on accurate extraction and alignment of numbers; mismatches can introduce noise. Performance is also bounded by the base LLM's capabilities.
Future Work: The authors plan to test DriveCode in closed-loop simulation environments to assess real-time driving capabilities and explore multi-scale numerical representations to improve robustness against outliers and varying number scales.
