SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving

This paper proposes SToRM, a novel framework that employs a lightweight importance predictor, supervised training with pseudo-labels, and an anchor-context merging module to significantly reduce visual token redundancy in multi-modal LLMs for autonomous driving, achieving up to 30x computational savings while maintaining end-to-end performance comparable to using all tokens.

Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Hogun Park, Il Yong Chun

Published 2026-03-10

Imagine you are teaching a robot to drive a car. You want this robot to be able to look at the road, listen to your voice commands (like "turn left at the next red light"), and instantly decide how to steer and brake. This is called End-to-End Autonomous Driving.

To make the robot smart enough to handle weird situations (like a dog running into the street), researchers gave it a "brain" called a Multi-modal Large Language Model (MLLM). Think of this brain as a super-intelligent librarian who can read text, look at pictures, and understand context all at once.

The Problem: The Brain is Too Busy

Here's the catch: To understand the road, the robot takes hundreds of tiny snapshots (called visual tokens) from its cameras every second.

  • The Analogy: Imagine you are trying to read a book, but instead of reading one word at a time, you are forced to read every single letter, every space, and every punctuation mark on every page simultaneously.
  • The Result: The robot's brain gets overwhelmed. It has to process thousands of these tiny pieces of information at once. This makes the car slow to react (laggy) and requires a massive, expensive computer to run. In a real car, you need something fast and lightweight, not a supercomputer.

Previous attempts to fix this were like telling the robot, "Just ignore the boring parts." But the robot often ignored the wrong things, like a pedestrian in the distance, leading to unsafe driving.

The Solution: SToRM (The Smart Filter)

The authors of this paper propose a new system called SToRM (Supervised Token Reduction). Think of SToRM as a super-efficient personal assistant for the robot's brain.

Here is how SToRM works, using three simple steps:

1. The "Importance Predictor" (The Smart Scout)

Instead of reading every single letter, the robot needs a scout to tell it which letters matter most.

  • How it works: The researchers trained a tiny, lightweight "scout" module. This scout looks at the road for just a split second (a short time window) and asks, "What is happening right now that I need to pay attention to?"
  • The Magic: It doesn't guess randomly. It was trained by watching the "big brain" (the full model) work. It learned to mimic the big brain's attention. If the big brain stares at a stop sign, the scout learns to say, "Hey, that stop sign is important!"
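The scout can be pictured as a tiny scoring network that rates every visual token, after which only the top-rated tokens are kept. This is a minimal sketch, not the paper's exact architecture: the two-layer MLP, hidden size, and token dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ImportancePredictor(nn.Module):
    """Hypothetical lightweight 'scout' that scores each visual token."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) -> scores: (batch, num_tokens)
        return self.scorer(tokens).squeeze(-1)


def select_top_k(tokens: torch.Tensor, scores: torch.Tensor, k: int):
    """Keep only the k highest-scoring tokens per frame."""
    idx = scores.topk(k, dim=1).indices                          # (batch, k)
    idx_exp = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (batch, k, dim)
    return tokens.gather(1, idx_exp), idx


# Example: 576 visual tokens per frame reduced to 64
tokens = torch.randn(2, 576, 128)
predictor = ImportancePredictor(dim=128)
scores = predictor(tokens)
kept, kept_idx = select_top_k(tokens, scores, k=64)
print(kept.shape)  # torch.Size([2, 64, 128])
```

Because the scorer is only a small MLP, running it over all tokens is far cheaper than feeding those tokens through the full LLM.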

2. The "Anchor-Context" Merging (The Grouping Game)

Once the scout identifies the important things, the system organizes the information.

  • The Analogy: Imagine you are packing for a trip. You have a suitcase full of clothes (the visual tokens).
    • Anchors: These are your essentials (passport, wallet, keys). You keep these separate and safe.
    • Context: These are the extra socks, t-shirts, and accessories.
  • The Trick: Instead of throwing away the extra clothes, SToRM merges them into the essentials. It folds the socks inside the pockets of the jacket. You still have the information (the socks), but you don't need to carry a separate pile of them.
  • In the car: The robot keeps the "Anchors" (the pedestrian, the traffic light, the lane lines) and folds the "Context" (the texture of the road, the shadows, the sky) into them. This drastically reduces the number of items the brain has to process.

3. The "Teacher" (Pseudo-Supervision)

How did the scout learn to be so good?

  • The Analogy: Imagine a student (the scout) trying to learn how to grade essays. Instead of having a teacher grade every single essay from scratch, the student watches the teacher grade a few essays, learns the pattern of what the teacher thinks is important, and then practices grading on their own.
  • In the paper: The system runs a "practice round" where it processes everything (all tokens) to see what the big brain focuses on. It uses those results as a "cheat sheet" (pseudo-supervision) to train the lightweight scout. Now, the scout can do the job of the big brain but much faster.
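One simple way to realize this "cheat sheet" training is to threshold the full model's attention into 0/1 pseudo-labels and train the scout with a binary cross-entropy loss. The top-k labeling rule and the loss choice below are illustrative assumptions, not necessarily the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def pseudo_label_from_attention(attn: torch.Tensor, k: int) -> torch.Tensor:
    """Turn the full model's attention over visual tokens into 0/1 pseudo-labels.

    attn: (num_tokens,) attention mass the 'big brain' put on each token.
    Tokens in the top-k get label 1 (important), the rest get 0.
    """
    labels = torch.zeros_like(attn)
    labels[attn.topk(k).indices] = 1.0
    return labels


def scout_loss(scout_scores: torch.Tensor, pseudo_labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy: the scout learns to reproduce the cheat sheet."""
    return F.binary_cross_entropy_with_logits(scout_scores, pseudo_labels)


attn = torch.rand(576)                         # from a full-token "practice round"
labels = pseudo_label_from_attention(attn, k=64)
scores = torch.randn(576, requires_grad=True)  # stand-in for the scout's output
loss = scout_loss(scores, labels)
loss.backward()                                # gradients flow only to the scout
print(int(labels.sum()))  # 64
```

The expensive full-token pass happens only during training; at driving time, just the cheap scout runs.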

Why This Matters

  • Speed: The robot can now make decisions in real-time, even on a standard computer chip, because it's not drowning in data.
  • Safety: Unlike previous methods that just threw away data randomly, SToRM keeps the critical information. It's like a filter that removes the noise but keeps the signal.
  • Efficiency: The paper reports that SToRM cuts computation by up to 30x while driving just as well as the slow, heavy version that processes every token.

In a nutshell: SToRM teaches the self-driving car to be a smart editor. It doesn't just delete information; it learns what to keep, what to summarize, and how to combine the rest, so the car can drive safely and quickly without needing a supercomputer.