SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving

This paper proposes SToRM, a novel framework that employs a lightweight importance predictor, supervised training with pseudo-labels, and an anchor-context merging module to significantly reduce visual token redundancy in multi-modal LLMs for autonomous driving, achieving up to 30x computational savings while maintaining end-to-end performance comparable to using all tokens.

Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Hogun Park, Il Yong Chun

Published 2026-03-10

Imagine you are teaching a robot to drive a car. You want this robot to be able to look at the road, listen to your voice commands (like "turn left at the next red light"), and instantly decide how to steer and brake. This is called End-to-End Autonomous Driving.

To make the robot smart enough to handle weird situations (like a dog running into the street), researchers gave it a "brain" called a Multi-modal Large Language Model (MLLM). Think of this brain as a super-intelligent librarian who can read text, look at pictures, and understand context all at once.

The Problem: The Brain is Too Busy

Here's the catch: To understand the road, the robot takes hundreds of tiny snapshots (called visual tokens) from its cameras every second.

  • The Analogy: Imagine you are trying to read a book, but instead of reading one word at a time, you are forced to read every single letter, every space, and every punctuation mark on every page simultaneously.
  • The Result: The robot's brain gets overwhelmed. It has to process thousands of these tiny pieces of information at once. This makes the car slow to react (laggy) and requires a massive, expensive computer to run. In a real car, you need something fast and lightweight, not a supercomputer.

Previous attempts to fix this were like telling the robot, "Just ignore the boring parts." But the robot often ignored the wrong things, like a pedestrian in the distance, leading to unsafe driving.

The Solution: SToRM (The Smart Filter)

The authors of this paper propose a new system called SToRM (Supervised Token Reduction). Think of SToRM as a super-efficient personal assistant for the robot's brain.

Here is how SToRM works, using three simple steps:

1. The "Importance Predictor" (The Smart Scout)

Instead of reading every single letter, the robot needs a scout to tell it which letters matter most.

  • How it works: The researchers trained a tiny, lightweight "scout" module. This scout looks at the road for just a split second (a short time window) and asks, "What is happening right now that I need to pay attention to?"
  • The Magic: It doesn't guess randomly. It was trained by watching the "big brain" (the full model) work. It learned to mimic the big brain's attention. If the big brain stares at a stop sign, the scout learns to say, "Hey, that stop sign is important!"
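The scout can be pictured as a tiny scoring network that rates every visual token, after which only the top-rated tokens are kept. This is a minimal sketch, not the paper's exact architecture: the two-layer MLP, hidden size, and token dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ImportancePredictor(nn.Module):
    """Hypothetical lightweight 'scout' that scores each visual token."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) -> scores: (batch, num_tokens)
        return self.scorer(tokens).squeeze(-1)


def select_top_k(tokens: torch.Tensor, scores: torch.Tensor, k: int):
    """Keep only the k highest-scoring tokens per frame."""
    idx = scores.topk(k, dim=1).indices                          # (batch, k)
    idx_exp = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (batch, k, dim)
    return tokens.gather(1, idx_exp), idx


# Example: 576 visual tokens per frame reduced to 64
tokens = torch.randn(2, 576, 128)
predictor = ImportancePredictor(dim=128)
scores = predictor(tokens)
kept, kept_idx = select_top_k(tokens, scores, k=64)
print(kept.shape)  # torch.Size([2, 64, 128])
```

Because the scorer is only a small MLP, running it over all tokens is far cheaper than feeding those tokens through the full LLM.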

2. The "Anchor-Context" Merging (The Grouping Game)

Once the scout identifies the important things, the system organizes the information.

  • The Analogy: Imagine you are packing for a trip. You have a suitcase full of clothes (the visual tokens).
    • Anchors: These are your essentials (passport, wallet, keys). You keep these separate and safe.
    • Context: These are the extra socks, t-shirts, and accessories.
  • The Trick: Instead of throwing away the extra clothes, SToRM merges them into the essentials. It folds the socks inside the pockets of the jacket. You still have the information (the socks), but you don't need to carry a separate pile of them.
  • In the car: The robot keeps the "Anchors" (the pedestrian, the traffic light, the lane lines) and folds the "Context" (the texture of the road, the shadows, the sky) into them. This drastically reduces the number of items the brain has to process.

3. The "Teacher" (Pseudo-Supervision)

How did the scout learn to be so good?

  • The Analogy: Imagine a student (the scout) trying to learn how to grade essays. Instead of having a teacher grade every single essay from scratch, the student watches the teacher grade a few essays, learns the pattern of what the teacher thinks is important, and then practices grading on their own.
  • In the paper: The system runs a "practice round" where it processes everything (all tokens) to see what the big brain focuses on. It uses those results as a "cheat sheet" (pseudo-supervision) to train the lightweight scout. Now, the scout can do the job of the big brain but much faster.
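One simple way to realize this "cheat sheet" training is to threshold the full model's attention into 0/1 pseudo-labels and train the scout with a binary cross-entropy loss. The top-k labeling rule and the loss choice below are illustrative assumptions, not necessarily the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def pseudo_label_from_attention(attn: torch.Tensor, k: int) -> torch.Tensor:
    """Turn the full model's attention over visual tokens into 0/1 pseudo-labels.

    attn: (num_tokens,) attention mass the 'big brain' put on each token.
    Tokens in the top-k get label 1 (important), the rest get 0.
    """
    labels = torch.zeros_like(attn)
    labels[attn.topk(k).indices] = 1.0
    return labels


def scout_loss(scout_scores: torch.Tensor, pseudo_labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy: the scout learns to reproduce the cheat sheet."""
    return F.binary_cross_entropy_with_logits(scout_scores, pseudo_labels)


attn = torch.rand(576)                         # from a full-token "practice round"
labels = pseudo_label_from_attention(attn, k=64)
scores = torch.randn(576, requires_grad=True)  # stand-in for the scout's output
loss = scout_loss(scores, labels)
loss.backward()                                # gradients flow only to the scout
print(int(labels.sum()))  # 64
```

The expensive full-token pass happens only during training; at driving time, just the cheap scout runs.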

Why This Matters

  • Speed: The robot can now make decisions in real-time, even on a standard computer chip, because it's not drowning in data.
  • Safety: Unlike previous methods that just threw away data randomly, SToRM keeps the critical information. It's like a filter that removes the noise but keeps the signal.
  • Efficiency: The paper reports that SToRM cuts computation by up to 30x while driving just as well as the slow, heavy version that processes every token.

In a nutshell: SToRM teaches the self-driving car to be a smart editor. It doesn't just delete information; it learns what to keep, what to summarize, and how to combine the rest, so the car can drive safely and quickly without needing a supercomputer.