DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving

DriveMamba proposes a task-centric, scalable state space model for efficient end-to-end autonomous driving that replaces the sequential Transformer-based paradigm with a unified Mamba decoder featuring linear-complexity operators and bidirectional trajectory-guided scanning to overcome information loss, cumulative errors, and computational inefficiencies in handling spatiotemporal inputs.

Haisheng Su, Wei Wu, Feixiang Song, Junjie Zhang, Zhenjie Yang, Junchi Yan

Published 2026-02-25

Imagine you are teaching a brand new driver how to navigate a busy city.

The Old Way (The "Assembly Line" Problem)
Most current self-driving systems work like a rigid factory assembly line.

  1. Station 1 (Perception): A robot looks out the window and says, "I see a red car."
  2. Station 2 (Prediction): A second robot takes that note and says, "The red car is moving left."
  3. Station 3 (Planning): A third robot takes that info and says, "Okay, I'll turn right."

The problem? If the first robot makes a tiny mistake (like misjudging the red car's speed), that error gets passed down the line, gets magnified, and the final decision might be dangerous. Also, this "assembly line" is slow because every station has to wait for the previous one to finish. It's like trying to run a marathon while waiting for a slow friend to tie their shoe at every mile marker.

The New Way: DriveMamba (The "Super-Organized Brain")
The authors of the DriveMamba paper propose a different approach. Instead of an assembly line, imagine a super-organized brain that looks at everything at once and handles perception, prediction, and planning together in a single pass.

Here is how it works, using simple analogies:

1. The "Token" Menu (No More Heavy Lifting)

Old systems try to build a dense, top-down grid map of the car's entire surroundings (called a "Dense BEV", short for bird's-eye view) before making a decision. It's like trying to draw the entire city of New York in perfect detail before deciding where to turn. It takes too much time and memory.

DriveMamba is smarter. It treats every piece of information (a car, a pedestrian, a lane line) as a simple "token" or a sticky note.

  • Instead of drawing the whole city, it just writes down: "Car at 5 o'clock," "Lane at 12 o'clock," "Myself at center."
  • It organizes these sticky notes into a neat list based on where they are in space and time. This makes the system incredibly fast and lightweight.
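To make the sticky-note idea concrete, here is a minimal sketch of a token list ordered by time and then by position. The `Token` fields, the coordinate convention, and the `order_tokens` helper are all hypothetical illustrations, not names or rules taken from the paper.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Token:
    kind: str   # what the note describes, e.g. "car", "lane", "ego"
    x: float    # metres to the right of the ego vehicle (assumed convention)
    y: float    # metres ahead of the ego vehicle (assumed convention)
    t: int      # frame index, 0 = oldest frame


def order_tokens(tokens):
    """Sort by time first, then by distance from the ego vehicle."""
    return sorted(tokens, key=lambda tok: (tok.t, math.hypot(tok.x, tok.y)))


tokens = [
    Token("car", 3.0, -4.0, 1),
    Token("lane", 0.0, 10.0, 0),
    Token("ego", 0.0, 0.0, 1),
    Token("pedestrian", -2.0, 6.0, 0),
]

ordered = order_tokens(tokens)
print([tok.kind for tok in ordered])
# prints ['pedestrian', 'lane', 'ego', 'car']
```

Four short notes stand in for the whole scene here, which is the point of the analogy: the list stays small no matter how large the surrounding city is.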

2. The "Mamba" (The Efficient Reader)

The core of this system is a new type of AI engine called Mamba.

  • The Old Engine (Transformer): Imagine reading a book where, to understand each sentence, you have to compare it against every other sentence to see how they connect. Double the length of the book and the work roughly quadruples (quadratic complexity). This is why Transformer-based driving systems get bogged down as the scene gets more complex.
  • The Mamba Engine: Imagine a reader who keeps a compact running memory. They read the story once, front to back, updating that memory as they go, and understand the context without re-reading anything. The work grows in a straight line with the length of the story (linear complexity).
  • The Result: When a driving scene gets longer or more complex (like a 20-second video of traffic), DriveMamba's cost grows only in proportion to the amount of input, rather than ballooning the way a Transformer's does.
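The difference between the two engines can be sketched with a toy example. The recurrence below is a fixed-parameter, scalar state space model, which is an assumption made for illustration: the real Mamba uses learned, input-dependent parameters and a hardware-friendly parallel scan. The function names and the values of `a`, `b`, and `c` are hypothetical.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Toy scalar recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    Returns the outputs and the number of update steps taken.
    """
    h = 0.0          # the compact running memory
    ys = []
    steps = 0
    for x in xs:     # one pass over the sequence, one update per token
        h = a * h + b * x
        ys.append(c * h)
        steps += 1
    return ys, steps


def pairwise_interactions(n):
    """Number of token pairs that full self-attention scores for n tokens."""
    return n * n


for n in (100, 1000):
    _, steps = ssm_scan([1.0] * n)
    print(n, steps, pairwise_interactions(n))
# prints:
# 100 100 10000
# 1000 1000 1000000
```

Ten times more tokens means ten times more scan steps, but a hundred times more attention pairs.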

3. The "Trajectory-Guided" Scan (The Driver's Gaze)

This is the paper's "secret sauce."

  • Old systems look at the whole world equally. They stare at a parked car on the other side of the street just as hard as the car directly in front of them.
  • DriveMamba mimics a human driver's eyes. It uses a "Local-to-Global" scan.
    • First, it focuses intensely on the immediate path (the "Local" part).
    • Then, it expands its view to the surroundings (the "Global" part).
    • Crucially, it follows a predicted path (a "trajectory"). If the car plans to turn left, the system prioritizes the left lane and the cars there over less relevant things on the right. It's like a spotlight following your intended path.
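One way to picture the spotlight is to order the tokens by how close they sit to the planned path, then read that list in both directions. This is only a sketch under stated assumptions: the distance-to-nearest-waypoint rule, the helper names, and the coordinates are hypothetical, and the paper's actual scanning scheme may be defined differently.

```python
import math


def distance_to_trajectory(point, trajectory):
    """Smallest distance from a point to any waypoint of the planned path."""
    return min(math.dist(point, waypoint) for waypoint in trajectory)


def local_to_global_order(tokens, trajectory):
    """Token names sorted from nearest to farthest from the trajectory."""
    return sorted(
        tokens,
        key=lambda name: distance_to_trajectory(tokens[name], trajectory),
    )


# Hypothetical left-turn plan: (x, y) waypoints in metres, ego at the origin.
trajectory = [(0.0, 0.0), (-1.0, 5.0), (-3.0, 10.0)]

tokens = {
    "lead_car": (0.0, 4.0),
    "left_lane_car": (-3.0, 9.0),
    "parked_car_right": (8.0, 6.0),
}

forward = local_to_global_order(tokens, trajectory)   # local -> global
backward = forward[::-1]                              # global -> local

print(forward)
# prints ['left_lane_car', 'lead_car', 'parked_car_right']
```

The car parked on the right still appears in the list, but it comes last in the forward pass, while the cars near the left-turn path come first.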

4. The "Unified" Brain (No More Silos)

In the old assembly line, the "Perception" team and the "Planning" team rarely talk.
In DriveMamba, everything happens in one single room.

  • The system learns how a pedestrian's movement (Prediction) directly affects where the car should steer (Planning) simultaneously.
  • It realizes, "Oh, that pedestrian is stepping out, so I need to slow down now," without waiting for a separate step to finish. This creates a much smoother, safer, and more reactive driving style.

Why Does This Matter?

  • Speed: It is reported to run about 10 times faster than some of the best previous systems. Decisions that arrive within milliseconds are crucial for avoiding accidents.
  • Efficiency: It needs much less computation and memory, meaning it could eventually run on the standard computer inside your car, not just a supercomputer.
  • Safety: By looking at the whole picture at once and prioritizing what matters (the path ahead), it makes fewer mistakes and handles complex, chaotic traffic better.

In a nutshell:
If previous self-driving cars were like a slow, clunky assembly line that got confused by long traffic jams, DriveMamba is like a highly focused, lightning-fast driver who knows exactly where to look, remembers everything important, and makes split-second decisions without getting overwhelmed. It's a giant leap toward making self-driving cars that are as safe and efficient as a human expert driver.
