DRIFT: Dual-Representation Inter-Fusion Transformer for Automated Driving Perception with 4D Radar Point Clouds

This paper introduces DRIFT, a dual-path Transformer model that effectively fuses fine-grained local and coarse-grained global features from sparse 4D radar point clouds to achieve state-of-the-art performance in automated driving perception tasks like object detection and free road estimation.

Siqi Pei, Andras Palffy, Dariu M. Gavrila

Published Wed, 11 Ma

Imagine you are trying to navigate a car through a heavy fog. You have two tools to help you see:

  1. A Super-Sharp 3D Scanner (LiDAR): It gives you a crystal-clear, high-definition 3D map of everything around you. But it's expensive, and if it rains or gets foggy, it gets confused.
  2. A Weather-Proof Radar: It's cheap, works great in rain and fog, and even tells you how fast things are moving. But, it's like looking at the world through a very low-resolution screen. The image is "pixelated" and full of gaps. You can see a big truck, but a pedestrian might just look like a single, blurry dot.

The Problem:
For a self-driving car to be safe, it needs to understand the scene perfectly. With the radar, it's hard to tell if that single blurry dot is a person, a sign, or just a random speck of noise. If you only look at that one dot (local view), you can't be sure. You need to step back and look at the whole picture (global view) to understand the context.

The Solution: DRIFT
The paper introduces a new AI model called DRIFT (Dual-Representation Inter-Fusion Transformer). Think of DRIFT as a two-person detective team working together to solve the mystery of what's on the road.

The Two Detectives

Instead of using just one way to look at the data, DRIFT uses two different "lenses" simultaneously:

  1. The "Microscope" Detective (The Point Path):

    • This detective looks at the raw, individual radar dots.
    • Superpower: It sees the tiny details, like the exact shape of a dot or its speed.
    • Weakness: It's so focused on the small details that it gets lost. It doesn't know where it is in the big picture.
  2. The "Satellite" Detective (The Pillar Path):

    • This detective organizes the dots into a grid (like a map with squares). It looks at groups of dots together.
    • Superpower: It sees the big picture. It understands the layout of the road, the drivable areas, and the general flow of traffic.
    • Weakness: It's too zoomed out. It might miss the small details that distinguish a person from a sign.
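The two views above can be sketched in a few lines. This is an illustrative toy, not the paper's code: the point path simply keeps one feature vector per radar point, while the pillar path bins the same points into a coarse 2D grid and averages them per cell. All shapes, cell sizes, and feature choices here are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 radar points: (x, y, z, doppler_speed) -- 4D radar adds velocity per point
points = rng.uniform(low=[0, -20, 0, -5], high=[50, 20, 3, 5], size=(100, 4))

# Point path ("Microscope"): fine-grained, one feature vector per raw point
point_features = points                           # shape (100, 4)

# Pillar path ("Satellite"): pool points into 5 m x 5 m grid cells ("pillars")
cell = 5.0
grid = np.zeros((10, 8, 4))                       # 10 x-bins, 8 y-bins
counts = np.zeros((10, 8))
for p in points:
    i = int(p[0] // cell)                         # x bin: 0..9
    j = int((p[1] + 20) // cell)                  # y bin: 0..7
    grid[i, j] += p
    counts[i, j] += 1
pillar_features = grid / np.maximum(counts[..., None], 1)  # mean per occupied cell

print(point_features.shape)    # every dot kept individually
print(pillar_features.shape)   # the coarse "satellite map"
```

The point path preserves every detail (including each dot's speed), while the pillar path trades detail for a map-like overview — exactly the strengths and weaknesses of the two detectives.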

The Magic Trick: The "Coffee Break" (Feature Sharing)

In older models, these two detectives would work in separate rooms and only talk to each other at the very end. By then, they might have missed crucial clues.

DRIFT changes the game. It forces the two detectives to have a "coffee break" (a Feature Sharing Block) at every single step of their investigation.

  • The Microscope whispers to the Satellite: "Hey, I see a dot moving fast right here. Does that fit your map?"
  • The Satellite whispers back: "Yes, that dot is in a pedestrian zone, so it's probably a person, not a sign."

They constantly swap notes. The Microscope gets context, and the Satellite gets detail. They intertwine their findings until they are sure of what they are seeing.
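A "coffee break" can be sketched as a two-way exchange: each point borrows its pillar's context vector, and each pillar borrows a summary of its points' details, repeated at every stage. The simple add-and-pool scheme and all names below are illustrative assumptions, not the paper's exact Feature Sharing Block.

```python
import numpy as np

def feature_sharing(point_feats, pillar_feats, point_to_pillar):
    """point_feats: (N, D); pillar_feats: (P, D); point_to_pillar: (N,) cell ids."""
    # Point path <- Pillar path: each point is enriched with its cell's context
    context = pillar_feats[point_to_pillar]            # (N, D)
    new_points = point_feats + context                 # "Microscope gets the map"
    # Pillar path <- Point path: each cell absorbs the mean of its points' detail
    new_pillars = pillar_feats.copy()
    for c in range(pillar_feats.shape[0]):
        mask = point_to_pillar == c
        if mask.any():
            new_pillars[c] += point_feats[mask].mean(axis=0)  # "Satellite gets detail"
    return new_points, new_pillars

rng = np.random.default_rng(1)
pts, pil = rng.normal(size=(6, 4)), rng.normal(size=(3, 4))
assign = np.array([0, 0, 1, 1, 2, 2])                  # which pillar each point is in
for _ in range(3):                                     # share at *every* stage,
    pts, pil = feature_sharing(pts, pil, assign)       # not just once at the end
```

The key design point the paper makes is the loop at the bottom: older two-path models would call something like this once, at the very end; DRIFT interleaves the exchange throughout the network.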

Why Transformers? (The "Gossip Network")

The paper also uses something called Transformers. In the world of AI, imagine a group of people at a party.

  • Old AI: Only talks to the person standing right next to them.
  • Transformer: Can hear the gossip from the entire room instantly.

For the radar data, which is sparse and noisy, the "Satellite" detective uses this gossip network to connect dots that are far apart. It realizes, "Oh, that dot over there and that dot way over there are part of the same car," even if there are huge gaps between them.
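The "gossip network" is self-attention: every cell compares itself with every other cell and takes a weighted mix of the whole room, so two far-apart dots from the same car can share information in a single step. Below is a pure-numpy sketch under simplified assumptions (no learned weights, single head) — not the paper's actual layers.

```python
import numpy as np

def self_attention(x):
    """x: (P, D) pillar features -> (P, D) context-mixed features."""
    scores = x @ x.T / np.sqrt(x.shape[1])         # similarity of every cell pair
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax: "whose gossip to trust"
    return weights @ x                             # weighted mix over the whole room

# Two similar cells (same car) with a gap between them, plus one unrelated cell
pillars = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
out = self_attention(pillars)
# After mixing, the two similar cells end up with nearly identical features,
# even though no grid neighborhood connects them.
```

This is why Transformers suit sparse radar data: a convolution only "talks to the person standing next to it," which fails when the gaps between dots are larger than its receptive field.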

The Results: Why Should We Care?

The researchers tested DRIFT on real-world data (including a dataset from Delft, Netherlands).

  • The Competition: Other top models (like CenterPoint) got about 45% accuracy in spotting objects.
  • DRIFT: Got 52.6% accuracy.

That might sound like a small number, but in self-driving, it's a massive leap. DRIFT is much better at spotting pedestrians and cyclists, who are small, hard to see, and often look like noise on a radar.

The Bottom Line

DRIFT is like giving a self-driving car two pairs of eyes that constantly talk to each other. One pair sees the tiny details, the other sees the whole world. By forcing them to share information constantly, the car can finally "see" clearly through the fog, spotting a pedestrian in the rain where other systems would just see static noise. This makes autonomous driving safer, cheaper, and more reliable for everyone.