DRIFT: Dual-Representation Inter-Fusion Transformer for Automated Driving Perception with 4D Radar Point Clouds

This paper introduces DRIFT, a dual-path Transformer model that effectively fuses fine-grained local and coarse-grained global features from sparse 4D radar point clouds to achieve state-of-the-art performance in automated driving perception tasks like object detection and free road estimation.

Siqi Pei, Andras Palffy, Dariu M. Gavrila

Published Wed, 11 Ma

Imagine you are trying to navigate a car through a heavy fog. You have two tools to help you see:

  1. A Super-Sharp 3D Scanner (LiDAR): It gives you a crystal-clear, high-definition 3D map of everything around you. But it's expensive, and if it rains or gets foggy, it gets confused.
  2. A Weather-Proof Radar: It's cheap, works great in rain and fog, and even tells you how fast things are moving. But, it's like looking at the world through a very low-resolution screen. The image is "pixelated" and full of gaps. You can see a big truck, but a pedestrian might just look like a single, blurry dot.

The Problem:
For a self-driving car to be safe, it needs to understand the scene perfectly. With the radar, it's hard to tell if that single blurry dot is a person, a sign, or just a random speck of noise. If you only look at that one dot (local view), you can't be sure. You need to step back and look at the whole picture (global view) to understand the context.

The Solution: DRIFT
The paper introduces a new AI model called DRIFT (Dual-Representation Inter-Fusion Transformer). Think of DRIFT as a two-person detective team working together to solve the mystery of what's on the road.

The Two Detectives

Instead of using just one way to look at the data, DRIFT uses two different "lenses" simultaneously:

  1. The "Microscope" Detective (The Point Path):

    • This detective looks at the raw, individual radar dots.
    • Superpower: It sees the tiny details, like the exact shape of a dot or its speed.
    • Weakness: It's so focused on the small details that it gets lost. It doesn't know where it is in the big picture.
  2. The "Satellite" Detective (The Pillar Path):

    • This detective organizes the dots into a grid (like a map with squares). It looks at groups of dots together.
    • Superpower: It sees the big picture. It understands the layout of the road, the drivable areas, and the general flow of traffic.
    • Weakness: It's too zoomed out. It might miss the small details that distinguish a person from a sign.
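The two views above can be sketched in a few lines. This is an illustrative toy, not the paper's code: the point path simply keeps one feature vector per radar point, while the pillar path bins the same points into a coarse 2D grid and averages them per cell. All shapes, cell sizes, and feature choices here are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 radar points: (x, y, z, doppler_speed) -- 4D radar adds velocity per point
points = rng.uniform(low=[0, -20, 0, -5], high=[50, 20, 3, 5], size=(100, 4))

# Point path ("Microscope"): fine-grained, one feature vector per raw point
point_features = points                           # shape (100, 4)

# Pillar path ("Satellite"): pool points into 5 m x 5 m grid cells ("pillars")
cell = 5.0
grid = np.zeros((10, 8, 4))                       # 10 x-bins, 8 y-bins
counts = np.zeros((10, 8))
for p in points:
    i = int(p[0] // cell)                         # x bin: 0..9
    j = int((p[1] + 20) // cell)                  # y bin: 0..7
    grid[i, j] += p
    counts[i, j] += 1
pillar_features = grid / np.maximum(counts[..., None], 1)  # mean per occupied cell

print(point_features.shape)    # every dot kept individually
print(pillar_features.shape)   # the coarse "satellite map"
```

The point path preserves every detail (including each dot's speed), while the pillar path trades detail for a map-like overview — exactly the strengths and weaknesses of the two detectives.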

The Magic Trick: The "Coffee Break" (Feature Sharing)

In older models, these two detectives would work in separate rooms and only talk to each other at the very end. By then, they might have missed crucial clues.

DRIFT changes the game. It forces the two detectives to have a "coffee break" (a Feature Sharing Block) at every single step of their investigation.

  • The Microscope whispers to the Satellite: "Hey, I see a dot moving fast right here. Does that fit your map?"
  • The Satellite whispers back: "Yes, that dot is in a pedestrian zone, so it's probably a person, not a sign."

They constantly swap notes. The Microscope gets context, and the Satellite gets detail. They intertwine their findings until they are sure of what they are seeing.
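A "coffee break" can be sketched as a two-way exchange: each point borrows its pillar's context vector, and each pillar borrows a summary of its points' details, repeated at every stage. The simple add-and-pool scheme and all names below are illustrative assumptions, not the paper's exact Feature Sharing Block.

```python
import numpy as np

def feature_sharing(point_feats, pillar_feats, point_to_pillar):
    """point_feats: (N, D); pillar_feats: (P, D); point_to_pillar: (N,) cell ids."""
    # Point path <- Pillar path: each point is enriched with its cell's context
    context = pillar_feats[point_to_pillar]            # (N, D)
    new_points = point_feats + context                 # "Microscope gets the map"
    # Pillar path <- Point path: each cell absorbs the mean of its points' detail
    new_pillars = pillar_feats.copy()
    for c in range(pillar_feats.shape[0]):
        mask = point_to_pillar == c
        if mask.any():
            new_pillars[c] += point_feats[mask].mean(axis=0)  # "Satellite gets detail"
    return new_points, new_pillars

rng = np.random.default_rng(1)
pts, pil = rng.normal(size=(6, 4)), rng.normal(size=(3, 4))
assign = np.array([0, 0, 1, 1, 2, 2])                  # which pillar each point is in
for _ in range(3):                                     # share at *every* stage,
    pts, pil = feature_sharing(pts, pil, assign)       # not just once at the end
```

The key design point the paper makes is the loop at the bottom: older two-path models would call something like this once, at the very end; DRIFT interleaves the exchange throughout the network.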

Why Transformers? (The "Gossip Network")

The paper also uses something called Transformers. In the world of AI, imagine a group of people at a party.

  • Old AI: Only talks to the person standing right next to them.
  • Transformer: Can hear the gossip from the entire room instantly.

For the radar data, which is sparse and noisy, the "Satellite" detective uses this gossip network to connect dots that are far apart. It realizes, "Oh, that dot over there and that dot way over there are part of the same car," even if there are huge gaps between them.
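The "gossip network" is self-attention: every cell compares itself with every other cell and takes a weighted mix of the whole room, so two far-apart dots from the same car can share information in a single step. Below is a pure-numpy sketch under simplified assumptions (no learned weights, single head) — not the paper's actual layers.

```python
import numpy as np

def self_attention(x):
    """x: (P, D) pillar features -> (P, D) context-mixed features."""
    scores = x @ x.T / np.sqrt(x.shape[1])         # similarity of every cell pair
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax: "whose gossip to trust"
    return weights @ x                             # weighted mix over the whole room

# Two similar cells (same car) with a gap between them, plus one unrelated cell
pillars = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
out = self_attention(pillars)
# After mixing, the two similar cells end up with nearly identical features,
# even though no grid neighborhood connects them.
```

This is why Transformers suit sparse radar data: a convolution only "talks to the person standing next to it," which fails when the gaps between dots are larger than its receptive field.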

The Results: Why Should We Care?

The researchers tested DRIFT on real-world data (including a dataset from Delft, Netherlands).

  • The Competition: Other top models (like CenterPoint) got about 45% accuracy in spotting objects.
  • DRIFT: Got 52.6% accuracy.

That might sound like a small number, but in self-driving, it's a massive leap. DRIFT is much better at spotting pedestrians and cyclists, who are small, hard to see, and often look like noise on a radar.

The Bottom Line

DRIFT is like giving a self-driving car two pairs of eyes that constantly talk to each other. One pair sees the tiny details, the other sees the whole world. By forcing them to share information constantly, the car can finally "see" clearly through the fog, spotting a pedestrian in the rain where other systems would just see static noise. This makes autonomous driving safer, cheaper, and more reliable for everyone.