RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception

Imagine you are driving a self-driving car. Your car has eyes (cameras) and a special kind of "sonar" (4D radar).

The Eyes (Cameras): They are great at recognizing what things are. They can tell you, "That's a red stop sign," or "That's a cute dog." But they are terrible at judging how far away things are, especially in the dark or fog. It's like looking at a flat painting; you can see the details, but you can't tell if the tree is 10 feet away or 100 feet away.
The Sonar (Radar): It's the opposite. It's amazing at telling you exactly where things are in 3D space, even in a storm. But it's "blind" to details. It just sees a blurry blob and says, "Something is there," without knowing if it's a car, a person, or a tree.

The Problem: The "Blind Date" of Self-Driving Cars

In the real world, one car isn't enough. Cars need to talk to each other (Collaborative Perception) to see around corners and through traffic jams.

However, current systems have a big problem:

The "Blurry Map" Issue: When cars share what they "see" with cameras, their maps get messy. Because cameras are bad at depth, the shared data looks like a smeared watercolor painting. When Car A tries to merge its "smeared" view with Car B's view, the two views don't line up. It's like trying to build a Lego tower when half the bricks are made of jelly.
The "Chatterbox" Issue: To fix the mess, cars try to send everything to each other. This clogs the network, like a group chat where everyone is spamming photos. It uses too much data and slows everything down.

The Solution: RC-GeoCP (The "Smart Team Leader")

This paper introduces a new system called RC-GeoCP. Think of it as a smart team leader that organizes the conversation between cars so they can see clearly without shouting over each other.

It works in three simple steps:

1. The "Radar Anchor" (Geometric Structure Rectification)

Imagine you are trying to draw a map of a room, but your ruler is broken. You ask a friend who has a laser measure (the radar) to help.

How it works: The system takes the "smeared" camera images and uses the radar's precise measurements as a skeleton or anchor.
The Analogy: Think of the camera image as a loose sheet of fabric and the radar data as a rigid wire frame. RC-GeoCP stretches the fabric over the wire frame. Suddenly, the blurry "dog" in the camera image snaps into the exact 3D spot where the radar says it is. The "jelly" turns into solid Lego bricks.

2. The "Smart Messenger" (Uncertainty-Aware Communication)

Instead of every car sending a full video feed (which is heavy and slow), this system acts like a smart editor.

How it works: The car asks itself, "What am I confused about?" If I see a car clearly, I don't need help. But if I see a foggy spot where I'm not sure if there's a pedestrian, I send a message saying, "Hey, I'm unsure about this specific spot."
The Analogy: Imagine a group of hikers. Instead of everyone shouting their entire life story to the group, they only shout, "I see a bear!" or "I'm lost here!" The system only sends the most important, confusing, or missing pieces of the puzzle. In the paper's experiments, this cuts data traffic by roughly 40% compared with standard baselines.

3. The "Consensus Builder" (Consensus-Driven Assembler)

Now that the cars have sent their "smart messages," they need to put the puzzle together.

How it works: The system uses the radar "anchors" again to make sure everyone's puzzle pieces fit together. It pays less attention to spots where the radar says "no object here," and more attention to spots where the radar says "object here."
The Analogy: It's like a conductor in an orchestra. Even if the violinist (Camera) is playing a bit off-key, the conductor (Radar) knows exactly where the note should be. The system forces the music to align with the conductor's beat, creating a harmonious, clear picture of the road.

Why Does This Matter?

It's Cheaper: You don't need expensive, fragile LiDAR sensors (the "gold standard" but very costly). You can use cheaper cameras and radar.
It's Safer: It works better in bad weather (rain, fog, night) where cameras usually fail.
It's Faster: Because it sends less data, the cars react faster to dangers.

In a nutshell: RC-GeoCP is like giving a group of self-driving cars a shared 3D anchor (the radar) that keeps their blurry camera views in shape, while only letting them talk about the things they are actually confused about. It makes the whole team smarter, faster, and safer.

1. Problem Statement

Collaborative Perception (CP) aims to extend the sensing range of autonomous vehicles by sharing information among multiple agents (vehicles and infrastructure). While existing CP systems predominantly rely on LiDAR for precise 3D geometry, LiDAR is expensive and its performance degrades in adverse weather. Cameras offer rich semantics but lack depth certainty, leading to "depth ambiguity" and spatial smearing when their features are projected into Bird's-Eye-View (BEV). 4D radars provide weather-robust range and velocity measurements but are sparse and lack semantic detail.
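
The depth ambiguity described above can be made concrete with a small sketch. It assumes a simple pinhole camera and a flat BEV grid; the function name `bev_cells_along_ray`, the intrinsics, the depth ranges, and the 0.5 m cell size are illustrative choices, not values from the paper. It counts how many BEV cells a single pixel could land in when its depth is only loosely known, versus when a radar return narrows the depth down.

```python
import numpy as np

def bev_cells_along_ray(u, fx, cx, depths, cell_size):
    """Return the unique BEV cells that one pixel column can map to.

    u: pixel column; fx, cx: pinhole focal length and principal point (pixels).
    depths: candidate forward distances (meters); cell_size: BEV cell edge (meters).
    """
    depths = np.asarray(depths, dtype=float)
    lateral = (u - cx) / fx * depths  # sideways offset grows linearly with depth
    cells = np.stack(
        [np.floor(lateral / cell_size), np.floor(depths / cell_size)], axis=1
    ).astype(int)
    return np.unique(cells, axis=0)

# Hypothetical case: the camera alone only knows "somewhere between 10 m and 50 m".
smeared = bev_cells_along_ray(800, 1000.0, 640.0, np.linspace(10, 50, 81), 0.5)
# Hypothetical radar return near 30 m (assumed +/- 0.25 m) for the same pixel.
anchored = bev_cells_along_ray(800, 1000.0, 640.0, np.linspace(29.75, 30.25, 5), 0.5)

print(len(smeared), "candidate cells without a depth anchor;", len(anchored), "with one")
```

The same semantic feature is spread over many cells in the first case, and each agent spreads it along a different ray, which is one way to picture why camera-only BEV maps from different viewpoints fail to overlap cleanly.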

The core challenge the paper addresses is exploiting the complementary strengths of cameras and 4D radar in a collaborative setting. Specifically:

Geometric Misalignment: Monocular camera features suffer from depth ambiguity, causing spatial dispersion across agents.
Underexplored Fusion: While radar-camera fusion exists for single agents, its application in multi-agent CP is limited due to the difficulty of aligning asynchronous, sparse radar data with dense visual features across different viewpoints.
Communication Efficiency: Transmitting full feature maps is bandwidth-prohibitive; existing methods often prioritize high-confidence regions without resolving underlying geometric inconsistencies.

2. Methodology: RC-GeoCP

The authors propose RC-GeoCP, a framework that establishes a radar-anchored geometric consensus to guide communication and fusion. The pipeline consists of three tightly coupled components:

A. Geometric Structure Rectification (GSR)

Goal: To align visual semantics with physical space, mitigating the depth ambiguity of monocular cameras.
Mechanism:
- Uses sparse 4D radar features as a "physical anchor" to ground camera-derived BEV features.
- Initializes a radar-grounded query field by fusing camera features with downsampled radar features.
- Employs Deformable Cross-Attention to lift BEV queries into 3D space, aggregating image features based on radar-derived depth cues.
- Adaptive Gating: A gating mechanism balances visual richness and geometric precision, attenuating radar influence in regions of high visual certainty to preserve pure camera semantics where appropriate.
Outcome: Generates spatially grounded, geometry-consistent representations ( $\tilde{F}_i$ ) before communication.
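
The adaptive gating step listed above can be sketched as a per-cell blend of camera and radar BEV features. This is a minimal illustration only: the linear gate, the names `gated_fusion`, `w_gate`, and `b_gate`, and the feature shapes are assumptions, since this summary does not give the paper's exact gate formulation. The deformable cross-attention stage is not modeled here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(cam_bev, radar_bev, w_gate, b_gate):
    """Blend camera and radar BEV features with a per-cell scalar gate.

    cam_bev, radar_bev: arrays of shape (H, W, C).
    w_gate: weights of shape (2*C,), b_gate: scalar bias (hypothetical linear gate).
    Returns the fused features (H, W, C) and the gate map (H, W).
    """
    joint = np.concatenate([cam_bev, radar_bev], axis=-1)  # (H, W, 2C)
    gate = sigmoid(joint @ w_gate + b_gate)[..., None]     # (H, W, 1), in (0, 1)
    # gate near 0 keeps camera semantics; gate near 1 trusts the radar feature.
    fused = gate * radar_bev + (1.0 - gate) * cam_bev
    return fused, gate[..., 0]

rng = np.random.default_rng(0)
cam = rng.normal(size=(4, 4, 8))     # stand-in camera BEV features
radar = rng.normal(size=(4, 4, 8))   # stand-in radar BEV features
w = rng.normal(size=16) * 0.1        # stand-in for learned gate weights
fused, gate = gated_fusion(cam, radar, w, 0.0)
print(fused.shape, float(gate.min()), float(gate.max()))
```

In a trained model the gate parameters would be learned, so that cells with high visual certainty push the gate toward the camera branch, as the mechanism above describes.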

B. Uncertainty-Aware Communication (UAC)

Goal: To select and transmit only the most informative tokens under strict bandwidth constraints, focusing on resolving geometric ambiguities rather than reinforcing redundancy.
Mechanism:
- Formulates communication as an ego-centric conditional entropy reduction process.
- Demand Map Generation: The ego agent calculates its own perceptual uncertainty ( $U_i = 1 - \text{Confidence}$ ) and evaluates inter-agent disagreement ( $D_{i,j}$ ) with neighbors.
- Token Selection: Instead of transmitting dense maps, the system selects a sparse set of Ego Demand Tokens (Top-K locations where the neighbor's data differs significantly from the ego's uncertain perception).
- Learnable Agent-wise Tokens: To prevent information loss from discarding residual features, a learnable token aggregates the remaining (non-selected) features via cross-attention, preserving global context.
Outcome: Highly efficient transmission of only high-value, complementary information.
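
The token selection above can be sketched in a few lines. The sketch makes several assumptions that are not stated in this summary: the demand map is taken as the product of $U_i$ and $D_{i,j}$, disagreement is measured as an L2 feature distance, both feature maps are available in one place for illustration, and the "learnable" agent-wise token is replaced by a fixed query vector. All function names are hypothetical.

```python
import numpy as np

def select_demand_tokens(ego_conf, ego_feat, nbr_feat, k):
    """Pick the k locations where the ego is uncertain AND the neighbor disagrees.

    ego_conf: (H, W) confidence in [0, 1]; ego_feat, nbr_feat: (H, W, C),
    both already expressed in the ego frame.
    Returns (k, 2) row/col indices, (k, C) neighbor tokens, and the demand map.
    """
    uncertainty = 1.0 - ego_conf                                  # U_i
    disagreement = np.linalg.norm(nbr_feat - ego_feat, axis=-1)   # stand-in for D_ij
    demand = uncertainty * disagreement                           # assumed combination
    top = np.argsort(demand.ravel())[::-1][:k]                    # Top-K by demand
    rows, cols = np.unravel_index(top, demand.shape)
    return np.stack([rows, cols], axis=1), nbr_feat[rows, cols], demand

def summarize_residual(nbr_feat, selected_idx, query):
    """Attention-pool the non-selected features into one agent-wise token."""
    h, w, c = nbr_feat.shape
    keep = np.ones((h, w), dtype=bool)
    keep[selected_idx[:, 0], selected_idx[:, 1]] = False
    residual = nbr_feat[keep]                                     # (H*W - k, C)
    scores = residual @ query / np.sqrt(c)
    scores = scores - scores.max()                                # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ residual                                     # (C,)

ego_conf = np.ones((3, 3))
ego_conf[1, 2] = 0.2                  # the ego is unsure about one cell only
ego_feat = np.zeros((3, 3, 2))
nbr_feat = np.ones((3, 3, 2))         # the neighbor disagrees everywhere
idx, tokens, demand = select_demand_tokens(ego_conf, ego_feat, nbr_feat, k=1)
agent_token = summarize_residual(nbr_feat, idx, query=np.ones(2))
print(idx, tokens, agent_token)
```

Only the selected tokens and the single summary token would be transmitted, which is where the bandwidth saving comes from.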

C. Consensus-Driven Assembler (CDA)

Goal: To aggregate multi-agent information into a globally coherent representation using shared geometric references.
Mechanism:
- Agents predict a geometric reliability map from their radar features (without transmitting the raw radar data).
- These maps are aligned to the ego frame to create a Geometric Consensus ( $G_{i,j}$ ).
- During fusion, this consensus is injected into the attention logits of the transformer-based aggregator. It acts as a prior, weighting the attention mechanism to favor tokens that align with the shared physical geometry.
Outcome: Ensures that the final fused representation is physically grounded and spatially consistent, even when fusing heterogeneous data from different agents.
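
Injecting a prior into attention logits, as described above, can be illustrated with single-query attention. This is a hedged sketch: adding the log of the consensus score as a bias is one common way to apply such a prior and is assumed here, not taken from the paper; `consensus_attention` and `bias_scale` are hypothetical names.

```python
import numpy as np

def consensus_attention(query, keys, values, consensus, bias_scale=1.0):
    """Single-query attention with a geometric prior added to the logits.

    query: (C,); keys: (N, C); values: (N, D).
    consensus: (N,) reliability scores in [0, 1], standing in for G_ij.
    Returns the fused vector (D,) and the attention weights (N,).
    """
    c = query.shape[-1]
    logits = keys @ query / np.sqrt(c)
    # Additive log-prior: tokens with low consensus are down-weighted.
    logits = logits + bias_scale * np.log(consensus + 1e-6)
    logits = logits - logits.max()        # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ values, weights

query = np.ones(4)
keys = np.ones((2, 4))                    # two tokens that look identical to the query
values = np.array([[1.0, 0.0], [0.0, 1.0]])
# The radar-derived consensus supports the first token and not the second.
fused, weights = consensus_attention(query, keys, values, np.array([1.0, 0.0]))
print(weights, fused)
```

With equal consensus scores the two tokens would receive equal weight; the prior is what breaks the tie in favor of the geometrically supported token.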

3. Key Contributions

First Radar-Camera CP Framework: The authors present RC-GeoCP as the first framework to explore the fusion of 4D radar and images specifically for collaborative perception.
Geometric Consensus Mechanism: Introduces a paradigm in which radar serves as a shared physical reference to anchor camera semantics, mitigating the depth ambiguity problem in multi-agent settings.
Efficient Communication Strategy: Proposes UAC, which dynamically selects tokens based on epistemic uncertainty and inter-agent disagreement, significantly reducing communication overhead.
Unified Benchmarks: Establishes the first unified evaluation benchmarks for radar-camera CP on V2X-Radar (real-world) and V2X-R (simulated) datasets.

4. Experimental Results

The method was evaluated on V2X-Radar and V2X-R datasets, comparing against state-of-the-art LiDAR-centric and camera-centric CP methods (e.g., V2X-ViT, CoAlign, HEAL, Where2comm).

Performance:
- V2X-Radar: RC-GeoCP achieved 44.55% AP@0.5 and 25.92% AP@0.7, outperforming the best prior method (Where2comm) by 3.72 and 7.61 percentage points respectively. The gap widened at the stricter IoU threshold (AP@0.7), highlighting improved localization accuracy.
- V2X-R: Achieved 81.90% AP@0.5 and 65.09% AP@0.7, surpassing the leading baseline (HEAL) by roughly 2.9 percentage points.
Communication Efficiency:
- RC-GeoCP achieved these results with a communication cost of 2.39 (normalized units), which is ~40% lower than standard baselines (4.00) and nearly 66% lower than high-overhead methods like CoAlign and HEAL.
Robustness:
- Experiments with simulated pose noise and time delays (up to 100 ms) showed that RC-GeoCP remains more stable than the baselines, supporting the robustness of the geometric consensus mechanism.
Ablation Studies:
- Removing GSR caused a significant drop in performance (e.g., 7.05 percentage points of AP@0.7 on V2X-Radar), indicating that radar-anchored grounding is necessary.
- The combination of GSR, UAC, and CDA yielded the optimal trade-off between accuracy and bandwidth.
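
As a quick check on the communication-cost comparison reported above, and treating the normalized units as directly comparable (an assumption, since this summary does not define them), the relative saving against the 4.00 baseline works out to:

$$
\frac{4.00 - 2.39}{4.00} = \frac{1.61}{4.00} \approx 0.40
$$

In other words, RC-GeoCP transmits roughly 60% of the baseline volume, which corresponds to the ~40% reduction quoted above.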

5. Significance

Practical Deployment: By leveraging 4D radar (which is cheaper and more weather-resistant than LiDAR) to guide camera perception, RC-GeoCP offers a scalable, cost-effective solution for real-world autonomous driving.
Bandwidth Optimization: The framework demonstrates that high-precision collaborative perception does not require massive data transmission; instead, semantic-geometric alignment allows for sparse, high-value communication.
New Research Direction: It shifts the focus from LiDAR-centric CP to multi-modal radar-camera collaboration, providing a principled path for robust perception in heterogeneous V2X ecosystems.

In summary, RC-GeoCP bridges the gap between the semantic richness of cameras and the geometric stability of radar, yielding a robust, communication-efficient collaborative perception system that outperforms the compared state-of-the-art methods on both benchmarks.