HeCoFuse: Cross-Modal Complementary V2X Cooperative Perception with Heterogeneous Sensors

HeCoFuse is a unified framework for V2X cooperative perception. It combines hierarchical fusion with adaptive attention and a dynamic learning strategy to handle heterogeneous sensor configurations robustly, achieving top performance on the TUMTraf-V2X dataset and winning the CVPR 2025 DriveX challenge.

Chuheng Wei, Ziye Qin, Walter Zimmer, Guoyuan Wu, Matthew J. Barth

Published 2026-03-24

Imagine you are trying to solve a giant jigsaw puzzle, but the pieces are scattered across a city. Some people have high-definition, 3D laser scanners (LiDAR) that see perfectly in the dark but can't read colors. Others have high-quality cameras that see colors and textures beautifully but struggle in the dark or with distance. Some people have both, and some have neither.

This is the real-world problem of V2X (Vehicle-to-Everything) Cooperative Perception. Cars and traffic lights need to "talk" to each other to see around corners and avoid accidents. But in the real world, not every car or traffic light is equipped with the same expensive sensors.

The paper introduces HeCoFuse, a smart system designed to solve this "mismatched puzzle" problem. Here is how it works, explained simply:

1. The Problem: The "Mismatched Team"

In the past, researchers assumed every car and traffic light had the exact same super-sensors. But in reality, a city is a mix:

  • Car A has a LiDAR and a Camera.
  • Car B only has a Camera.
  • Traffic Light C only has a LiDAR.
  • Traffic Light D has nothing (or just a basic sensor).

If you try to force these different teams to work together using old methods, the system gets confused. It's like trying to mix oil and water, or asking a person who only speaks French to translate a document written in Chinese without a dictionary. The data doesn't line up, and the system fails.

2. The Solution: HeCoFuse (The "Universal Translator")

HeCoFuse is a new framework that acts as a universal translator and team manager. It doesn't care if your neighbor has a fancy laser scanner or just a cheap camera. It can take whatever information they have and blend it perfectly.

It does this using three main "superpowers":

A. The "Smart Mixer" (Hierarchical Attention)

Imagine you are in a noisy room with a group of people trying to describe a car driving by.

  • The person with the LiDAR says, "It's exactly 50 meters away!" (Very accurate on distance).
  • The person with the Camera says, "It's a red bus!" (Very accurate on what it is).

Old systems might just shout all the information at once, creating a mess. HeCoFuse uses a Smart Mixer. It listens to the LiDAR person when talking about distance and the Camera person when talking about color. It weighs who is right about what, based on who has the best tool for that specific job. This ensures the final picture is clear, even if one person is missing.
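This "weigh each voice per location" idea can be sketched in a few lines. The sketch below is a simplified, hypothetical illustration (not the paper's actual network): it uses feature magnitude as a stand-in for a learned attention score and blends two bird's-eye-view feature maps cell by cell.

```python
import numpy as np

def attention_fuse(lidar_feat, cam_feat):
    """Toy attention-weighted fusion of two (H, W, C) feature maps.

    Each spatial cell gets a softmax weight per modality, so cells with
    strong LiDAR evidence lean on LiDAR and vice versa. A real system
    would learn these scores with a small network; here we use feature
    magnitude purely for illustration.
    """
    # Per-cell "confidence" score for each modality.
    lidar_score = np.linalg.norm(lidar_feat, axis=-1, keepdims=True)
    cam_score = np.linalg.norm(cam_feat, axis=-1, keepdims=True)

    # Softmax over the two modalities -> per-cell attention weights.
    scores = np.concatenate([lidar_score, cam_score], axis=-1)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

    # Convex blend of the two modalities, cell by cell.
    fused = weights[..., 0:1] * lidar_feat + weights[..., 1:2] * cam_feat
    return fused, weights

# Toy example: a 4x4 bird's-eye-view grid with 8 channels per cell.
rng = np.random.default_rng(0)
lidar = rng.normal(size=(4, 4, 8))
cam = rng.normal(size=(4, 4, 8))
fused, w = attention_fuse(lidar, cam)
assert fused.shape == (4, 4, 8)
assert np.allclose(w.sum(axis=-1), 1.0)  # weights form a valid blend
```

Because the weights sum to 1 at every cell, dropping one modality simply shifts all the weight to the other, which is exactly the graceful-degradation behavior the analogy describes.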

B. The "Zoom Lens" (Adaptive Spatial Resolution)

Sometimes, one sensor gives you a super-detailed, high-resolution map, while another gives you a blurry, low-resolution sketch. If you try to glue them together directly, the result is jagged and ugly.

HeCoFuse acts like a smart zoom lens. Before mixing the data, it adjusts the "resolution" of the information. If one sensor is low-quality, it smooths out the high-quality data to match, or vice versa, so they fit together perfectly. This saves computer power (battery) while keeping the image sharp.
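Resolution matching can be illustrated with a minimal resampling step. This is a hypothetical sketch, not the paper's learned resampler: it uses plain nearest-neighbour upsampling to bring a coarse camera feature map onto the LiDAR map's grid before fusing.

```python
import numpy as np

def match_resolution(feat, target_hw):
    """Nearest-neighbour resample of an (H, W, C) feature map to target_hw.

    A simple stand-in for adaptive spatial-resolution adjustment; a real
    system might use learned interpolation instead.
    """
    h, w, _ = feat.shape
    th, tw = target_hw
    rows = (np.arange(th) * h) // th  # map each target row to a source row
    cols = (np.arange(tw) * w) // tw  # likewise for columns
    return feat[rows][:, cols]

# High-res LiDAR map (8x8) and low-res camera map (4x4):
lidar = np.ones((8, 8, 16))
cam = np.ones((4, 4, 16))

# Bring the camera features up to the LiDAR grid, then blend.
cam_up = match_resolution(cam, (8, 8))
assert cam_up.shape == (8, 8, 16)
fused = 0.5 * (lidar + cam_up)
```

Resampling before fusion keeps the grids aligned, so the blend never mixes mismatched cells; choosing which direction to resample (up or down) is where the compute savings come from.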

C. The "Flexible Team Player" (Cooperative Learning)

Most AI systems are trained to work only in one specific setup (e.g., "Only when everyone has LiDAR"). If you change the setup, the AI breaks.

HeCoFuse is trained like a versatile athlete. During its training, the system was randomly given different combinations of sensors (sometimes 2 LiDARs, sometimes 1 Camera + 1 LiDAR, sometimes just Cameras). It learned to adapt on the fly. If a sensor fails or is missing, the system doesn't crash; it just re-arranges its strategy to make the best of what's left.
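The training recipe amounts to randomly sampling a sensor configuration at each step and dropping the features of any absent sensor. The sketch below is a simplified illustration under assumed names (`SENSOR_SETS`, `sample_setup`, `mask_features` are all hypothetical, not the paper's API):

```python
import random

# Possible sensor sets an agent might have at any training step.
SENSOR_SETS = [("lidar", "camera"), ("lidar",), ("camera",)]

def sample_setup(rng):
    """Randomly assign a sensor set to each agent for this step."""
    return {
        "vehicle": rng.choice(SENSOR_SETS),
        "infrastructure": rng.choice(SENSOR_SETS),
    }

def mask_features(features, setup):
    """Drop features from any sensor absent in this step's setup,
    so the model must learn to cope with whatever remains."""
    return {
        agent: {s: f for s, f in sensors.items() if s in setup[agent]}
        for agent, sensors in features.items()
    }

rng = random.Random(42)
setup = sample_setup(rng)
feats = {
    "vehicle": {"lidar": "lidar_feat", "camera": "cam_feat"},
    "infrastructure": {"lidar": "lidar_feat", "camera": "cam_feat"},
}
kept = mask_features(feats, setup)
assert set(kept["vehicle"]).issubset({"lidar", "camera"})
```

Because every configuration appears during training, no single setup is "the" setup the model depends on, which is why a missing sensor at test time degrades performance gracefully instead of breaking it.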

3. The Results: Winning the Race

The researchers tested HeCoFuse on a real-world dataset from Munich (TUMTraf-V2X), which is like a digital twin of a busy city intersection.

  • The Score: In a competition called the CVPR 2025 DriveX Challenge, HeCoFuse took First Place.
  • The Proof: Even when the vehicle had only a camera and the traffic light had only a LiDAR (a very difficult mismatch), the system still outperformed previous methods that assumed every agent had the same expensive gear.

Why This Matters

Think of HeCoFuse as the glue that holds a smart city together.

  • For the City: You don't need to replace every old traffic light with a million-dollar sensor. You can keep the old ones and just add the new system.
  • For Safety: It means cars can "see" around corners and in the dark, even if their own sensors are limited, because they are borrowing the "eyes" of the infrastructure.
  • For the Future: It makes autonomous driving cheaper and more realistic, because it accepts the messy, imperfect reality of the real world rather than waiting for a perfect, expensive future.

In short, HeCoFuse is the system that says: "It doesn't matter what tools you have; as long as we talk to each other, we can all see the whole picture."
