HeCoFuse: Cross-Modal Complementary V2X Cooperative Perception with Heterogeneous Sensors

HeCoFuse is a unified framework for V2X cooperative perception. It combines hierarchical fusion with adaptive attention and a dynamic learning strategy to handle heterogeneous sensor configurations robustly, achieving top performance on the TUMTraf-V2X dataset and winning the CVPR 2025 DriveX challenge.

Chuheng Wei, Ziye Qin, Walter Zimmer, Guoyuan Wu, Matthew J. Barth

Published 2026-03-24

Imagine you are trying to solve a giant jigsaw puzzle, but the pieces are scattered across a city. Some people have high-definition, 3D laser scanners (LiDAR) that see perfectly in the dark but can't read colors. Others have high-quality cameras that see colors and textures beautifully but struggle in the dark or with distance. Some people have both, and some have neither.

This is the real-world problem of V2X (Vehicle-to-Everything) Cooperative Perception. Cars and traffic lights need to "talk" to each other to see around corners and avoid accidents. But in the real world, not every car or traffic light is equipped with the same expensive sensors.

The paper introduces HeCoFuse, a smart system designed to solve this "mismatched puzzle" problem. Here is how it works, explained simply:

1. The Problem: The "Mismatched Team"

In the past, researchers assumed every car and traffic light had the exact same super-sensors. But in reality, a city is a mix:

  • Car A has a LiDAR and a Camera.
  • Car B only has a Camera.
  • Traffic Light C only has a LiDAR.
  • Traffic Light D has nothing (or just a basic sensor).

If you try to force these different teams to work together using old methods, the system gets confused. It's like trying to mix oil and water, or asking a person who only speaks French to translate a document written in Chinese without a dictionary. The data doesn't line up, and the system fails.

2. The Solution: HeCoFuse (The "Universal Translator")

HeCoFuse is a new framework that acts as a universal translator and team manager. It doesn't care if your neighbor has a fancy laser scanner or just a cheap camera. It can take whatever information they have and blend it perfectly.

It does this using three main "superpowers":

A. The "Smart Mixer" (Hierarchical Attention)

Imagine you are in a noisy room with a group of people trying to describe a car driving by.

  • The person with the LiDAR says, "It's exactly 50 meters away!" (Very accurate on distance).
  • The person with the Camera says, "It's a red bus!" (Very accurate on what it is).

Old systems might just shout all the information at once, creating a mess. HeCoFuse uses a Smart Mixer. It listens to the LiDAR person when talking about distance and the Camera person when talking about color. It weighs who is right about what, based on who has the best tool for that specific job. This ensures the final picture is clear, even if one person is missing.
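This "weigh each voice per location" idea can be sketched in a few lines. The sketch below is a simplified, hypothetical illustration (not the paper's actual network): it uses feature magnitude as a stand-in for a learned attention score and blends two bird's-eye-view feature maps cell by cell.

```python
import numpy as np

def attention_fuse(lidar_feat, cam_feat):
    """Toy attention-weighted fusion of two (H, W, C) feature maps.

    Each spatial cell gets a softmax weight per modality, so cells with
    strong LiDAR evidence lean on LiDAR and vice versa. A real system
    would learn these scores with a small network; here we use feature
    magnitude purely for illustration.
    """
    # Per-cell "confidence" score for each modality.
    lidar_score = np.linalg.norm(lidar_feat, axis=-1, keepdims=True)
    cam_score = np.linalg.norm(cam_feat, axis=-1, keepdims=True)

    # Softmax over the two modalities -> per-cell attention weights.
    scores = np.concatenate([lidar_score, cam_score], axis=-1)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

    # Convex blend of the two modalities, cell by cell.
    fused = weights[..., 0:1] * lidar_feat + weights[..., 1:2] * cam_feat
    return fused, weights

# Toy example: a 4x4 bird's-eye-view grid with 8 channels per cell.
rng = np.random.default_rng(0)
lidar = rng.normal(size=(4, 4, 8))
cam = rng.normal(size=(4, 4, 8))
fused, w = attention_fuse(lidar, cam)
assert fused.shape == (4, 4, 8)
assert np.allclose(w.sum(axis=-1), 1.0)  # weights form a valid blend
```

Because the weights sum to 1 at every cell, dropping one modality simply shifts all the weight to the other, which is exactly the graceful-degradation behavior the analogy describes.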

B. The "Zoom Lens" (Adaptive Spatial Resolution)

Sometimes, one sensor gives you a super-detailed, high-resolution map, while another gives you a blurry, low-resolution sketch. If you try to glue them together directly, the result is jagged and ugly.

HeCoFuse acts like a smart zoom lens. Before mixing the data, it adjusts the "resolution" of the information. If one sensor is low-quality, it smooths out the high-quality data to match, or vice versa, so they fit together perfectly. This saves computer power (battery) while keeping the image sharp.
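Resolution matching can be illustrated with a minimal resampling step. This is a hypothetical sketch, not the paper's learned resampler: it uses plain nearest-neighbour upsampling to bring a coarse camera feature map onto the LiDAR map's grid before fusing.

```python
import numpy as np

def match_resolution(feat, target_hw):
    """Nearest-neighbour resample of an (H, W, C) feature map to target_hw.

    A simple stand-in for adaptive spatial-resolution adjustment; a real
    system might use learned interpolation instead.
    """
    h, w, _ = feat.shape
    th, tw = target_hw
    rows = (np.arange(th) * h) // th  # map each target row to a source row
    cols = (np.arange(tw) * w) // tw  # likewise for columns
    return feat[rows][:, cols]

# High-res LiDAR map (8x8) and low-res camera map (4x4):
lidar = np.ones((8, 8, 16))
cam = np.ones((4, 4, 16))

# Bring the camera features up to the LiDAR grid, then blend.
cam_up = match_resolution(cam, (8, 8))
assert cam_up.shape == (8, 8, 16)
fused = 0.5 * (lidar + cam_up)
```

Resampling before fusion keeps the grids aligned, so the blend never mixes mismatched cells; choosing which direction to resample (up or down) is where the compute savings come from.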

C. The "Flexible Team Player" (Cooperative Learning)

Most AI systems are trained to work only in one specific setup (e.g., "Only when everyone has LiDAR"). If you change the setup, the AI breaks.

HeCoFuse is trained like a versatile athlete. During its training, the system was randomly given different combinations of sensors (sometimes 2 LiDARs, sometimes 1 Camera + 1 LiDAR, sometimes just Cameras). It learned to adapt on the fly. If a sensor fails or is missing, the system doesn't crash; it just re-arranges its strategy to make the best of what's left.
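The training recipe amounts to randomly sampling a sensor configuration at each step and dropping the features of any absent sensor. The sketch below is a simplified illustration under assumed names (`SENSOR_SETS`, `sample_setup`, `mask_features` are all hypothetical, not the paper's API):

```python
import random

# Possible sensor sets an agent might have at any training step.
SENSOR_SETS = [("lidar", "camera"), ("lidar",), ("camera",)]

def sample_setup(rng):
    """Randomly assign a sensor set to each agent for this step."""
    return {
        "vehicle": rng.choice(SENSOR_SETS),
        "infrastructure": rng.choice(SENSOR_SETS),
    }

def mask_features(features, setup):
    """Drop features from any sensor absent in this step's setup,
    so the model must learn to cope with whatever remains."""
    return {
        agent: {s: f for s, f in sensors.items() if s in setup[agent]}
        for agent, sensors in features.items()
    }

rng = random.Random(42)
setup = sample_setup(rng)
feats = {
    "vehicle": {"lidar": "lidar_feat", "camera": "cam_feat"},
    "infrastructure": {"lidar": "lidar_feat", "camera": "cam_feat"},
}
kept = mask_features(feats, setup)
assert set(kept["vehicle"]).issubset({"lidar", "camera"})
```

Because every configuration appears during training, no single setup is "the" setup the model depends on, which is why a missing sensor at test time degrades performance gracefully instead of breaking it.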

3. The Results: Winning the Race

The researchers tested HeCoFuse on a real-world dataset from Munich (TUMTraf-V2X), which is like a digital twin of a busy city intersection.

  • The Score: In a competition called the CVPR 2025 DriveX Challenge, HeCoFuse took First Place.
  • The Proof: Even when the vehicle had only a camera and the traffic light had only a LiDAR (a very difficult mismatch), the system still outperformed previous methods that assumed every agent had the same expensive gear.

Why This Matters

Think of HeCoFuse as the glue that holds a smart city together.

  • For the City: You don't need to replace every old traffic light with a million-dollar sensor. You can keep the old ones and just add the new system.
  • For Safety: It means cars can "see" around corners and in the dark, even if their own sensors are limited, because they are borrowing the "eyes" of the infrastructure.
  • For the Future: It makes autonomous driving cheaper and more realistic, because it accepts the messy, imperfect reality of the real world rather than waiting for a perfect, expensive future.

In short, HeCoFuse is the system that says: "It doesn't matter what tools you have; as long as we talk to each other, we can all see the whole picture."
