Linking Modality Isolation in Heterogeneous Collaborative Perception

To address modality isolation in heterogeneous collaborative perception, where agents with different sensors lack co-occurring training data, the paper proposes CodeAlign: an efficient, co-occurrence-free framework that aligns modalities through cross-modal feature-code-feature translation over learned codebooks, achieving state-of-the-art performance.

Changxing Liu, Zichen Chao, Siheng Chen

Published 2026-03-03

Imagine a group of friends trying to solve a mystery together, but they all speak different languages and have never met each other before.

  • Friend A only has a high-tech 3D laser scanner (LiDAR) that sees the world as a cloud of points.
  • Friend B only has a standard camera that sees the world as a 2D photo.
  • Friend C has a different type of laser scanner.

In the world of self-driving cars, this is called Heterogeneous Collaborative Perception. The goal is for these cars to share what they see so they can "see" more than any single car could alone.

The Problem: The "Modality Isolation" Wall

Usually, to teach these cars to work together, you show them videos where Friend A and Friend B are looking at the exact same car at the exact same time. This helps them learn, "Oh, this laser point cloud looks like this camera picture."

But in the real world, data is messy.

  • Friend A's data was collected in New York in 2023.
  • Friend B's data was collected in Tokyo in 2024.
  • They never appear in the same video frame together.

This is Modality Isolation. It's like trying to translate a book from English to Japanese, but you've never seen an English sentence and its Japanese translation side-by-side. You only have a pile of English books and a pile of Japanese books, but no dictionary to link them.

Existing methods try to force a connection, but they fail because they depend on seeing the same scene through both sensors at training time, which is exactly the data we don't have. They also require sending huge amounts of raw feature data (like mailing the whole photo instead of a short description), which clogs up the network.

The Solution: CodeAlign (The Universal Translator)

The authors propose CodeAlign, a clever new system that acts like a Universal Translator that doesn't need the original text to work.

Here is how it works, using a simple analogy:

1. The "Codebook" (The Secret Dictionary)

Instead of trying to translate "Laser Point" directly to "Camera Pixel" (which is hard because they are so different), CodeAlign creates a Codebook for each friend.

  • Think of the Codebook as a dictionary of "concepts" or "stamps."
  • Friend A (LiDAR) learns to describe everything using a specific set of stamps (e.g., "Car-Shape-Stamp," "Tree-Shape-Stamp").
  • Friend B (Camera) learns to describe everything using the same set of stamps.

Even though they speak different languages, they agree on the meaning of the stamps.
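The "stamp" idea is essentially nearest-neighbour quantization against a codebook. Here is a minimal sketch of that lookup: the codebook values, the feature vectors, and the stamp labels are all made up for illustration, not taken from the paper.

```python
import numpy as np

# Toy shared codebook: 4 "stamps", each a 3-dim concept vector.
# (Illustrative values; the real codebook is learned during training.)
codebook = np.array([
    [1.0, 0.0, 0.0],   # stamp 0: e.g. "car-like"
    [0.0, 1.0, 0.0],   # stamp 1: e.g. "tree-like"
    [0.0, 0.0, 1.0],   # stamp 2: e.g. "road-like"
    [0.5, 0.5, 0.0],   # stamp 3: e.g. "truck-like"
])

def quantize(feature, codebook):
    """Return the index of the nearest codebook entry (the 'stamp')."""
    dists = np.linalg.norm(codebook - feature, axis=1)
    return int(np.argmin(dists))

# Two agents with different sensors produce different feature vectors
# for the same object, yet both land on the same stamp.
lidar_feature  = np.array([0.9, 0.1, 0.0])   # hypothetical LiDAR encoding
camera_feature = np.array([0.8, 0.2, 0.1])   # hypothetical camera encoding

print(quantize(lidar_feature, codebook))   # -> 0
print(quantize(camera_feature, codebook))  # -> 0
```

The point of the sketch: as long as both encoders are trained so that the same concept lands near the same codebook entry, the integer index becomes a sensor-agnostic "word."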

2. The "FCF" Translation (Feature-Code-Feature)

This is the magic trick. When Friend A wants to talk to Friend B:

  1. Feature to Code: Friend A looks at a laser point cloud and says, "This is Stamp #42." (They don't send the whole cloud; they just send the number 42).
  2. Code to Feature: Friend B receives "Stamp #42." Because they both use the same dictionary, Friend B knows exactly what "Stamp #42" means in their own camera world. They instantly reconstruct that object's features in their own camera-style representation, no raw data required.

The Result: They never needed to see the same scene together to learn this. They just needed to learn their own "stamps" separately, and then agree on what the stamps mean.
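The two-step translation above can be sketched in a few lines. Everything here is a toy: the codebooks are tiny hand-written arrays, and in the real system each agent's codebook lives in its own high-dimensional feature space; only the alignment of indices is the idea being illustrated.

```python
import numpy as np

# Each agent keeps its own codebook, but the indices are aligned:
# index k means the same concept for both. (Toy values, not the paper's.)
lidar_codebook  = np.array([[1.0, 0.0], [0.0, 1.0]])             # LiDAR space
camera_codebook = np.array([[0.2, 0.8, 0.5], [0.9, 0.1, 0.3]])   # camera space

def feature_to_code(feature, codebook):
    """Sender side: nearest-neighbour lookup -> one integer code."""
    return int(np.argmin(np.linalg.norm(codebook - feature, axis=1)))

def code_to_feature(code, codebook):
    """Receiver side: look the code up in its *own* codebook."""
    return codebook[code]

# LiDAR agent sees something and sends only one integer over the network.
code = feature_to_code(np.array([0.9, 0.2]), lidar_codebook)   # -> 0

# Camera agent reconstructs the concept in its own feature space.
reconstructed = code_to_feature(code, camera_codebook)
print(code, reconstructed)   # 0 [0.2 0.8 0.5]
```

Note what crossed the network: a single integer, not a point cloud and not a feature map.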

Why is this a Big Deal?

The paper highlights three massive wins:

  1. It Works Without Meeting: You can train Friend A and Friend B separately. You don't need a dataset where they are together. This solves the "Modality Isolation" problem.
  2. It's Super Fast and Cheap:
    • Old Way: Sending a full 3D scan is like mailing a heavy suitcase.
    • CodeAlign Way: Sending "Stamp #42" is like sending a text message.
    • The paper says this reduces the data traffic by 1,024 times. That's like going from a highway traffic jam to a single bicycle lane.
  3. It's Smarter: Because the "stamps" are very precise, the cars actually see better than before. In tests, it improved detection accuracy significantly compared to previous methods.
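The 1,024x figure in point 2 can be sanity-checked with back-of-envelope arithmetic. The shapes below are my assumptions for illustration (the paper reports the ratio, not necessarily these exact sizes):

```python
# One way a ~1,024x saving could arise (illustrative numbers only).

channels = 256          # hypothetical feature channels per spatial cell
bytes_per_float = 4     # float32

raw_bytes  = channels * bytes_per_float   # sending the full feature vector
code_bytes = 1                            # sending one 8-bit code index

print(raw_bytes // code_bytes)   # -> 1024
```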

The "Group" Trick

If two friends do happen to have data together (like Friend A and Friend C), CodeAlign lets them share a Group Codebook. This is like two friends who speak similar dialects agreeing to use a shared, slightly larger dictionary. This makes the translation even more accurate and saves time.
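One way to picture the group trick: agents that trained together point at one shared (and slightly larger) codebook, while an isolated agent keeps its own. The grouping, codebook entries, and agent names below are hypothetical, sketched only to show the lookup structure:

```python
import numpy as np

# Hypothetical grouping: A and C trained together, B trained alone.
group_codebook = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # A + C
solo_codebook  = np.array([[1.0, 0.0], [0.0, 1.0]])              # B only

codebook_of = {
    "A": group_codebook,   # LiDAR, co-trained with C
    "C": group_codebook,   # other LiDAR, co-trained with A
    "B": solo_codebook,    # camera, trained in isolation
}

def encode(agent, feature):
    """Quantize a feature against the agent's assigned codebook."""
    cb = codebook_of[agent]
    return int(np.argmin(np.linalg.norm(cb - feature, axis=1)))

# A can use the extra shared "dialect" stamp; B falls back to its own.
print(encode("A", np.array([0.6, 0.6])))   # -> 2 (group-only stamp)
print(encode("B", np.array([0.6, 0.6])))   # nearest solo stamp instead
```

The extra entries in the group codebook are the "larger dictionary": finer-grained concepts that only the co-trained agents can express.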

Summary

CodeAlign is like giving every self-driving car a pocket dictionary of "universal concepts." Even if the cars have never met and have totally different sensors, they can instantly translate their observations into a common language, share it efficiently, and understand the world together better than ever before. It turns a chaotic, disconnected group of sensors into a perfectly synchronized team.