Linking Modality Isolation in Heterogeneous Collaborative Perception

To address modality isolation in heterogeneous collaborative perception, where agents with different sensors lack co-occurring training data, the paper proposes CodeAlign: an efficient, co-occurrence-free framework that aligns modalities through cross-modal feature-code-feature translation over learned codebooks, achieving state-of-the-art performance.

Changxing Liu, Zichen Chao, Siheng Chen

Published 2026-03-03

Imagine a group of friends trying to solve a mystery together, but they all speak different languages and have never met each other before.

  • Friend A only has a high-tech 3D laser scanner (LiDAR) that sees the world as a cloud of points.
  • Friend B only has a standard camera that sees the world as a 2D photo.
  • Friend C has a different type of laser scanner.

In the world of self-driving cars, this is called Heterogeneous Collaborative Perception. The goal is for these cars to share what they see so they can "see" more than any single car could alone.

The Problem: The "Modality Isolation" Wall

Usually, to teach these cars to work together, you show them videos where Friend A and Friend B are looking at the exact same car at the exact same time. This helps them learn, "Oh, this laser point cloud looks like this camera picture."

But in the real world, data is messy.

  • Friend A's data was collected in New York in 2023.
  • Friend B's data was collected in Tokyo in 2024.
  • They never appear in the same video frame together.

This is Modality Isolation. It's like trying to translate a book from English to Japanese, but you've never seen an English sentence and its Japanese translation side-by-side. You only have a pile of English books and a pile of Japanese books, but no dictionary to link them.

Existing methods try to force a connection, but they fail because they depend on seeing the same scene through both sensors at training time, which is exactly the data we don't have. They also require sending huge amounts of raw feature data (like mailing the whole photo instead of a short description), which clogs up the network.

The Solution: CodeAlign (The Universal Translator)

The authors propose CodeAlign, a clever new system that acts like a Universal Translator that doesn't need the original text to work.

Here is how it works, using a simple analogy:

1. The "Codebook" (The Secret Dictionary)

Instead of trying to translate "Laser Point" directly to "Camera Pixel" (which is hard because they are so different), CodeAlign creates a Codebook for each friend.

  • Think of the Codebook as a dictionary of "concepts" or "stamps."
  • Friend A (LiDAR) learns to describe everything using a specific set of stamps (e.g., "Car-Shape-Stamp," "Tree-Shape-Stamp").
  • Friend B (Camera) learns to describe everything using the same set of stamps.

Even though they speak different languages, they agree on the meaning of the stamps.
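The "stamp" idea is essentially nearest-neighbour quantization against a codebook. Here is a minimal sketch of that lookup: the codebook values, the feature vectors, and the stamp labels are all made up for illustration, not taken from the paper.

```python
import numpy as np

# Toy shared codebook: 4 "stamps", each a 3-dim concept vector.
# (Illustrative values; the real codebook is learned during training.)
codebook = np.array([
    [1.0, 0.0, 0.0],   # stamp 0: e.g. "car-like"
    [0.0, 1.0, 0.0],   # stamp 1: e.g. "tree-like"
    [0.0, 0.0, 1.0],   # stamp 2: e.g. "road-like"
    [0.5, 0.5, 0.0],   # stamp 3: e.g. "truck-like"
])

def quantize(feature, codebook):
    """Return the index of the nearest codebook entry (the 'stamp')."""
    dists = np.linalg.norm(codebook - feature, axis=1)
    return int(np.argmin(dists))

# Two agents with different sensors produce different feature vectors
# for the same object, yet both land on the same stamp.
lidar_feature  = np.array([0.9, 0.1, 0.0])   # hypothetical LiDAR encoding
camera_feature = np.array([0.8, 0.2, 0.1])   # hypothetical camera encoding

print(quantize(lidar_feature, codebook))   # -> 0
print(quantize(camera_feature, codebook))  # -> 0
```

The point of the sketch: as long as both encoders are trained so that the same concept lands near the same codebook entry, the integer index becomes a sensor-agnostic "word."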

2. The "FCF" Translation (Feature-Code-Feature)

This is the magic trick. When Friend A wants to talk to Friend B:

  1. Feature to Code: Friend A looks at a laser point cloud and says, "This is Stamp #42." (They don't send the whole cloud; they just send the number 42).
  2. Code to Feature: Friend B receives "Stamp #42." Because they both use the same dictionary, Friend B knows exactly what "Stamp #42" means in their own camera world. They instantly reconstruct that object's features in their own camera-style representation, no raw data required.

The Result: They never needed to see the same scene together to learn this. They just needed to learn their own "stamps" separately, and then agree on what the stamps mean.
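The two-step translation above can be sketched in a few lines. Everything here is a toy: the codebooks are tiny hand-written arrays, and in the real system each agent's codebook lives in its own high-dimensional feature space; only the alignment of indices is the idea being illustrated.

```python
import numpy as np

# Each agent keeps its own codebook, but the indices are aligned:
# index k means the same concept for both. (Toy values, not the paper's.)
lidar_codebook  = np.array([[1.0, 0.0], [0.0, 1.0]])             # LiDAR space
camera_codebook = np.array([[0.2, 0.8, 0.5], [0.9, 0.1, 0.3]])   # camera space

def feature_to_code(feature, codebook):
    """Sender side: nearest-neighbour lookup -> one integer code."""
    return int(np.argmin(np.linalg.norm(codebook - feature, axis=1)))

def code_to_feature(code, codebook):
    """Receiver side: look the code up in its *own* codebook."""
    return codebook[code]

# LiDAR agent sees something and sends only one integer over the network.
code = feature_to_code(np.array([0.9, 0.2]), lidar_codebook)   # -> 0

# Camera agent reconstructs the concept in its own feature space.
reconstructed = code_to_feature(code, camera_codebook)
print(code, reconstructed)   # 0 [0.2 0.8 0.5]
```

Note what crossed the network: a single integer, not a point cloud and not a feature map.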

Why is this a Big Deal?

The paper highlights three massive wins:

  1. It Works Without Meeting: You can train Friend A and Friend B separately. You don't need a dataset where they are together. This solves the "Modality Isolation" problem.
  2. It's Super Fast and Cheap:
    • Old Way: Sending a full 3D scan is like mailing a heavy suitcase.
    • CodeAlign Way: Sending "Stamp #42" is like sending a text message.
    • The paper says this reduces the data traffic by 1,024 times. That's like going from a highway traffic jam to a single bicycle lane.
  3. It's Smarter: Because the "stamps" are very precise, the cars actually see better than before. In tests, it improved detection accuracy significantly compared to previous methods.
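The 1,024x figure in point 2 can be sanity-checked with back-of-envelope arithmetic. The shapes below are my assumptions for illustration (the paper reports the ratio, not necessarily these exact sizes):

```python
# One way a ~1,024x saving could arise (illustrative numbers only).

channels = 256          # hypothetical feature channels per spatial cell
bytes_per_float = 4     # float32

raw_bytes  = channels * bytes_per_float   # sending the full feature vector
code_bytes = 1                            # sending one 8-bit code index

print(raw_bytes // code_bytes)   # -> 1024
```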

The "Group" Trick

If two friends do happen to have data together (like Friend A and Friend C), CodeAlign lets them share a Group Codebook. This is like two friends who speak similar dialects agreeing to use a shared, slightly larger dictionary. This makes the translation even more accurate and saves time.
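One way to picture the group trick: agents that trained together point at one shared (and slightly larger) codebook, while an isolated agent keeps its own. The grouping, codebook entries, and agent names below are hypothetical, sketched only to show the lookup structure:

```python
import numpy as np

# Hypothetical grouping: A and C trained together, B trained alone.
group_codebook = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # A + C
solo_codebook  = np.array([[1.0, 0.0], [0.0, 1.0]])              # B only

codebook_of = {
    "A": group_codebook,   # LiDAR, co-trained with C
    "C": group_codebook,   # other LiDAR, co-trained with A
    "B": solo_codebook,    # camera, trained in isolation
}

def encode(agent, feature):
    """Quantize a feature against the agent's assigned codebook."""
    cb = codebook_of[agent]
    return int(np.argmin(np.linalg.norm(cb - feature, axis=1)))

# A can use the extra shared "dialect" stamp; B falls back to its own.
print(encode("A", np.array([0.6, 0.6])))   # -> 2 (group-only stamp)
print(encode("B", np.array([0.6, 0.6])))   # nearest solo stamp instead
```

The extra entries in the group codebook are the "larger dictionary": finer-grained concepts that only the co-trained agents can express.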

Summary

CodeAlign is like giving every self-driving car a pocket dictionary of "universal concepts." Even if the cars have never met and have totally different sensors, they can instantly translate their observations into a common language, share it efficiently, and understand the world together better than ever before. It turns a chaotic, disconnected group of sensors into a perfectly synchronized team.