GT-Space: Enhancing Heterogeneous Collaborative Perception with Ground Truth Feature Space

GT-Space is a scalable collaborative perception framework that lets heterogeneous autonomous agents align their diverse features through a unified ground-truth feature space, using a single small adapter module per agent. This eliminates the need for pairwise retraining while achieving superior detection accuracy on both simulated and real-world datasets.

Wentao Wang, Haoran Xu, Guang Tan

Published 2026-03-23

Imagine a group of autonomous cars driving down a highway, trying to see around corners and through fog. To do this safely, they need to "talk" to each other, sharing what their sensors see. This is called Collaborative Perception.

However, there's a big problem: Not all cars are built the same.

  • Car A might have a super-precise 3D laser scanner (LiDAR) that sees the world as a cloud of dots.
  • Car B might only have a standard camera that sees the world as a 2D photo.
  • Car C might use a different type of laser scanner entirely.

If Car A tries to share its "dot cloud" with Car B, Car B can't make sense of it. It's like reading a recipe aloud to someone who speaks a different language.

The Old Way: The "Translator" Problem

Previously, to make these different cars talk, engineers had to build a custom translator for every single pair of cars.

  • If Car A talks to Car B, you need Translator #1.
  • If Car A talks to Car C, you need Translator #2.
  • If Car B talks to Car C, you need Translator #3.

This is a nightmare. If a new car joins the convoy, you have to build a whole new set of translators. It's expensive, slow, and doesn't scale.
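The scaling gap is easy to make concrete: pairwise translators grow quadratically with the number of distinct sensor setups, while one adapter per car grows only linearly. A quick back-of-the-envelope sketch (the function names are ours, purely for illustration):

```python
def pairwise_translators(n: int) -> int:
    """Old approach: one custom translator per unordered pair of agents (n choose 2)."""
    return n * (n - 1) // 2

def gt_space_adapters(n: int) -> int:
    """GT-Space approach: one small adapter per agent."""
    return n

for n in (3, 5, 10):
    print(n, pairwise_translators(n), gt_space_adapters(n))
# With 10 sensor setups: 45 translators vs. just 10 adapters.
```

At 100 sensor setups the gap becomes 4,950 translators versus 100 adapters, which is why the pairwise approach can never keep up with a growing fleet.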

The New Solution: GT-Space (The "Universal Blueprint")

The paper introduces a new system called GT-Space. Instead of forcing the cars to learn each other's languages, they all agree to speak a Universal Language based on the "Ground Truth."

What is "Ground Truth"?
Imagine a teacher with the answer key. In this case, the "Ground Truth" is the perfect annotated map of where every car, pedestrian, and tree actually is, including their exact size and shape.

How GT-Space Works (The Analogy):

  1. The Universal Blueprint:
    The system creates a "Universal Blueprint" (the Common Feature Space). This blueprint isn't a photo or a dot cloud; it's a standardized grid that says, "Here is a car, 5 meters long, at this specific coordinate." It's the "truth" that everyone agrees on.

  2. The "Adapter" (The Translator):
    Instead of building a translator for every pair of cars, each car just needs one small adapter.

    • The Laser Car takes its dot cloud and uses its adapter to convert it into the "Universal Blueprint."
    • The Camera Car takes its photo and uses its adapter to convert it into the "Universal Blueprint."
    • Now, everyone is speaking the same language! They can all send their blueprints to a central hub.
  3. The Fusion Hub:
    A central computer (the Fusion Network) takes all these blueprints, combines them, and creates a super-clear picture of the road. Because everyone is speaking the same language, the computer doesn't get confused.

  4. The Secret Sauce: Contrastive Learning:
    To make sure the adapters work perfectly, the system uses a training trick called "Contrastive Learning."

    • Imagine a game of "Hot and Cold." The system tells the adapters: "If you are looking at the same car, your blueprints should look very similar (Hot). If you are looking at different cars, they should look very different (Cold)."
    • By playing this game with every possible combination of cars, the system learns to handle any mix of sensors, even ones it hasn't seen before.
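The steps above can be sketched in a few lines of numpy. Everything here is illustrative rather than the paper's actual architecture: the adapters are toy linear maps, "training" is faked by placing each adapter's output near a shared ground-truth embedding, and a simple InfoNCE-style loss plays the "hot and cold" game:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def adapter(features, W):
    """Toy linear adapter: project a sensor's features into the common space."""
    return features @ W

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Toy InfoNCE-style contrastive loss: low when the anchor is close to
    its positive ("hot") and far from the negatives ("cold")."""
    sims = np.array([cosine(anchor, positive)] +
                    [cosine(anchor, n) for n in negatives]) / temperature
    sims -= sims.max()                      # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return float(-np.log(probs[0]))

lidar_dim, camera_dim, common_dim = 64, 48, 32

# Shared ground-truth embedding of one object in the common space.
gt_car = rng.normal(size=common_dim)

# Two modalities with different feature sizes, each with its own adapter.
# Training is faked: we place each adapter's output near the GT target,
# as if the adapter had already learned to land there.
W_lidar = rng.normal(size=(lidar_dim, common_dim))
W_camera = rng.normal(size=(camera_dim, common_dim))
lidar_feat = rng.normal(size=lidar_dim)
camera_feat = rng.normal(size=camera_dim)

z_lidar = gt_car + 0.02 * adapter(lidar_feat, W_lidar)
z_camera = gt_car + 0.02 * adapter(camera_feat, W_camera)
z_other = rng.normal(size=common_dim)    # embedding of a different object

print("same object :", cosine(z_lidar, z_camera))   # high ("hot")
print("diff object :", cosine(z_lidar, z_other))    # lower ("cold")
print("contrastive loss:", info_nce(z_lidar, z_camera, [z_other]))
```

The key property is that the loss only depends on positions in the common space, not on which sensor produced them, which is what lets any mix of modalities train against the same target.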

Why is this a Big Deal?

  • Plug-and-Play: If a new type of car (say, a drone with a weird sensor) joins the group, you don't need to retrain the whole system. You just give the drone its own small adapter, and it instantly fits in.
  • Stronger Team: Even if one car has a bad camera or a weak sensor, the system can still work well because the "Universal Blueprint" acts as a strong guide. The good sensors help fix the bad ones.
  • No More Re-training: The old methods required retraining the cars' brains every time a new partner joined. GT-Space keeps the cars' brains frozen and only trains the tiny adapter. It's fast and efficient.
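The "frozen brain, trainable adapter" split can be sketched with a toy linear setup: the new agent's encoder stays fixed, and only a small adapter matrix is fit (here with a least-squares solve, purely for illustration) so the encoder's output lands on the shared ground-truth features. All names and shapes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
feat_dim, common_dim, n_samples = 16, 8, 200

# Frozen "brain" of a newly joined agent: its weights are never retrained.
W_encoder = 0.1 * rng.normal(size=(feat_dim, feat_dim))
def frozen_encoder(x):
    return np.tanh(x @ W_encoder)

# Stand-in for the shared ground-truth feature space: a fixed map from
# the scene to the features every agent is supposed to align with.
W_gt = rng.normal(size=(feat_dim, common_dim))

X = rng.normal(size=(n_samples, feat_dim))   # toy "scenes"
targets = X @ W_gt                           # GT features per scene
encoded = frozen_encoder(X)                  # new agent's raw output

# Train ONLY the adapter (a least-squares fit); the encoder stays frozen.
W_adapter, *_ = np.linalg.lstsq(encoded, targets, rcond=None)
aligned = encoded @ W_adapter

mse = float(np.mean((aligned - targets) ** 2))
baseline = float(np.mean(targets ** 2))      # error with no adapter at all
print(f"alignment MSE {mse:.3f} vs no-adapter baseline {baseline:.3f}")
```

Because only `W_adapter` is fit, onboarding a new sensor type costs one small optimization instead of retraining every agent's perception backbone.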

The Result

The authors tested this on simulated traffic and real-world data. They found that GT-Space detected cars and obstacles more accurately than previous methods, especially when the collaborating cars had very different sensors.

In short: GT-Space solves the "Tower of Babel" problem in self-driving cars. Instead of forcing everyone to learn every other language, it gives everyone a common dictionary (the Ground Truth Blueprint) and a simple translator (the Adapter), so the whole team can work together seamlessly.
