The Big Picture: Teaching a Robot to "See" Without a Teacher
Imagine you are trying to teach a robot to drive a car. To do this safely, the robot needs to understand the world in 3D. It has two main eyes:
- Cameras: They see colors, textures, and signs (like "Stop" or "Pedestrian Crossing").
- LiDAR: It shoots laser beams to measure exact distances and shapes (like a 3D map of the road).
Usually, to teach a robot, humans have to sit down and draw boxes around every car, person, and tree in thousands of hours of video and laser data. This is like hiring an army of teachers to grade every single homework assignment. It's expensive, slow, and boring.
CLAP is a new method that lets the robot teach itself without any human teachers (labels). It looks at the raw data and figures out the rules of the world on its own.
The Problem: The "Too Much Data" Bottleneck
In the past, researchers tried to teach the robot using a technique called Differentiable Rendering. Think of this as the robot trying to "paint" a picture of what it thinks the world looks like, then comparing its painting to the real photo to see where it made mistakes.
However, there was a huge problem: The data is too big.
Imagine trying to paint a massive mural where every single pixel and every single laser point is a tiny brushstroke. Even powerful graphics cards (GPUs) would run out of memory trying to process the whole thing at once.
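The "paint and compare" idea boils down to a reconstruction loss: render a guess, then measure how far it is from the real image. Here is a minimal sketch in Python; the function name and the mean-squared-error choice are illustrative, not the paper's actual renderer or loss:

```python
import numpy as np

def reconstruction_loss(rendered, real):
    """The 'paint and compare' step: score how far the robot's
    rendered image is from the real photo, as mean squared error.
    Training nudges the model to shrink this number toward zero."""
    return float(np.mean((rendered - real) ** 2))

# Toy 2x2 'images': a perfect painting scores 0, a slightly
# off painting scores a small positive error.
real = np.array([[0.2, 0.8], [0.5, 0.1]])
perfect = real.copy()
guess = real + 0.1

loss_perfect = reconstruction_loss(perfect, real)
loss_guess = reconstruction_loss(guess, real)
```

The memory problem comes from doing this comparison for every pixel and every laser point at full resolution, at every training step.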
Because of this, previous methods had to teach the "Camera Brain" and the "LiDAR Brain" separately.
- The Camera Brain learned about colors but didn't know about 3D shapes.
- The LiDAR Brain learned about shapes but didn't know about colors or text.
- The Result: They never learned to work together, missing out on the best parts of both.
The Solution: Enter CLAP
The authors created CLAP (Curvature sampLing and leArnable Prototype). It solves the problem with two clever tricks:
1. Curvature Sampling: "The Highlighter Pen"
Instead of trying to read every single word in a 1,000-page book (which takes forever), imagine using a highlighter pen to only mark the important words.
- The Old Way: The robot looked at the flat road and the side of a car with the same amount of attention. But the road is boring; it's just one smooth plane. The car, however, has curves, wheels, and windows.
- The CLAP Way: The robot calculates the "Curvature" (how curved a surface is).
- Flat Road: Low curvature = Low importance. The robot ignores most of it.
- Car/Tree: High curvature = High importance. The robot focuses its energy here.
- The Analogy: It's like a student studying for a test. Instead of reading the whole textbook, they use a highlighter to mark the complex diagrams and key terms. This saves memory and compute, which is what finally lets the robot process the camera and LiDAR data at the same time.
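The highlighter idea can be sketched in a few lines: estimate each point's curvature from its local neighborhood, then sample mostly the high-curvature points. Everything below (the function names, the PCA-based "surface variation" score, the weighted sampling scheme) is an illustrative approximation, not the paper's exact method:

```python
import numpy as np

def curvature_scores(points, k=16):
    """Estimate per-point curvature as 'surface variation': the
    smallest PCA eigenvalue of each point's k-nearest neighborhood,
    divided by the eigenvalue sum. Flat patches score near 0;
    edges, corners, and bumps score higher."""
    n = len(points)
    scores = np.empty(n)
    # Brute-force k-nearest neighbors (fine for a small demo).
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    for i in range(n):
        nbrs = points[np.argsort(d2[i])[:k]]
        eig = np.linalg.eigvalsh(np.cov(nbrs.T))  # ascending order
        scores[i] = eig[0] / max(eig.sum(), 1e-12)
    return scores

def curvature_sample(points, budget, k=16, seed=None):
    """Keep only `budget` points, drawn with probability proportional
    to curvature, so flat regions are mostly skipped and detailed
    geometry is kept."""
    rng = np.random.default_rng(seed)
    s = curvature_scores(points, k) + 1e-9   # avoid all-zero weights
    idx = rng.choice(len(points), size=budget, replace=False,
                     p=s / s.sum())
    return points[idx]

# Demo: a flat ground plane (boring) plus a small bumpy cluster
# (interesting). Sampling should strongly favor the bump.
rng = np.random.default_rng(0)
ground = np.c_[rng.uniform(0, 10, (200, 2)), np.zeros(200)]
bump = rng.normal([5.0, 5.0, 1.0], 0.3, (50, 3))
cloud = np.vstack([ground, bump])

kept = curvature_sample(cloud, budget=60, seed=0)
```

In the demo, the bump is only 20% of the cloud but ends up dominating the sampled budget, which is exactly the "highlighter" effect.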
2. Learnable Prototypes: "The Lego Buckets"
Now that the robot can look at both data sources together, how does it understand that a "red blob" in the camera and a "boxy shape" in the LiDAR are the same thing (a car)?
- The Old Way: They were in different rooms speaking different languages.
- The CLAP Way: They use Learnable Prototypes. Imagine a set of empty Lego buckets floating in the middle of the room.
- The robot tries to sort every piece of data (a pixel or a laser point) into one of these buckets.
- One bucket might become the "Car Bucket." Another might become the "Road Bucket." Another the "Tree Bucket."
- The robot learns to put the "red pixel" from the camera and the "boxy laser point" from the LiDAR into the same bucket.
- The Magic: By forcing them into the same bucket, the robot learns a common language. It realizes that "Red + Boxy Shape = Car."
To make sure these buckets don't all collapse into one giant, useless bucket (where everything looks the same), the researchers added a special rule (Gram Matrix Regularization) that forces the buckets to stay distinct, like keeping different colored Lego bins separate.
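Both ideas can be sketched together: a soft "bucket" assignment that works identically for camera pixels and LiDAR points, plus a Gram-matrix penalty that keeps the buckets distinct. All names and choices here (cosine similarity, the softmax temperature, the squared off-diagonal penalty) are illustrative assumptions, not the paper's exact loss:

```python
import numpy as np

def soft_assign(features, prototypes, tau=0.1):
    """Softly sort each feature vector into prototype 'buckets':
    cosine similarity to every prototype, turned into a probability
    distribution with a softmax. Works the same whether `features`
    came from the camera or the LiDAR."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = f @ p.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)      # shape (N, K)

def gram_penalty(prototypes):
    """Gram-matrix regularizer: penalize similarity between different
    prototypes so the buckets stay distinct instead of collapsing
    into one giant, useless bucket."""
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    gram = p @ p.T                               # shape (K, K)
    off_diagonal = gram - np.eye(len(p))
    return float((off_diagonal ** 2).sum())

rng = np.random.default_rng(0)
protos = rng.normal(size=(4, 8))       # 4 buckets, 8-dim features
cam_feat = rng.normal(size=(5, 8))     # e.g. pixel embeddings
lidar_feat = rng.normal(size=(5, 8))   # e.g. laser-point embeddings

a_cam = soft_assign(cam_feat, protos)
a_lidar = soft_assign(lidar_feat, protos)
penalty = gram_penalty(protos)
```

Because both modalities are scored against the same `protos`, matching a pixel and a laser point to the same bucket is what builds the shared "Red + Boxy Shape = Car" vocabulary; the penalty is zero only when the buckets are perfectly orthogonal (maximally distinct).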
The Results: Why It Matters
The researchers tested CLAP on real driving datasets (nuScenes and Waymo).
- The Score: When they tested the robot on a new task (finding cars) with very little data (only 5% of the usual training set), CLAP scored up to 100% better than the previous best methods; in other words, it roughly doubled their performance.
- The Analogy: If the old methods were like a student who studied alone and got a B, CLAP is the student who studied with a tutor, used a highlighter, and got an A+.
- Scaling: The more data they gave it, the smarter it got. This suggests that if we feed CLAP even more data in the future, it could become incredibly powerful.
Summary
CLAP is a new way to teach robots to drive without human teachers.
- It uses a Highlighter (Curvature Sampling) to ignore boring flat surfaces and focus on interesting shapes, saving computer memory.
- It uses Lego Buckets (Prototypes) to force the camera and laser data to agree on what objects are, creating a shared understanding.
The result is a robot that learns faster, understands the world better, and is ready for the future of autonomous driving.