GeoTeacher: Geometry-Guided Semi-Supervised 3D Object Detection

Imagine you are trying to teach a robot how to drive a car. To do this, you need to show it millions of pictures of roads, cars, and pedestrians, and tell it exactly what everything is. This is called labeled data. But here's the problem: labeling these pictures is like hiring a team of experts to draw a box around every single car in every photo. It takes forever and costs a fortune.

So, researchers came up with a clever idea: Semi-Supervised Learning. This is like hiring a few experts to label a small batch of photos, and then letting the robot learn from the rest of the unlabeled photos on its own. The robot makes a guess, and if it's confident enough, it treats that guess as a "truth" to learn from later.

However, there's a catch. When the robot only has a few labeled examples, it gets really good at recognizing what an object is (e.g., "That's a car"), but it gets confused about where it is and what shape it actually has. It's like knowing a dog is a dog, but not knowing if it's sitting, standing, or running, or if it's a tiny Chihuahua or a giant Great Dane.

Enter GeoTeacher, the new method described in this paper. Think of GeoTeacher as a "Geometry Sensei" for the robot.

The Two Main Tricks of GeoTeacher

GeoTeacher helps the robot learn better by focusing on geometry (the shape and structure of things) using two main tricks:

1. The "Keypoint Connection Game" (Geometric Relation Supervision)

Imagine you are teaching a child to recognize a human face. You don't just say "that's a face." You point out the relationships: "The eyes are above the nose, the mouth is below the nose, and the ears are on the sides."

GeoTeacher does this for 3D objects.

The Teacher: A smart, pre-trained robot (the "Teacher") looks at an object (like a car) and picks out special points: the center, the corners, and the middle of the edges.
The Student: The learning robot (the "Student") tries to do the same.
The Lesson: Instead of just copying the final answer, the Teacher forces the Student to understand the relationships between those points. "If the center point is here, the corner point must be there."
Why it helps: Even if the Teacher isn't 100% sure about the car's location (because the data is messy), it can still teach the Student the shape of the car. This helps the Student understand the object's internal structure, not just its surface.

2. The "Distance-Dependent Shaking" (Voxel-wise Data Augmentation)

Imagine you have a box of LEGO bricks representing a car. To make the robot smarter, you want to show it cars that are broken, missing pieces, or partially hidden. This is called "data augmentation."

The Problem: If you shake the LEGO box too hard, you might break the car completely, and the robot won't learn anything. Also, if you shake a car that is far away (which is already blurry and made of few LEGO bricks), you might destroy it entirely.
The Solution: GeoTeacher uses a "Distance-Decay" strategy.
- Nearby Objects: If a car is close to the robot, the system "shakes" it a lot. It removes some bricks or rearranges them to simulate a car that is partially hidden or damaged. This forces the robot to learn how to recognize a car even when it looks weird.
- Distant Objects: If a car is far away, the system is gentle. It barely touches it. This is because distant objects are already sparse (made of few points), and messing with them too much would make them unrecognizable.

Why This is a Big Deal

Most previous methods tried to make the robot's "brain" (its internal features) look similar to the Teacher's brain. GeoTeacher takes a different view: for 3D detection, what matters most is the shape and structure of the objects themselves.

By teaching the robot to understand the geometry (the skeleton of the object) and by practicing on messy, broken versions of objects (but only when they are close), GeoTeacher allows the robot to become a much better driver with far less labeled data.

The Results

The researchers tested this on two massive datasets (ONCE and Waymo), which are like giant libraries of driving scenes.

The Outcome: GeoTeacher beat the current best methods. It found more cars, pedestrians, and cyclists, especially in tricky situations where objects were far away or partially hidden.
The Analogy: If other methods were like a student who memorized the answers to a test, GeoTeacher is like a student who actually understands the subject matter. Even if the test questions are weird or the data is messy, GeoTeacher can still figure it out.

In a Nutshell

GeoTeacher is a new way to teach robots to see 3D objects. It acts like a geometry tutor, teaching the robot not just what things are, but how they are shaped and how their parts fit together. It also practices with "broken" versions of objects to make the robot tougher, but it's smart enough to know when to be gentle with distant objects. The result? A robot that learns faster, needs fewer expensive labels, and drives more safely.

1. Problem Statement

Semi-Supervised 3D Object Detection (SS3D) aims to leverage large amounts of unlabeled LiDAR data to improve detection performance when labeled data is scarce. While existing SS3D methods utilize teacher-student frameworks to generate pseudo-labels or enforce feature-level consistency, they suffer from a critical limitation:

Neglect of Geometric Information: Current methods often overlook the inherent geometric relations within objects. With limited labeled data, models struggle to capture the internal structural geometry (e.g., relative positions of parts, orientation, and boundaries) of objects.
Data Diversity Issues: Standard augmentation often treats the entire scene uniformly, failing to specifically address the geometric diversity of individual objects or the sparsity of distant objects, leading to poor generalization on unseen structures.
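
For context, the pseudo-labeling step shared by the teacher-student methods mentioned above can be sketched as follows. This is only an illustration of the general idea, not code from the paper: the `Detection` type, the `select_pseudo_labels` helper, the toy detections, and the 0.7 threshold are all assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # predicted class name
    score: float  # teacher confidence in [0, 1]
    box: tuple    # (x, y, z, length, width, height, yaw)


def select_pseudo_labels(detections, threshold=0.7):
    """Keep only teacher detections confident enough to act as training targets.

    The threshold value is a hypothetical choice for this sketch.
    """
    return [d for d in detections if d.score >= threshold]


# Toy teacher predictions on one unlabeled scene (made-up values).
teacher_output = [
    Detection("car", 0.92, (10.0, 2.0, 0.0, 4.5, 1.8, 1.6, 0.1)),
    Detection("pedestrian", 0.41, (25.0, -3.0, 0.0, 0.8, 0.7, 1.7, 0.0)),
    Detection("cyclist", 0.78, (40.0, 5.0, 0.0, 1.8, 0.6, 1.7, 1.5)),
]

pseudo_labels = select_pseudo_labels(teacher_output)
print([d.label for d in pseudo_labels])
```

Only the confident car and cyclist survive the filter. Nothing in this step constrains how well the surviving boxes capture object shape, which is the gap the summary describes.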

2. Methodology: GeoTeacher

The authors propose GeoTeacher, a novel framework that guides the student model to learn geometric information from both data augmentation and supervision perspectives. The framework operates in two phases: training a high-performance teacher and then using it to supervise the student via two core modules.

A. Geometric Relation Supervision (GRS)

This module transfers knowledge of object geometry from the teacher to the student.

Keypoint Selection: Instead of relying solely on global features, the method selects specific keypoints from each object's 2D bounding box in the Bird's-Eye-View (BEV) projection:
- Center Points: Stable references for localization.
- Edge Midpoints: Capture orientation and spatial extent.
- Corner Points: Encode fine-grained boundary information.
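
The keypoint layout listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `bev_keypoints`, the box parameterization (center, length, width, yaw), and the ordering of the points are assumptions made here.

```python
import math


def bev_keypoints(cx, cy, length, width, yaw):
    """Return the 9 BEV keypoints of a rotated box.

    Output is a dict with one center point, four edge midpoints,
    and four corner points, each as an (x, y) tuple.
    """
    hl, hw = length / 2.0, width / 2.0
    # Offsets in the box's own frame (x along the length, y along the width).
    local = {
        "center": [(0.0, 0.0)],
        "edge_midpoints": [(hl, 0.0), (0.0, hw), (-hl, 0.0), (0.0, -hw)],
        "corners": [(hl, hw), (-hl, hw), (-hl, -hw), (hl, -hw)],
    }
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)

    def to_scene(dx, dy):
        # Rotate the offset by yaw, then translate to the box center.
        return (cx + dx * cos_y - dy * sin_y, cy + dx * sin_y + dy * cos_y)

    return {name: [to_scene(dx, dy) for dx, dy in pts]
            for name, pts in local.items()}


keypoints = bev_keypoints(cx=10.0, cy=2.0, length=4.0, width=2.0, yaw=0.0)
for name, pts in keypoints.items():
    print(name, pts)
```

In a detector, these coordinates would be used to look up feature vectors on the BEV feature map; how that lookup is done is not specified in this summary.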
Relation Modeling: The method computes a geometric relation matrix by measuring the cosine similarity between feature representations of these keypoints. This matrix captures high-order structural dependencies (e.g., the relationship between a car's front corner and its center).
Loss Function: The student is trained to mimic the teacher's geometric relation matrix. To handle unreliable pseudo-labels, a confidence-aware weighting mechanism is applied, where the loss is weighted by the teacher's classification scores.
- Formula: $L_{GRS} = \sum_k s_k \cdot L_{\delta}$, where $s_k$ is the score of the $k$-th pseudo-label and $L_{\delta}$ is the L1 loss between the student and teacher relation matrices.
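
A rough sketch of the relation matrix and the confidence-weighted loss described above, in plain Python. It assumes each object comes with one feature vector per keypoint, and it averages the absolute differences over matrix entries; the reduction, the helper names, and the toy features are assumptions, not details confirmed by the source.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, 1e-8)  # guard against zero-length vectors


def relation_matrix(keypoint_feats):
    """Pairwise cosine similarities between all keypoint features of one object."""
    return [[cosine_similarity(u, v) for v in keypoint_feats]
            for u in keypoint_feats]


def l1_distance(m_student, m_teacher):
    """Mean absolute difference between two same-shaped matrices (assumed reduction)."""
    diffs = [abs(s - t)
             for row_s, row_t in zip(m_student, m_teacher)
             for s, t in zip(row_s, row_t)]
    return sum(diffs) / len(diffs)


def grs_loss(student_feats, teacher_feats, scores):
    """Sum over objects k of s_k times the L1 gap between relation matrices."""
    total = 0.0
    for s_feats, t_feats, s_k in zip(student_feats, teacher_feats, scores):
        total += s_k * l1_distance(relation_matrix(s_feats),
                                   relation_matrix(t_feats))
    return total


# One toy object with three keypoints and 2-D features (made-up values).
teacher = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
student = [[[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]]
print(grs_loss(student, teacher, scores=[0.9]))
```

Because the loss compares similarity patterns rather than raw features, a student whose features are rescaled relative to the teacher's can still reach zero loss. A low-scoring pseudo-label contributes proportionally less.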

B. Distance-Decay Voxel-wise Data Augmentation (DVA)

This strategy increases the geometric diversity of objects during training, specifically addressing the challenge of occlusion and sparsity.

Voxel-wise Decomposition: Unlike scene-level augmentation, DVA decomposes individual objects into small voxels ($n_l \times n_w \times n_h$).
Augmentation Operations:
- Sparsify: Randomly samples points within voxels to simulate sparse point clouds.
- Ordered Dropout: Removes points in a spatial sequence (clockwise/counterclockwise) to simulate occluded surfaces.
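
The two operations above might look roughly like this in plain Python. The sketch makes several assumptions: points are given in the object's local frame with the origin at the box center, $n_l \times n_w \times n_h$ is read as the number of voxels along each box axis, and "ordered" is taken to mean an angular sweep over voxel centers in the BEV plane. All function names and parameters are hypothetical.

```python
import math
import random


def voxel_index(point, box_size, grid):
    """Map a local-frame point to its (l, w, h) voxel index inside the box."""
    idx = []
    for coord, size, n in zip(point, box_size, grid):
        cell = int((coord + size / 2.0) / size * n)
        idx.append(min(max(cell, 0), n - 1))  # clamp points on the boundary
    return tuple(idx)


def group_by_voxel(points, box_size, grid):
    """Bucket an object's points by voxel index."""
    voxels = {}
    for p in points:
        voxels.setdefault(voxel_index(p, box_size, grid), []).append(p)
    return voxels


def sparsify(points, box_size, grid, keep_ratio=0.5, rng=random):
    """Randomly keep a fraction of the points inside every occupied voxel."""
    kept = []
    for pts in group_by_voxel(points, box_size, grid).values():
        n_keep = max(1, round(len(pts) * keep_ratio))
        kept.extend(rng.sample(pts, n_keep))
    return kept


def ordered_dropout(points, box_size, grid, n_drop, clockwise=False):
    """Remove the first n_drop occupied voxels met in an angular sweep."""
    voxels = group_by_voxel(points, box_size, grid)

    def angle(idx):
        # BEV angle of the voxel center around the box center, in [0, 2*pi).
        vx = ((idx[0] + 0.5) / grid[0] - 0.5) * box_size[0]
        vy = ((idx[1] + 0.5) / grid[1] - 0.5) * box_size[1]
        return math.atan2(vy, vx) % (2 * math.pi)

    order = sorted(voxels, key=angle, reverse=clockwise)
    dropped = set(order[:n_drop])
    return [p for idx, pts in voxels.items() if idx not in dropped for p in pts]


box_size, grid = (4.0, 2.0, 2.0), (2, 2, 1)
car = [(1.0, 0.5, 0.0), (-1.0, 0.5, 0.0), (-1.0, -0.5, 0.0), (1.0, -0.5, 0.0)]
print(ordered_dropout(car, box_size, grid, n_drop=1))
print(len(sparsify(car * 4, box_size, grid, rng=random.Random(0))))
```

Sparsify thins every part of the object evenly, while ordered dropout removes contiguous regions, which is closer to what an occluder does to a LiDAR scan.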
Distance-Decay Mechanism: To prevent the destruction of geometry for distant objects (which are already sparse and hard to detect), the probability $p$ of applying augmentation decays exponentially with distance from the sensor:
- Formula: $p = c \cdot \exp(-\frac{\sqrt{i^2+j^2}}{d_{range}})$.
- This ensures aggressive augmentation for nearby objects while preserving the integrity of distant ones.
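
To make the decay concrete, here is a small sketch of the probability formula. It assumes $(i, j)$ are an object's horizontal coordinates in meters with the sensor at the origin, and it uses placeholder values for $c$ and $d_{range}$; neither the values nor the helper names come from the source.

```python
import math
import random


def augmentation_probability(i, j, c=1.0, d_range=75.0):
    """p = c * exp(-sqrt(i^2 + j^2) / d_range).

    c and d_range defaults are placeholders chosen for this sketch.
    """
    return c * math.exp(-math.hypot(i, j) / d_range)


def should_augment(i, j, rng=random, **kwargs):
    """Draw one Bernoulli sample with the distance-dependent probability."""
    return rng.random() < augmentation_probability(i, j, **kwargs)


# Objects at increasing range from the sensor (made-up positions).
for position in [(5.0, 0.0), (30.0, 10.0), (60.0, 45.0)]:
    distance = math.hypot(*position)
    p = augmentation_probability(*position)
    print(f"distance={distance:.1f} m  p={p:.3f}")
```

Under these assumptions an object at the sensor is augmented with probability $c$, and one at range $d_{range}$ with probability $c/e$, about $0.37c$, so far objects are left untouched most of the time.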

3. Key Contributions

Novel SS3D Framework: Introduction of GeoTeacher, which explicitly guides student models to learn intrinsic geometric relations, complementing existing feature-level or pseudo-label approaches.
Geometric Relation Supervision: A new module that models high-order geometric dependencies between keypoints (center, edge, corner) to improve the student's understanding of object structure.
Distance-Decay Voxel-wise Augmentation: A plug-and-play data augmentation strategy that enhances geometric diversity while adaptively protecting the detectability of distant objects.
State-of-the-Art Performance: The method is modular and, when combined with existing SS3D baselines (e.g., ProficientTeacher, PTPM), boosts their performance to state-of-the-art results.

4. Experimental Results

The method was evaluated on the ONCE and Waymo Open datasets.

ONCE Dataset:
- Combined with ProficientTeacher, GeoTeacher achieved 63.16 mAP (Large protocol), outperforming the baseline by +1.76 mAP.
- Combined with PTPM, it reached 65.70 mAP, surpassing the baseline by +3.02 mAP.
- Notably, GeoTeacher + PTPM on the Small protocol (100k unlabeled) achieved performance comparable to PTPM on the Large protocol (1M unlabeled), demonstrating superior data efficiency.
Waymo Open Dataset:
- Under the 5% label regime, GeoTeacher + PTPM improved AP by +0.92 and APH by +0.81 over the baseline.
- The method even outperformed an "Oracle Model" (trained on full data) in certain configurations, highlighting the power of geometric learning.
Generalization: The approach showed consistent improvements when applied to different detector architectures (PV-RCNN, CenterPoint), indicating that the gains do not depend on a particular detector.
Ablation Studies:
- Both GRS and DVA contributed independently to performance gains.
- GRS outperformed other feature-level distillation methods (e.g., NoiseDet, SOOD) by at least 0.58 mAP, confirming that geometric relations provide richer inductive biases than low-level feature similarity.
- The distance-decay mechanism significantly improved performance in the "50m-inf" range compared to standard augmentation.

5. Significance

GeoTeacher addresses a fundamental gap in semi-supervised 3D detection: the lack of structural understanding. By shifting focus from mere feature consistency to geometric relation modeling, the method enables models to "understand" the shape and structure of objects even with very few labeled examples. The distance-decay augmentation further demonstrates a sophisticated approach to handling the varying density of LiDAR point clouds. This work sets a new state-of-the-art for SS3D and offers a generalizable strategy for improving 3D perception in data-scarce autonomous driving scenarios.