Imagine you are teaching a self-driving car how to see the world.
The Problem: The "Strict Librarian"
Currently, most self-driving cars are trained like strict librarians. They have a catalog of every book (object) they are allowed to recognize: Cars, Pedestrians, Bicycles.
If the librarian sees a book titled "The Great Gatsby," they know exactly what it is. But show them something that isn't in the catalog, like "The Great Gator" (say, a giant alligator on the road) or a weird, unknown construction vehicle, and they get confused. Because they were only trained on a specific list, they might ignore the alligator entirely, or worse, try to force it into the "Car" category. This is dangerous: in the real world, you can't predict every weird thing that might appear on the road.
This paper introduces a new system called OS-Det3D that teaches the car to be a curious explorer instead of a strict librarian. It wants the car to say, "I don't know what that is, but I definitely see something there, and I should slow down."
The Solution: A Two-Stage Detective Team
The authors built a two-step training process to teach the car's camera-based detector how to spot these "unknowns."
Stage 1: The "Shape-Shifter" (ODN3D)
First, they use a special helper network called ODN3D. Think of this helper as a geometric detective who only looks at shapes and sizes, ignoring what things look like (color, texture, brand).
- How it works: Usually, AI learns by looking at pictures of "Cars" and "Trucks." If it sees a "Bus," it might think, "That's not a car or a truck, so it must be background noise."
- The Trick: This new detective uses data from LiDAR (a laser scanner that measures 3D distance) to find any object that looks like a solid 3D box, regardless of whether it's a car, a cow, or a pile of trash. It ignores the "name tag" and just asks, "Is there a solid object here?"
- The Result: It generates a list of "suspects" (object proposals). However, because it's so open-minded, it sometimes gets confused by shadows or noise, creating a list with some "fake suspects" (false alarms).
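To make the geometric-detective idea concrete, here is a toy sketch of class-agnostic proposal generation. It is not the paper's ODN3D network; the clustering input, the density-based "objectness" score, and all names are illustrative assumptions. The point it shows is the attitude of Stage 1: no class labels anywhere, just "is there a solid blob of LiDAR points here?"

```python
# Toy sketch in the spirit of ODN3D: score clusters of LiDAR points purely
# by geometry, with no class labels involved. The density-based objectness
# heuristic and thresholds are illustrative assumptions, not the paper's.
import numpy as np

def propose_objects(points, clusters, min_points=10):
    """For each cluster of LiDAR points, fit an axis-aligned 3D box and
    score it by how densely the points fill the box, ignoring class."""
    proposals = []
    for idx in clusters:
        pts = points[idx]
        if len(pts) < min_points:
            continue  # too sparse to be a solid object (likely noise)
        lo, hi = pts.min(axis=0), pts.max(axis=0)
        volume = np.prod(np.maximum(hi - lo, 1e-3))
        # Crude geometric objectness: point density inside the box, capped at 1.
        objectness = min(1.0, len(pts) / (50.0 * volume))
        proposals.append({"box": (lo, hi), "objectness": objectness})
    return proposals
```

Note how the sparse cluster is dropped but everything dense survives, whether it would turn out to be a car, a cow, or a pile of trash; that open-mindedness is exactly why Stage 1 alone produces false alarms that Stage 2 must filter.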
Stage 2: The "Smart Filter" (Joint Selection)
Now we have a list of suspects, but it's messy. We need to clean it up before teaching the main camera system. This is where the Joint Selection Module comes in. Think of this as a smart filter or a quality control inspector.
- The Problem: If we just take the "Shape-Shifter's" list and tell the camera, "These are all new objects," the camera might learn to recognize shadows as monsters.
- The Solution: The inspector looks at the list from two angles:
- The 3D Score: "Does this look like a solid object in 3D space?" (From Stage 1).
- The Camera Score: "Does this look like a car or a pedestrian that I already know?" (From the camera's visual features).
- The Magic: The inspector picks the items that have a high 3D score (it's definitely an object) but a low camera score (it doesn't look like anything I've seen before).
- Analogy: Imagine you are looking for a new type of fruit. You pick up a round, heavy object (High 3D score). You look at it, and it doesn't look like an apple, orange, or banana (Low "known" score). Bingo! That's your new fruit.
- The Outcome: These "clean" unknown objects become Pseudo-Ground Truth. They are treated as "real" examples of new objects to teach the camera.
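The inspector's rule above ("high 3D score, low known-class score") is simple enough to sketch directly. This is a hedged toy version of the joint selection idea, not the paper's actual module: the field names and both thresholds are made-up assumptions for illustration.

```python
# Toy sketch of joint selection: keep proposals that look solid in 3D
# but don't resemble any known class to the camera. Thresholds and
# dictionary field names are illustrative assumptions.

def select_unknowns(proposals, geo_thresh=0.7, known_thresh=0.3):
    """Return pseudo-ground-truth 'unknown' objects: definitely something
    there in 3D, but unlike anything the camera already recognizes."""
    pseudo_gt = []
    for p in proposals:
        looks_solid = p["objectness_3d"] >= geo_thresh        # high 3D score
        looks_known = max(p["known_scores"]) >= known_thresh  # high camera score
        if looks_solid and not looks_known:
            pseudo_gt.append({**p, "label": "unknown"})
    return pseudo_gt
```

Run against three suspects, only the middle case (solid object, unfamiliar appearance) survives: a confident known car fails the "low camera score" test, and a shadow fails the "high 3D score" test. The survivors become the pseudo-ground truth used to train the camera.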
The Final Result: A Smarter Driver
After this two-stage training, the car's camera system becomes a hybrid expert:
- It still knows all the usual suspects (Cars, Pedestrians) perfectly.
- It can now spot the weird stuff (Unknown trucks, debris, strange animals) and label them as "Unknown Object," alerting the driver to be careful.
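The hybrid behavior can be sketched as a small decision rule at inference time. This is an illustrative assumption about how the output might be read off, not the paper's actual head: the class list, score inputs, and threshold are invented for the example.

```python
# Hedged sketch of the hybrid output: each detection is reported either as
# a known class or flagged "Unknown Object". The class list, score format,
# and threshold are illustrative assumptions.

KNOWN_CLASSES = ["car", "pedestrian", "bicycle"]

def label_detection(class_scores, unknown_score, thresh=0.5):
    """Report the best known class if confident; otherwise fall back to
    the unknown flag; otherwise report nothing."""
    best = max(range(len(KNOWN_CLASSES)), key=lambda i: class_scores[i])
    if class_scores[best] >= thresh:
        return KNOWN_CLASSES[best]
    if unknown_score >= thresh:
        return "Unknown Object"  # seen, but not recognized: slow down
    return None  # nothing confident at this location
```

The key property is the middle branch: low confidence on every known class no longer means "ignore it" when the unknown-objectness score says something is there.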
Why This Matters
In the past, if a self-driving car saw something it wasn't trained on, it might drive right into it because it didn't "see" it. With OS-Det3D, the car admits, "I don't know what that is, but I see it."
This is a huge leap forward for safety. It moves self-driving cars from being rigid rule-followers to adaptive, safety-conscious observers that can handle the messy, unpredictable reality of the real world.
Summary Analogy
- Old System: A security guard who only stops people wearing a "Red Hat." If someone wears a "Blue Hat," the guard ignores them completely.
- New System (OS-Det3D): A security guard who first uses a metal detector (LiDAR) to find anyone carrying something heavy. Then, they check if the person looks like a known criminal. If they carry something heavy but don't look like a known criminal, the guard stops them and says, "I don't know who you are, but you're suspicious. Let's investigate."