Imagine you are teaching a robot to drive a car. To do this safely, the robot needs a perfect 3D map of the world around it, knowing exactly where the road is, where the trees are, and where the other cars are. This is called 3D Semantic Occupancy Prediction.
However, there's a big problem: the "teacher" giving the robot the map is unreliable.
The Problem: The "Glitchy" Teacher
In the real world, getting perfect 3D maps is hard. Sometimes the sensors get confused by rain, sometimes they get tricked by fast-moving cars leaving "ghost trails" (like a smear on a window), and sometimes the data just gets scrambled.
The researchers asked a scary question: "If the teacher is lying to us 90% of the time, can the student (the robot) still learn to drive?"
They found that if you take the standard methods used to teach robots and just feed them this bad data, the robot's brain completely breaks. It forgets what a car looks like, thinks a tree is a road, and the whole 3D map collapses into a mess. It's like trying to learn French from a dictionary where every entry has been randomly replaced with a word from a different language.
The Solution: DPR-Occ (The Smart Detective)
The authors created a new system called DPR-Occ. Instead of blindly trusting the noisy teacher or just trying to "ignore" the bad data, this system acts like a smart detective using two different sources of information to figure out the truth.
Here is how it works, using a simple analogy:
1. The Two Sources of Clues
Imagine you are trying to identify a blurry photo of an animal.
- Source A (The Memory Bank): You ask a wise, experienced teacher who has seen thousands of photos. Even if the current photo is blurry, the teacher remembers what similar animals usually look like. This is the EMA Teacher in the paper—a model that remembers past patterns.
- Source B (The Shape Matcher): You look at the shape of the object. Does it have four legs? Does it have a tail? You compare the shape to a mental library of animal shapes. This is the Prototype Affinity in the paper—matching the 3D shape to known categories.
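To make the two sources concrete, here is a minimal sketch reduced to a single voxel. The real model operates on dense 3D feature grids; the function names, the number of classes, and all numeric values below are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def ema_teacher_probs(history, momentum=0.9):
    """Source A: an EMA "memory" over past per-class predictions."""
    ema = history[0]
    for p in history[1:]:
        ema = momentum * ema + (1.0 - momentum) * p
    return ema / ema.sum()

def prototype_affinity(feature, prototypes):
    """Source B: cosine similarity between a voxel's feature and each class prototype."""
    f = feature / np.linalg.norm(feature)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return protos @ f  # one similarity score per class

# Toy example: 3 classes ("road", "car", "tree"), 4-dim features.
history = [np.array([0.6, 0.3, 0.1]), np.array([0.7, 0.2, 0.1])]
feature = np.array([1.0, 0.1, 0.0, 0.2])
prototypes = np.array([[0.9, 0.1, 0.0, 0.2],   # "road" prototype
                       [0.1, 1.0, 0.3, 0.0],   # "car" prototype
                       [0.0, 0.2, 1.0, 0.1]])  # "tree" prototype

probs = ema_teacher_probs(history)          # the memory bank's opinion
affinity = prototype_affinity(feature, prototypes)  # the shape matcher's opinion
```

Both sources independently rank "road" highest here; the next step is deciding what to do when they only partly agree.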
2. The "Maybe" List (Partial Labeling)
Instead of forcing the robot to guess "This is definitely a dog," the system creates a "Maybe List."
- It looks at the Memory Bank and says, "It looks like a dog or a wolf."
- It looks at the Shape Matcher and says, "It has the shape of a dog or a fox."
- It combines these to say, "Okay, it's probably a dog, but let's keep 'wolf' and 'fox' in the running just in case."
By keeping a small list of possibilities instead of a single, rigid guess, the system avoids getting tricked by the noise. If the noisy teacher says "This is a toaster," the system checks its list, sees that "toaster" isn't on the "Maybe List" based on shape and memory, and ignores the teacher's lie.
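The "Maybe List" idea above can be sketched as a candidate label set. The thresholds and the union rule below are illustrative assumptions; the key point is that the noisy teacher's label is only trusted if it survives both cues.

```python
import numpy as np

CLASSES = ["road", "car", "tree", "toaster"]

def maybe_list(teacher_probs, affinity, p_thresh=0.2, a_thresh=0.5):
    """Keep every class that either source considers plausible."""
    from_memory = {c for c, p in zip(CLASSES, teacher_probs) if p >= p_thresh}
    from_shape = {c for c, a in zip(CLASSES, affinity) if a >= a_thresh}
    return from_memory | from_shape

teacher_probs = np.array([0.55, 0.30, 0.10, 0.05])  # memory bank's opinion
affinity = np.array([0.90, 0.60, 0.20, 0.10])       # shape matcher's opinion

candidates = maybe_list(teacher_probs, affinity)  # {"road", "car"}
noisy_label = "toaster"
trust_label = noisy_label in candidates  # False: the "lie" is rejected
```

Because the supervision is a small set rather than a single hard pseudo-label, a wrong-but-plausible class can stay "in the running" without the system ever committing to an implausible one.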
3. The "Don't Do This" Rule (Negative Learning)
The system also learns what not to do. If the teacher says "This is a toaster," and the system knows for a fact it's not a toaster, it actively punishes the idea of it being a toaster. This helps clean up the confusion.
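A minimal sketch of that idea, using the standard negative-learning objective: instead of pulling the probability of a label up, push the probability of a class believed to be wrong down via -log(1 - p_wrong). How the paper weights this term is not shown here.

```python
import numpy as np

def negative_learning_loss(probs, wrong_class):
    """Penalize confidence in a class believed to be wrong."""
    # Small epsilon keeps the log finite if probs[wrong_class] is ~1.
    return -np.log(1.0 - probs[wrong_class] + 1e-12)

# The model currently leans toward "toaster" (index 3), which the
# Maybe List says is impossible for this voxel.
probs = np.array([0.2, 0.1, 0.1, 0.6])
loss = negative_learning_loss(probs, wrong_class=3)
# The loss grows as probs[3] approaches 1, so gradient descent on it
# actively suppresses the "toaster" hypothesis.
```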
The Results: Saving the Robot's Brain
The researchers tested this on a benchmark they built called OccNL (which is like a "Driving School for Robots with Bad Teachers").
- The Old Way: When the noise was high (90% of the labels were wrong), the old methods failed completely. The robot's map turned into static noise. It couldn't tell the difference between a road and the sky.
- The New Way (DPR-Occ): Even with 90% of the data being garbage, the robot still built a solid, safe map. It kept the roads straight and the cars in the right place.
Why This Matters
Think of it like this: If you are learning to drive in a city where the street signs are randomly changed every day, a normal student would crash. But a student with DPR-Occ would look at the road layout, remember where the traffic usually flows, and ignore the crazy signs.
This research proves that for robots to be safe in the real world (where data is always messy), they can't just memorize labels. They need to understand the structure of the world and use their memory to filter out the lies. This makes autonomous driving much safer and more reliable, even when the sensors aren't perfect.