FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models

Imagine you are driving a car, but instead of having a high-tech 3D scanner (LiDAR) on the roof, you only have standard cameras. Your goal is to build an accurate 3D "digital twin" of the road ahead, knowing where the cars, pedestrians, and trees are, and even distinguishing between two different red cars driving side-by-side.

Usually, teaching a computer to do this requires showing it millions of labeled 3D examples, which is expensive and slow. If you take that trained computer to a new city with different street signs or lighting, it often gets confused.

FreeOcc is a new "magic trick" that solves this without any training. Here is how it works, using simple analogies:

1. The "No-Training" Philosophy

Think of traditional AI like a student who spends years memorizing flashcards of every possible street scene. If they see a street they've never seen before, they might fail.

FreeOcc is like a super-smart detective who already knows everything about the world. Instead of memorizing specific streets, it uses two powerful "foundation models" (pre-trained AI giants) that already understand how the world looks and how 3D space works. It doesn't need to study the new city; it just applies its general knowledge immediately.

2. The Two-Brain System

FreeOcc uses two specialized "brains" working together:

The "What" Brain (The Semantic Branch):
Imagine a super-artist who can look at a photo and instantly draw outlines around everything: "That's a car," "That's a pedestrian," "That's grass."
FreeOcc uses a tool called SAM3. You are not limited to a fixed list of categories: you can tell it, "Show me the grass," or "Show me the buildings." It listens to these text prompts and draws masks (outlines) around those objects in every camera view. It's like having a painter who follows your verbal instructions.
The "Where" Brain (The Geometric Branch):
Imagine a 3D architect who looks at a flat photo and instantly knows how far away every pixel is.
FreeOcc uses a tool called MapAnything. It takes the 2D photos and turns them into a cloud of 3D points, giving every dot a distance and a confidence score (how sure it is about that distance).

3. The Assembly Line (Putting it together)

Now, FreeOcc has a list of "what" (masks) and a list of "where" (3D points). Here is the assembly process:

Lifting: It takes the 2D outlines from the "What" brain and sticks them onto the 3D points from the "Where" brain. Now, every 3D point knows it is part of a "car" or a "tree."
Filtering: Just like a sieve, it shakes out the shaky points. If the "Where" brain isn't confident about a distance, or if a point is too far away, it gets thrown out. Only the reliable points stay.
Time Travel (Fusion): It looks at the last few seconds of video. If a car was seen from the left camera, then the front camera, then the right camera, it stitches all those views together into one solid 3D object.
The "Ghost" Buster (Instance Identification): This is the tricky part. Sometimes, the system might think one car is two cars because of a weird angle. FreeOcc has a special module that looks at the current view, fits a 3D box around the object, and says, "Nope, that's just one car." It merges duplicates and fixes the labels.
The Voxel Grid: Finally, it dumps all these clean 3D points into a giant 3D grid (like a giant Rubik's cube made of tiny blocks). Each block is painted with the correct color (semantic label) and given a unique ID (instance ID).

4. Why is this a Big Deal?

Instant Deployment: You can take this system to a completely new country, turn it on, and it works immediately. No "training phase" needed.
The "Teacher" Role: Even if you do want to train a fast, real-time AI for a self-driving car later, FreeOcc acts as a capable teacher. It generates "homework" (pseudo-labels) that helps other models learn better than they did with earlier label-free teachers.
Panoptic Power: Most systems can tell you "there is a car." FreeOcc can tell you "there is Car A and Car B, and they are different." It handles the complex task of counting and distinguishing individual objects in 3D space without ever seeing a 3D training example.

The Catch

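The main downside is speed. Because it runs these massive "foundation models" on the fly, it's slower than a specialized, pre-trained network. It's like asking a brilliant generalist to work out every answer from scratch versus using a calculator that was pre-programmed for that specific equation: the generalist can handle anything, but takes longer. Its accuracy, especially at telling individual objects apart, also still trails systems trained on labeled 3D data.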

In summary: FreeOcc is a "plug-and-play" 3D vision system. It uses the general knowledge of giant AI models to build a 3D map of the road, labeling every object and its identity, all without needing to learn the specific road it's driving on first.

1. Problem Statement

Autonomous driving and road infrastructure analysis require a dense 3D understanding of the environment. While LiDAR provides accurate geometry, it is costly and not always available, making camera-only 3D occupancy prediction a critical scalable alternative.

The Challenge: Recovering metric 3D structure from RGB images is inherently ambiguous due to depth uncertainty, occlusions, and dynamic objects.
Current Limitations: State-of-the-art methods typically rely on dense 3D supervision (LiDAR annotations), which is expensive to acquire and limits deployment to unseen domains or new semantic taxonomies.
Existing Weakly Supervised Approaches: Recent methods use foundation models to generate pseudo-labels for training downstream networks. However, these still require a training phase on target-domain data and often focus only on semantic occupancy, neglecting panoptic (instance-level) prediction.

Goal: The authors propose FreeOcc, a pipeline that performs training-free semantic and panoptic occupancy prediction directly at inference time, leveraging pretrained foundation models without requiring target-domain data or model optimization.

2. Methodology

FreeOcc is a multi-stage pipeline that transforms multi-view camera images and poses into a 3D panoptic occupancy grid. It consists of two main branches (Semantic and Geometric) followed by fusion, instance identification, and refinement.
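
The fusion step relies on camera poses to bring points observed in different frames into one common coordinate frame. The following is a minimal sketch of that idea, not the authors' implementation: it assumes each frame comes with a 4x4 homogeneous `world_from_sensor` pose, and the names `fuse_frames` and `target_from_world` are hypothetical.

```python
import numpy as np

def to_homogeneous(points):
    """Append a column of ones so 4x4 poses can be applied to (N, 3) points."""
    return np.hstack([points, np.ones((points.shape[0], 1))])

def fuse_frames(frames, target_from_world):
    """Bring per-frame point clouds into one target frame and stack them.

    frames: list of (points, world_from_sensor) pairs, where points is (N, 3)
            in the sensor frame and world_from_sensor is a 4x4 pose (assumed).
    target_from_world: 4x4 transform from world to the current ego frame.
    """
    fused = []
    for points, world_from_sensor in frames:
        target_from_sensor = target_from_world @ world_from_sensor
        transformed = (target_from_sensor @ to_homogeneous(points).T).T[:, :3]
        fused.append(transformed)
    return np.vstack(fused)

# Two sensors observe the same world point from different positions.
pose_a = np.eye(4)
pose_b = np.eye(4)
pose_b[:3, 3] = [1.0, 0.0, 0.0]       # sensor B sits 1 m along x in the world
pts_a = np.array([[2.0, 0.0, 0.0]])   # the point as seen from sensor A
pts_b = np.array([[1.0, 0.0, 0.0]])   # the same point as seen from sensor B
fused = fuse_frames([(pts_a, pose_a), (pts_b, pose_b)], np.eye(4))
```

With correct poses both observations land on the same 3D location. An error in either pose would pull them apart, which is one way to read the pose-dependency ablation reported later.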

A. Semantic Branch (2D Priors)

Model: Uses SAM3 (Segment Anything Model 3), a promptable foundation segmentation model.
Prompting Strategy: Instead of using raw class names (which often fail), the system uses a handcrafted set of synonyms (e.g., using "grass" and "dirt" to prompt for "terrain").
Mask Fusion: For each view, multiple mask candidates are generated per prompt. The system fuses these by selecting the highest-scoring candidate for each pixel.
Taxonomy Mapping: A rule-based system remaps the specific prompt labels (e.g., "building") to the target taxonomy (e.g., "manmade"). It also handles class conflicts (e.g., "road" vs. "lane marking") using spatial "over/under" rules.
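
The prompting, per-pixel mask fusion, and taxonomy remapping steps above can be sketched as follows. This is an illustration under assumptions, not SAM3's API: the masks and scores are assumed to be already available as arrays, the `PROMPTS` table is a hypothetical example built from the two mappings mentioned above, and the spatial "over/under" conflict rules are not modeled.

```python
import numpy as np

# Hypothetical prompt table: target class -> synonyms used as text prompts.
PROMPTS = {
    "terrain": ["grass", "dirt"],
    "manmade": ["building"],
}
# Inverted view, used to remap each prompt label to its target class.
PROMPT_TO_CLASS = {p: c for c, prompts in PROMPTS.items() for p in prompts}

def fuse_masks(masks, scores, prompt_labels, unlabeled="unlabeled"):
    """Per pixel, keep the target class of the highest-scoring mask covering it.

    masks:         (K, H, W) boolean mask candidates
    scores:        (K,) one confidence score per candidate
    prompt_labels: list of K prompt strings, one per candidate
    Returns an (H, W) object array of target-class names.
    """
    masks = np.asarray(masks, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    # Score where a mask covers the pixel, -inf elsewhere, so argmax ignores misses.
    per_pixel = np.where(masks, scores[:, None, None], -np.inf)
    best = per_pixel.argmax(axis=0)
    classes = np.array([PROMPT_TO_CLASS[p] for p in prompt_labels], dtype=object)
    fused = classes[best]
    fused[~masks.any(axis=0)] = unlabeled  # pixels no candidate covers
    return fused

masks = np.array([
    [[1, 1], [0, 0]],   # "grass"
    [[0, 1], [1, 0]],   # "dirt"
    [[0, 1], [1, 0]],   # "building"
], dtype=bool)
fused = fuse_masks(masks, [0.9, 0.6, 0.8], ["grass", "dirt", "building"])
```

In this toy example the bottom-left pixel is claimed by both "dirt" (0.6) and "building" (0.8), so it ends up as "manmade"; the bottom-right pixel is covered by no candidate and stays unlabeled.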

B. Geometric Branch (3D Reconstruction)

Model: Uses MapAnything, a foundation model for metric 3D reconstruction.
Output: Generates dense per-pixel 3D points, depth maps, and confidence maps.
Reliability Filtering: Points are filtered based on:
- Depth thresholds: $d_{\min}$ and $d_{\max}$.
- Confidence thresholds: Log-scaled confidence scores are used to discard unreliable depth estimates.
Label Transfer: Valid 3D points inherit the semantic and instance labels from the Semantic Branch.
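
A minimal sketch of the reliability filtering and label transfer described above, assuming the geometric model's outputs are available as per-pixel arrays. The function name and the threshold values (`d_min`, `d_max`, `log_conf_min`) are placeholders for illustration, not values from the paper.

```python
import numpy as np

def filter_and_label(points, depth, conf, labels,
                     d_min=1.0, d_max=40.0, log_conf_min=0.5):
    """Keep reliable pixels and attach their 2D labels to the 3D points.

    points: (H, W, 3) per-pixel 3D points
    depth:  (H, W) per-pixel depth in meters
    conf:   (H, W) positive per-pixel confidence
    labels: (H, W) integer labels from the semantic branch
    Returns (M, 3) points and (M,) labels for the M pixels that pass.
    """
    in_range = (depth >= d_min) & (depth <= d_max)
    # Threshold on the log of the confidence; clip to avoid log(0).
    confident = np.log(np.maximum(conf, 1e-12)) >= log_conf_min
    keep = in_range & confident
    return points[keep], labels[keep]

depth = np.array([[0.5, 10.0, 10.0, 60.0]])
conf = np.array([[5.0, 5.0, 1.0, 5.0]])
labels = np.array([[1, 2, 3, 4]])
points = np.stack([np.zeros_like(depth), np.zeros_like(depth), depth], axis=-1)
kept_points, kept_labels = filter_and_label(points, depth, conf, labels)
```

Only the second pixel survives here: the first is too close, the third has low confidence, and the fourth is too far away.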

C. Instance Identification (Panoptic Specific)

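To achieve panoptic occupancy (assigning unique IDs to "thing" classes like cars and pedestrians), the system avoids temporal fusion for dynamic objects to prevent "ghosting": a moving object accumulated over several frames would leave a trail of stale points along its path.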

3D Box Fitting: Fits yaw-oriented 3D boxes to current-view instance candidates using PCA for orientation.
Filtering: Discards boxes with implausible sizes (e.g., a 5m box for a pedestrian) and removes geometric outliers via Interquartile Range (IQR) and PCA-based tests.
Merging & Re-assignment: Merges overlapping boxes of the same class based on Intersection-over-Smaller-Volume (IoSV). Points inside merged boxes get the instance ID; points outside are assigned to the nearest box or labeled "ignore."
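
The box fitting, size filtering, and merge criterion above can be sketched as follows. Several simplifications are assumptions of this sketch rather than details from the paper: PCA is run on the ground-plane (x, y) coordinates only, the `MAX_EXTENT` limits are hypothetical, and IoSV is computed for axis-aligned boxes, whereas oriented boxes would need a polygon intersection.

```python
import numpy as np

# Hypothetical per-class limit on the longest box side, in meters.
MAX_EXTENT = {"pedestrian": 1.5, "car": 6.0}

def fit_yaw_box(points):
    """Fit an upright box whose heading is the main PCA axis of the x-y points.

    Returns (center, size, yaw) with size = (length, width, height).
    The yaw is only defined up to a 180 degree flip.
    """
    xy = points[:, :2]
    mean_xy = xy.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov((xy - mean_xy).T))
    major = eigvecs[:, np.argmax(eigvals)]
    yaw = float(np.arctan2(major[1], major[0]))
    c, s = np.cos(yaw), np.sin(yaw)
    world_to_box = np.array([[c, s], [-s, c]])   # rotation by -yaw
    local = (xy - mean_xy) @ world_to_box.T
    lo, hi = local.min(axis=0), local.max(axis=0)
    center_xy = mean_xy + ((lo + hi) / 2) @ world_to_box
    z_lo, z_hi = points[:, 2].min(), points[:, 2].max()
    center = np.array([center_xy[0], center_xy[1], (z_lo + z_hi) / 2])
    size = np.array([hi[0] - lo[0], hi[1] - lo[1], z_hi - z_lo])
    return center, size, yaw

def is_plausible(size, class_name):
    """Reject boxes whose longest side exceeds the class limit."""
    return float(np.max(size)) <= MAX_EXTENT[class_name]

def iosv_axis_aligned(box_a, box_b):
    """Intersection over smaller volume for boxes given as (min_xyz, max_xyz)."""
    a_min, a_max = (np.asarray(v, dtype=float) for v in box_a)
    b_min, b_max = (np.asarray(v, dtype=float) for v in box_b)
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    smaller = min((a_max - a_min).prod(), (b_max - b_min).prod())
    return float(overlap.prod() / smaller) if smaller > 0 else 0.0

# Corners of a 4 x 2 x 1.5 m box rotated by 30 degrees about z.
half = np.array([2.0, 1.0, 0.75])
signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
angle = np.pi / 6
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle), np.cos(angle), 0.0],
                [0.0, 0.0, 1.0]])
car_points = (signs * half) @ rot.T + np.array([10.0, 5.0, 0.75])
center, size, yaw = fit_yaw_box(car_points)

iosv = iosv_axis_aligned(([0, 0, 0], [4, 2, 2]), ([3, 0, 0], [5, 1, 1]))
```

Dividing by the smaller volume rather than the union makes the score high whenever a small fragment sits mostly inside a larger box, which is the duplicate pattern the merge step targets. In the example the IoSV is 0.5, while the IoU of the same pair would be only 1/17.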

D. Voxelization and Refinement

The fused point cloud is converted into a voxel grid with a deterministic 4-stage refinement stack:

Pinhole/Cavity Filling: Morphological closing to fill small holes in occupied regions.
Warmup Ego Completion: Fills the immediate ego-vehicle blind spot with "driveable surface" if no object evidence exists.
Conservative Neighborhood Coherence: Updates ambiguous voxels only if strong neighborhood agreement exists, protecting reliable and thin structures.
Background Cleanup & Instance Dilation: Reassigns "ignore" voxels and dilates instance IDs to fill gaps in occluded objects.
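
Before any refinement, the labeled points have to be binned into voxels. The sketch below shows one plausible way to do this with a per-voxel majority vote. The voting rule, the function name, and the grid parameters are assumptions for illustration; the summary does not state how the authors resolve voxels that receive points with different labels, and the four refinement stages are not reproduced here.

```python
import numpy as np

def voxelize(points, labels, grid_min, voxel_size, grid_shape, num_classes, empty=-1):
    """Bin labeled points into a voxel grid, taking the majority label per voxel.

    points: (N, 3) float coordinates, labels: (N,) ints in [0, num_classes)
    grid_min: (3,) lower corner of the grid, voxel_size: edge length in meters
    Returns an int grid of shape grid_shape; voxels without points get `empty`.
    """
    grid_min = np.asarray(grid_min, dtype=float)
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx, labels = idx[inside], labels[inside]
    # Count votes per (voxel, class); np.add.at handles repeated indices.
    votes = np.zeros((*grid_shape, num_classes), dtype=np.int64)
    np.add.at(votes, (idx[:, 0], idx[:, 1], idx[:, 2], labels), 1)
    grid = votes.argmax(axis=-1)
    grid[votes.sum(axis=-1) == 0] = empty
    return grid

points = np.array([
    [0.2, 0.3, 0.1],   # voxel (0, 0, 0), label 1
    [0.7, 0.1, 0.5],   # voxel (0, 0, 0), label 1
    [0.5, 0.5, 0.5],   # voxel (0, 0, 0), label 2 (outvoted)
    [1.5, 0.5, 0.5],   # voxel (1, 0, 0), label 0
    [5.0, 5.0, 5.0],   # outside the grid, dropped
])
labels = np.array([1, 1, 2, 0, 2])
grid = voxelize(points, labels, grid_min=(0.0, 0.0, 0.0), voxel_size=1.0,
                grid_shape=(2, 2, 1), num_classes=3)
```

Voxels left at `empty` after this step are the ones a refinement stack like the one above would try to fill or leave free.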

3. Key Contributions

Training-Free Inference: FreeOcc is presented as the first pipeline to perform panoptic occupancy prediction without any target-domain training. It runs entirely on foundation model inference.
Open-Vocabulary Flexibility: By leveraging promptable models, the system can adapt to new semantic classes simply by changing text prompts, without retraining a 3D network.
Pseudo-Label Generation: The pipeline also serves as a pseudo-label generator. A downstream model trained on its outputs (e.g., STCOcc) achieves state-of-the-art weakly supervised performance.
New Baselines: Establishes the first baselines for both train-free and weakly supervised panoptic occupancy prediction on the Occ3D-nuScenes dataset.

4. Experimental Results

Evaluated on the Occ3D-nuScenes validation set:

Semantic Occupancy

Train-Free: Achieves 16.9 mIoU and 16.5 RayIoU.
- Outperforms the previous train-free baseline (ShelfOcc: 9.6 mIoU) by a significant margin (+7.3 points).
- Competitive with trained weakly supervised methods (e.g., GaussianFlowOcc: 17.1 mIoU).
Weakly Supervised (Pseudo-Label Training): An STCOcc model trained on FreeOcc pseudo-labels achieves 22.8 mIoU and 21.1 RayIoU, surpassing the previous state-of-the-art weakly supervised baseline (ShelfOcc+STCOcc: 20.0 RayIoU).
- Note: FreeOcc achieves this without using camera visibility masks during training, suggesting better generalization to hidden parts of the scene.
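
For readers unfamiliar with the metric, the mIoU numbers above average a per-class intersection-over-union across classes. The following is a simplified sketch on a single flattened voxel grid. The ignore value and the rule of skipping classes absent from both prediction and target are assumptions of this sketch; an official benchmark typically accumulates counts over the whole dataset, and RayIoU evaluates along query rays rather than per voxel.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore=255):
    """Mean of per-class IoU over voxels whose target is not `ignore`."""
    pred, target = np.asarray(pred), np.asarray(target)
    valid = target != ignore
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        t = (target == c) & valid
        union = (p | t).sum()
        if union == 0:
            continue  # class absent from both prediction and target
        ious.append((p & t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

pred = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 255])
miou = mean_iou(pred, target, num_classes=3)
# Per-class IoU: class 0 -> 1/2, class 1 -> 2/3, class 2 -> 1/1
```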

Panoptic Occupancy

Train-Free: 3.1 RayPQ.
Weakly Supervised: 3.9 RayPQ.
While lower than fully supervised methods (e.g., SparseOcc: 14.1 RayPQ), these results demonstrate the feasibility of instance-aware prediction without 3D ground truth. The performance gap suggests that geometric alignment remains the primary bottleneck.
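
As context for the RayPQ figures, the standard panoptic quality (PQ) metric is defined over matched prediction and ground-truth instance pairs (true positives, TP), unmatched predictions (FP), and unmatched ground-truth instances (FN). How RayPQ performs the matching along rays is not spelled out in this summary, so the link to RayPQ should be read as an assumption:

$$
\mathrm{PQ} = \frac{\sum_{(p, g) \in \mathit{TP}} \mathrm{IoU}(p, g)}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}
$$

As a worked example, two matched instances with IoU 0.8 and 0.6, one false positive, and one false negative give

$$
\mathrm{PQ} = \frac{0.8 + 0.6}{2 + \tfrac{1}{2} + \tfrac{1}{2}} = \frac{1.4}{3} \approx 0.47
$$

The metric therefore penalizes both poorly overlapping matches and missed or duplicated instances, which is why split or merged objects weigh heavily on panoptic scores.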

Ablation Studies

Prompt Design: Using synonyms in prompts (vs. raw class names) provided a +2.7 mIoU gain.
Refinement: The voxel refinement stack contributed +2.1 mIoU.
Instance Identification: The instance module provided the largest jump in panoptic quality (RayPQ rose from 1.5 to 2.5 in the ablation).
Pose Dependency: Removing camera extrinsics (poses) caused a massive performance drop (mIoU dropped by 53%), indicating that accurate poses are currently essential for reliable 3D fusion.

5. Significance and Conclusion

FreeOcc demonstrates that foundation models can bridge the gap between 2D perception and 3D scene understanding without the heavy cost of 3D annotation and training.

Practicality: It enables rapid deployment in new environments (e.g., different cities or sensor setups) where no labeled 3D data exists.
Future Direction: The paper identifies geometry quality and pose accuracy as the main bottlenecks. Future work should focus on making these pipelines robust to inaccurate poses and improving the geometric fidelity of foundation models to close the gap with fully supervised methods.

In summary, FreeOcc sets a strong baseline for label-free 3D perception: its training-free predictions are competitive with trained weakly supervised methods (16.9 vs. 17.1 mIoU for GaussianFlowOcc), and models trained on its pseudo-labels exceed the previous weakly supervised state of the art, while offering greater flexibility and adaptability.