Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

Imagine you are a robot trying to navigate a messy, cluttered living room. Your goal is to build an accurate 3D map of the room so you don't bump into things and can find specific items, like "the red mug" or "the cat."

For a long time, robots could only map rooms if they were taught a fixed list of things beforehand, like "chair," "table," or "wall." This is like a child who only knows the words in their first vocabulary book. If you asked them to find a "toaster," they would be confused because they've never heard that word.

This paper introduces LegoOcc, a new way for robots to build 3D maps of indoor rooms using just a single camera photo. The robot can understand any object you ask for, even ones it has never seen before, by using language as a guide.

Here is how it works, broken down with some fun analogies:

1. The Problem: The "Indoor Jungle" vs. The "Highway"

Previous robots were great at mapping outdoor driving scenes (like highways). Highways are open, with clear lanes and predictable objects (cars, signs).
But indoor rooms are like a dense jungle. Everything is packed tight, objects overlap, and there are thousands of tiny, specific things (a specific type of shoe, a stack of papers, a weird lamp).

The Old Way: Tried to use the "highway" rules for the "jungle." It failed because the indoor world is too messy and complex.
The New Way (LegoOcc): Adopts a strategy that accepts the messiness. Instead of trying to label every single item with a specific name during training, it focuses on where things are (geometry) first, and uses language to figure out what they are later.

2. The Core Tool: "Language-Embedded Gaussians"

Imagine the robot doesn't build the room out of solid blocks (like Minecraft). Instead, it builds the room out of invisible, glowing fog clouds (called Gaussians).

Each cloud has a shape, a size, and a "transparency" (how see-through it is).
The Magic: Each cloud also carries a tiny "backpack" containing a language description. One cloud might carry the concept of "chair," another "floor," and another "shoe."
The robot learns to place these clouds in 3D space so that when you look at them, they form the shape of the room.

3. Challenge #1: The "Ghost Cloud" Problem (Geometry)

The Issue: When the robot tries to figure out if a specific spot in the room is "occupied" (has something there) or "empty," it looks at these fog clouds.
In the past, the math used to combine these clouds was like trying to guess the weight of a pile of feathers by just looking at how they float. It was unstable. The robot would get confused and think empty space was full, or vice versa.
The Solution: The authors invented a new math trick called the "Poisson Approach."

Analogy: Imagine the clouds are like raindrops falling into a bucket. Instead of just counting how many drops hit, the robot calculates the probability that at least one drop hit a specific spot.
This makes the robot much better at distinguishing between "empty air" and "solid objects," even when the training data only tells it "occupied" or "free" (binary), without telling it what the object is.

4. Challenge #2: The "Smoothie" Problem (Semantics)

The Issue: When the robot looks at a photo, it sees a mix of objects. If a "chair" and a "table" overlap in the camera view, the robot's "fog clouds" get mixed together.

The Old Way: It was like blending a strawberry and a banana into a smoothie. The robot learned the taste of the "smoothie" (the mix), but it couldn't tell you which part was strawberry and which was banana. This made it bad at identifying specific items later.
The Solution: The authors use a "Progressive Temperature Decay" schedule.
Analogy: Imagine the robot is learning to sort marbles. At first, the marbles are warm and soft, so they stick together (easy to learn the general shape). As training progresses, the robot slowly "cools them down."
As they cool, the marbles become hard and distinct. The robot learns to stop blending the "chair" and "table" together and starts seeing them as sharp, separate entities. This allows the robot to say, "That specific cloud is definitely a chair," even if it was mixed with a table in the photo.

5. The Result: A Robot That Can Talk to You

Because of these two tricks, the robot can now:

Build a map using only a single photo and a simple "occupied/free" label (which is cheap and easy to get).
Answer questions like, "Show me where the shoes are," or "Find the paper," even if it was never explicitly trained on those words.

In Summary:
LegoOcc is like teaching a robot to build a 3D puzzle using invisible, language-tagged fog. It uses a new math trick to keep the fog from collapsing and a "cooling" schedule to make sure the fog doesn't blend into a messy smoothie. The result is a robot that can make sense of the messy, complex world of a human home and look up whatever object you name in a text query.

1. Problem Statement

The paper addresses the challenge of Open-Vocabulary 3D Occupancy Prediction in indoor environments using only a monocular camera.

Context: Embodied agents (robots, AR/VR) require a unified understanding of 3D geometry and semantics to navigate complex indoor spaces.
Limitations of Existing Work:
- Closed-Vocabulary: Traditional methods are restricted to a fixed set of categories defined during training, limiting real-world deployment where objects are diverse and long-tailed.
- Outdoor vs. Indoor Gap: While Open-Vocabulary methods exist for outdoor driving (e.g., roads, cars), they fail indoors due to denser geometry, intricate layouts, and fine-grained semantic categories.
- Supervision Cost: Existing indoor open-vocabulary approaches often rely on expensive 3D semantic annotations or distillation from 2D segmentation, which is difficult to scale.
Goal: Develop a framework that predicts 3D occupancy for arbitrary text queries using only binary occupancy labels (occupied vs. free) for supervision, eliminating the need for dense 3D semantic ground truth.

2. Methodology: LegoOcc Framework

The proposed framework, LegoOcc, utilizes 3D Language-Embedded Gaussians (LE-Gaussians) as a unified intermediate representation. This representation couples native geometric parameters (position, scale, rotation, opacity) with learnable, language-aligned semantic embeddings.
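
To make the representation concrete, here is a minimal sketch of one language-embedded Gaussian and of how a text query could be matched against it. The class name, field names, and the toy 2-D embeddings are illustrative assumptions, not the paper's code; real embeddings would come from a vision-language text encoder.

```python
import math
from dataclasses import dataclass


@dataclass
class LEGaussian:
    """One language-embedded Gaussian; field names are illustrative."""
    mean: tuple        # (x, y, z) center position
    scale: tuple       # per-axis extent
    rotation: tuple    # unit quaternion (w, x, y, z)
    opacity: float     # in [0, 1]
    embedding: list    # language-aligned feature vector


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def best_label(gaussian, text_embeddings):
    """Return the query label whose text embedding best matches the Gaussian."""
    return max(text_embeddings,
               key=lambda name: cosine(gaussian.embedding, text_embeddings[name]))


# Toy 2-D vectors standing in for real text-encoder outputs (assumption).
queries = {"chair": [1.0, 0.0], "shoe": [0.0, 1.0]}
g = LEGaussian((0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (1.0, 0.0, 0.0, 0.0), 0.9, [0.2, 0.8])
label = best_label(g, queries)
```

Because the label set is just a dictionary of text embeddings, adding a new category at query time means adding an entry to it; nothing about the Gaussians has to change.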

The system operates in two coupled learning paths:

A. Geometry Learning: Opacity-Aware Poisson-based Gaussian-to-Occupancy (G2O)

Challenge: Standard Gaussian-to-Occupancy operators (e.g., GaussianFormer2) aggregate Gaussians multiplicatively but ignore opacity ($\alpha$). When trained with only binary occupancy labels, this leads to unstable convergence and a mismatch between 2D rendering (which uses opacity) and 3D aggregation.
Solution: The authors introduce a Poisson-based G2O operator.
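- Instead of treating Gaussians as independent Bernoulli trials, they model the contribution of each Gaussian at a point $x$ as a non-negative event intensity, $h_i(x) = \alpha_i p_i(x)$, where $p_i(x)$ is the $i$-th Gaussian's spatial response at $x$ and $\alpha_i$ is its opacity.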
- The total occupancy at a voxel is modeled as the probability of at least one event occurring in a non-homogeneous Poisson process:
  $p(x) = 1 - \exp\left(-\sum_{i=1}^{N} \alpha_i p_i(x)\right)$
- This formulation explicitly incorporates opacity into the geometry branch, ensuring consistency with the rendering process and stabilizing training under weak (binary-only) supervision.
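
The formula above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: Gaussians are axis-aligned (rotation is ignored), $p_i(x)$ is taken to be an unnormalized Gaussian kernel, and the opacity-free comparison uses the multiplicative form $1 - \prod_i (1 - p_i(x))$, which matches the description of the standard operator but is not taken from its code.

```python
import math


def gaussian_kernel(x, mean, scale):
    """Unnormalized axis-aligned Gaussian response in (0, 1]; equals 1 at the mean."""
    d2 = sum(((xi - mi) / si) ** 2 for xi, mi, si in zip(x, mean, scale))
    return math.exp(-0.5 * d2)


def poisson_occupancy(x, gaussians):
    """p(x) = 1 - exp(-sum_i alpha_i * p_i(x))."""
    intensity = sum(alpha * gaussian_kernel(x, mean, scale)
                    for mean, scale, alpha in gaussians)
    return 1.0 - math.exp(-intensity)


def opacity_free_occupancy(x, gaussians):
    """Multiplicative aggregation that ignores opacity: 1 - prod_i (1 - p_i(x))."""
    free = 1.0
    for mean, scale, _alpha in gaussians:
        free *= 1.0 - gaussian_kernel(x, mean, scale)
    return 1.0 - free


# Each entry is (mean, scale, opacity); the values are made up for illustration.
gaussians = [
    ((0.0, 0.0, 0.0), (0.2, 0.2, 0.2), 0.9),    # opaque Gaussian
    ((1.0, 0.0, 0.0), (0.2, 0.2, 0.2), 0.01),   # nearly transparent Gaussian
]
p_opaque = poisson_occupancy((0.0, 0.0, 0.0), gaussians)
p_transparent = poisson_occupancy((1.0, 0.0, 0.0), gaussians)
p_ignored = opacity_free_occupancy((1.0, 0.0, 0.0), gaussians)
```

At the center of the nearly transparent Gaussian, the Poisson form returns an occupancy close to its opacity (about 0.01), while the opacity-free form reports the voxel as fully occupied. That gap is the rendering/aggregation mismatch described in the challenge.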

B. Semantic Learning: Progressive Temperature Decay

Challenge: In indoor scenes, objects heavily overlap in 2D projections. Naive $\alpha$-blending during Gaussian splatting creates "feature mixing," where a single pixel's feature becomes a weighted average of multiple overlapping Gaussian embeddings. This dilutes the semantic signal and hinders alignment with open-vocabulary text features.
Solution: A Progressive Temperature Decay schedule.
- The opacity $\alpha_i$ is computed via a tempered sigmoid: $\alpha_i = \sigma(\text{logit}_i / \tau)$.
- During training, the temperature $\tau$ is annealed from a high value ($T_{\max}=1$) to a low value ($T_{\min}=10^{-3}$) using an exponential decay schedule.
- Effect: Initially, high $\tau$ allows smooth feature mixing for stable optimization. As $\tau$ decreases, opacities sharpen towards binary (0 or 1), effectively suppressing cross-category blending and forcing the model to assign distinct, discriminative language embeddings to specific Gaussians.
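
The schedule and its effect on blending can be illustrated as follows. This is a sketch, not the authors' implementation: the per-step interpolation, the step count, and the two example logits are assumptions; only the tempered sigmoid and the endpoints $1$ and $10^{-3}$ come from the description above. The blending weights use standard front-to-back alpha compositing.

```python
import math

T_MAX, T_MIN = 1.0, 1e-3  # endpoints quoted in the text


def temperature(step, total_steps, t_max=T_MAX, t_min=T_MIN):
    """Exponential decay from t_max at step 0 to t_min at the last step."""
    frac = step / max(total_steps - 1, 1)
    return t_max * (t_min / t_max) ** frac


def tempered_sigmoid(logit, tau):
    """alpha = sigmoid(logit / tau), written to avoid overflow for tiny tau."""
    z = logit / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def blend_weights(alphas):
    """Front-to-back compositing weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights, transmittance = [], 1.0
    for a in alphas:
        weights.append(a * transmittance)
        transmittance *= 1.0 - a
    return weights


# Two Gaussians on one camera ray: the front one (say, a chair) and one behind it.
logits = [0.5, 0.5]
soft = blend_weights([tempered_sigmoid(l, temperature(0, 100)) for l in logits])
sharp = blend_weights([tempered_sigmoid(l, temperature(99, 100)) for l in logits])
```

At $\tau = 1$ the Gaussian behind still contributes roughly a quarter of the pixel feature, so the two embeddings are averaged. At $\tau = 10^{-3}$ the front opacity saturates and the pixel feature comes almost entirely from a single Gaussian.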

C. Training Objective

The model is trained with a composite loss function:

Geometry Loss: Binary occupancy classification losses (Focal Loss + Lovász-Softmax) against binary occupancy ground truth, plus a depth loss for stability.
Semantic Loss: A cosine-similarity loss between rendered Gaussian features and features from a training-free open-vocabulary segmenter (e.g., Trident). No 2D human annotations are required for the semantic branch.
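
A simplified sketch of how the two terms might be combined is shown below. It covers only a binary focal term and the cosine term; the Lovász-Softmax and depth losses are omitted for brevity, and the focusing parameter `gamma`, the weight `w_sem`, and all function names are assumptions rather than values from the paper.

```python
import math


def focal_bce(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss for one voxel; p is predicted occupancy, y is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)   # clamp to keep log() finite
    pt = p if y == 1 else 1.0 - p     # probability assigned to the true label
    return -((1.0 - pt) ** gamma) * math.log(pt)


def cosine_loss(rendered, teacher):
    """1 - cosine similarity between a rendered feature and a teacher feature."""
    dot = sum(a * b for a, b in zip(rendered, teacher))
    norm = (math.sqrt(sum(a * a for a in rendered))
            * math.sqrt(sum(b * b for b in teacher)))
    return 1.0 - dot / norm


def total_loss(occ_pred, occ_gt, pix_feats, teacher_feats, w_sem=1.0):
    """Mean geometry loss over voxels plus weighted mean semantic loss over pixels."""
    geo = sum(focal_bce(p, y) for p, y in zip(occ_pred, occ_gt)) / len(occ_gt)
    sem = sum(cosine_loss(f, t)
              for f, t in zip(pix_feats, teacher_feats)) / len(pix_feats)
    return geo + w_sem * sem


# Two voxels and one pixel with made-up values.
loss = total_loss(
    occ_pred=[0.9, 0.2], occ_gt=[1, 0],
    pix_feats=[[1.0, 0.0]], teacher_feats=[[2.0, 0.0]],
)
```

Note that the geometry term only ever sees occupied/free labels, while the semantic term only ever sees 2D teacher features, which is what lets the method avoid 3D semantic ground truth.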

3. Key Contributions

LegoOcc Framework: Presented as the first monocular open-vocabulary occupancy framework specifically designed for large-scale indoor scenes, enabling agents to reason about arbitrary objects beyond fixed taxonomies.
Poisson-based G2O Operator: A novel geometric aggregation method that resolves the instability of training Gaussians with binary-only supervision by explicitly modeling opacity as an event intensity.
Progressive Temperature Decay: A training strategy that mitigates feature mixing in dense indoor scenes, significantly improving the alignment between 3D Gaussian embeddings and language semantics.
Geometry-Only Supervision: Demonstrates that high-quality open-vocabulary 3D understanding can be achieved without expensive 3D semantic annotations, relying solely on binary occupancy labels.

4. Experimental Results

The method was evaluated on the Occ-ScanNet dataset (11 semantic classes + empty).

Performance Metrics:
- IoU (Intersection over Union): 59.50 (the highest among the compared methods, including closed-vocabulary baselines).
- mIoU (Mean IoU): 21.05 (11.80 points above the best prior open-vocabulary result, which works out to 9.25, so more than 2x).
Comparisons:
- Outperforms re-implemented baselines (POP-3D, LOcc) which struggle with geometry-only supervision.
- Surpasses closed-vocabulary methods (e.g., EmbodiedOcc++) in IoU, indicating that the proposed geometry operator is effective even without semantic labels.
Efficiency: Achieves 22.47 FPS on an RTX 4090, which is faster than many dense volumetric baselines.
Ablation Studies:
- Replacing the Poisson operator with Bernoulli or standard G2O caused significant drops in mIoU (e.g., from 21.05 to 17.25).
- Removing the temperature decay schedule (keeping $\tau=1$) lowered mIoU from 21.05 to 18.15, indicating that sharpening opacities matters for semantic discrimination.
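
For readers unfamiliar with the two metrics reported above, the sketch below shows how they are commonly computed on a voxel grid. It assumes the usual convention (IoU is class-agnostic over occupied voxels, mIoU averages per-class IoU over the semantic classes); the benchmark's exact evaluation protocol, such as visibility masking, is not reproduced here.

```python
def occupancy_iou(pred, gt, empty=0):
    """Class-agnostic IoU: treats every non-empty label as 'occupied'."""
    inter = sum(1 for p, g in zip(pred, gt) if p != empty and g != empty)
    union = sum(1 for p, g in zip(pred, gt) if p != empty or g != empty)
    return inter / union if union else float("nan")


def mean_iou(pred, gt, classes):
    """Average of per-class IoU over the classes that appear in pred or gt."""
    ious = []
    for c in classes:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Flattened toy voxel grid: 0 = empty, 1 and 2 = two semantic classes.
gt = [0, 1, 1, 2, 2, 0]
pred = [0, 1, 2, 2, 0, 1]
iou = occupancy_iou(pred, gt)           # geometry only
miou = mean_iou(pred, gt, classes=[1, 2])  # geometry and semantics
```

In this toy grid the geometry score is 0.6 while the semantic score is only 1/3, which mirrors why the reported IoU (59.50) is much higher than the mIoU (21.05): a voxel can be correctly marked occupied and still carry the wrong label.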

5. Significance

This work bridges a critical gap in embodied AI by enabling text-driven 3D scene understanding in complex indoor environments without the prohibitive cost of 3D semantic annotation.

Scalability: By relying on binary occupancy (which can be generated via depth fusion) and pre-trained 2D vision-language models, the approach is highly scalable to new environments and categories.
Generalization: The ability to query arbitrary categories (e.g., "shoes," "paper," "random objects") makes the system adaptable to real-world scenarios where the set of objects is unknown and dynamic.
Technical Insight: The paper analyzes why Gaussian-based occupancy training becomes unstable under weak (binary-only) supervision and offers two remedies, Poisson aggregation and temperature annealing, that may also apply more broadly in 3D reconstruction and neural rendering.