Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

This paper introduces a framework for monocular open-vocabulary 3D occupancy prediction in indoor scenes. It trains with geometry-only supervision on 3D Language-Embedded Gaussians, and it adds an opacity-aware Poisson-based aggregation operator and a progressive temperature decay schedule to overcome feature mixing and convergence challenges. The authors report state-of-the-art performance on the Occ-ScanNet benchmark.

Changqing Zhou, Yueru Luo, Han Zhang, Zeyu Jiang, Changhao Chen

Published 2026-02-27

Imagine you are a robot trying to navigate a messy, cluttered living room. Your goal is to build a 3D map of the room that is good enough to keep you from bumping into things and to help you find specific items, like "the red mug" or "the cat."

For a long time, robots could only map rooms if they were taught a fixed list of things beforehand, like "chair," "table," or "wall." This is like a child who only knows the words in their first vocabulary book. If you asked them to find a "toaster," they would be confused because they've never heard that word.

This paper introduces LegoOcc, a new way for robots to build 3D maps of indoor rooms from just a single camera photo. By using language as a guide, the robot can look for any object you ask for, even ones it was never explicitly trained to name.

Here is how it works, broken down with some fun analogies:

1. The Problem: The "Indoor Jungle" vs. The "Highway"

Previous robots were great at mapping outdoor driving scenes (like highways). Highways are open, with clear lanes and predictable objects (cars, signs).
But indoor rooms are like a dense jungle. Everything is packed tight, objects overlap, and there are thousands of tiny, specific things (a specific type of shoe, a stack of papers, a weird lamp).

  • The Old Way: Tried to use the "highway" rules for the "jungle." It failed because the indoor world is too messy and complex.
  • The New Way (LegoOcc): Adopts a strategy that accepts the messiness. Instead of trying to label every single item with a specific name during training, it focuses on where things are (geometry) first, and uses language to figure out what they are later.

2. The Core Tool: "Language-Embedded Gaussians"

Imagine the robot doesn't build the room out of solid blocks (like Minecraft). Instead, it builds the room out of invisible, glowing fog clouds (called Gaussians).

  • Each cloud has a shape, a size, and a "transparency" (how see-through it is).
  • The Magic: Each cloud also carries a tiny "backpack" containing a language description. One cloud might carry the concept of "chair," another "floor," and another "shoe."
  • The robot learns to place these clouds in 3D space so that when you look at them, they form the shape of the room.
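To make the "fog cloud with a backpack" picture concrete, here is a minimal sketch of what one such cloud could look like as a data structure. The class name, the field names, and the choice of a perfectly round (isotropic) cloud are assumptions made for illustration; the paper's actual representation may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class LanguageGaussian:
    """One hypothetical 'fog cloud' (all names here are illustrative)."""
    center: tuple      # (x, y, z) position of the cloud in the room
    scale: float       # size of the cloud; assumed round for simplicity
    opacity: float     # 0.0 = fully see-through, 1.0 = fully solid
    embedding: list    # the "backpack": a language feature vector

    def density(self, point):
        """Gaussian falloff in (0, 1]: 1.0 at the center, fading with distance."""
        d2 = sum((p - c) ** 2 for p, c in zip(point, self.center))
        return math.exp(-0.5 * d2 / self.scale ** 2)


cloud = LanguageGaussian(center=(0.0, 0.0, 0.0), scale=1.0,
                         opacity=0.8, embedding=[1.0, 0.0])
print(cloud.density((1.0, 0.0, 0.0)))  # weaker than at the center
```

A scene would then be a long list of these clouds, and training amounts to adjusting their positions, sizes, transparencies, and backpacks.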

3. Challenge #1: The "Ghost Cloud" Problem (Geometry)

The Issue: When the robot tries to figure out if a specific spot in the room is "occupied" (has something there) or "empty," it looks at these fog clouds.
In the past, the math used to combine these clouds was like trying to guess the weight of a pile of feathers by just looking at how they float. It was unstable. The robot would get confused and think empty space was full, or vice versa.
The Solution: The authors introduce a new way of combining the clouds that takes each cloud's transparency into account, called the "Poisson Approach."

  • Analogy: Imagine the clouds are like raindrops falling into a bucket. Instead of just counting how many drops hit, the robot calculates the probability that at least one drop hit a specific spot.
  • This makes the robot much better at distinguishing between "empty air" and "solid objects," even when the training data only tells it "occupied" or "free" (binary), without telling it what the object is.
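The raindrop idea can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's operator: it assumes each cloud contributes a "rain rate" equal to its opacity times its Gaussian falloff at the query spot, adds the rates up, and then uses the standard Poisson fact that the chance of at least one event is 1 - exp(-rate). The function name and the round clouds are hypothetical.

```python
import math


def occupancy_probability(point, gaussians):
    """Probability that `point` is occupied.

    gaussians: list of (center, scale, opacity) tuples (assumed layout).
    """
    rate = 0.0
    for center, scale, opacity in gaussians:
        d2 = sum((p - c) ** 2 for p, c in zip(point, center))
        # Assumed per-cloud rate: opacity times Gaussian falloff.
        rate += opacity * math.exp(-0.5 * d2 / scale ** 2)
    # Poisson: P(at least one "drop") = 1 - P(zero drops).
    return 1.0 - math.exp(-rate)


clouds = [((0.0, 0.0, 0.0), 1.0, 1.0), ((0.5, 0.0, 0.0), 1.0, 0.6)]
print(occupancy_probability((0.0, 0.0, 0.0), clouds))   # near the clouds: high
print(occupancy_probability((10.0, 0.0, 0.0), clouds))  # far away: close to 0
```

One appeal of this form is that the answer always stays between 0 and 1 no matter how many clouds overlap, which is what a yes/no "occupied or free" training signal needs.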

4. Challenge #2: The "Smoothie" Problem (Semantics)

The Issue: When the robot looks at a photo, it sees a mix of objects. If a "chair" and a "table" overlap in the camera view, the robot's "fog clouds" get mixed together.

  • The Old Way: It was like blending a strawberry and a banana into a smoothie. The robot learned the taste of the "smoothie" (the mix), but it couldn't tell you which part was strawberry and which was banana. This made it bad at identifying specific items later.
  • The Solution: The authors use a "Progressive Temperature Decay" schedule.
  • Analogy: Imagine the robot is learning to sort marbles. At first, the marbles are warm and soft, so they stick together (easy to learn the general shape). As training progresses, the robot slowly "cools them down."
  • As they cool, the marbles become hard and distinct. The robot learns to stop blending the "chair" and "table" together and starts seeing them as sharp, separate entities. This allows the robot to say, "That specific cloud is definitely a chair," even if it was mixed with a table in the photo.
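The "cooling" can be sketched as a softmax with a temperature that shrinks over training. Everything specific here is an assumption for illustration: the exponential shape of the schedule, the start and end temperatures, and the idea that the softmax decides how much each overlapping cloud contributes to a blended feature. The source only says the temperature decays progressively.

```python
import math


def temperature_at(step, total_steps, t_start=1.0, t_end=0.05):
    """Assumed schedule: decay exponentially from t_start to t_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start * (t_end / t_start) ** frac


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into mixing weights; lower temperature = sharper."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0]  # e.g. a "chair" cloud and a "table" cloud on one pixel
for step in (0, 50, 100):
    t = temperature_at(step, 100)
    print(step, round(t, 3), softmax_with_temperature(scores, t))
```

Early in training the two clouds share the credit, which keeps learning smooth; by the end nearly all the weight goes to the stronger cloud, so each cloud's backpack ends up describing one thing rather than a blend.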

5. The Result: A Robot That Can Talk to You

Because of these two tricks, the robot can now:

  1. Build a map from a single photo, after being trained with only simple "occupied/free" labels (which are cheap and easy to get).
  2. Answer questions like, "Show me where the shoes are," or "Find the paper," even if it was never explicitly trained on those words.
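Answering a question like "where are the shoes?" can be pictured as comparing each cloud's backpack with the words you typed. The sketch below assumes the comparison is cosine similarity and uses made-up 3-number vectors as stand-ins for what a real text encoder would produce; the function names and all the vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_label(embedding, text_embeddings):
    """Return the query word whose vector best matches the cloud's backpack."""
    return max(text_embeddings,
               key=lambda name: cosine(embedding, text_embeddings[name]))


# Toy stand-ins for text-encoder outputs (not real embeddings).
queries = {
    "shoe": [1.0, 0.0, 0.0],
    "paper": [0.0, 1.0, 0.0],
    "chair": [0.0, 0.0, 1.0],
}
cloud_backpack = [0.9, 0.1, 0.2]
print(best_label(cloud_backpack, queries))
```

Because the list of query words is supplied at question time rather than fixed during training, adding a new word only means adding one more vector to compare against.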

In Summary:
LegoOcc is like teaching a robot to build a 3D puzzle using invisible, language-tagged fog. It uses a new math trick to keep the fog from collapsing and a "cooling" schedule to make sure the fog doesn't blend into a messy smoothie. The result is a robot that can make sense of the messy, complex world of a human home and look for whatever you name, just from a description in words.
