Imagine you are walking through a brand-new, messy house for the first time. You are holding a camera, and your goal is to tell a friend (who is blindfolded) exactly what's in the room, where the furniture is, and how to navigate it.
The Problem with Old Robots
Most current AI robots are like students with a terrible memory. As you walk through the house, they try to remember every single photo you've taken so far.
- The Memory Crash: After 10 minutes, they have thousands of photos. Their brain (the onboard computer) gets overwhelmed trying to look at all of them at once to answer a simple question like, "Where is the chair?" They run out of memory and processing power.
- The "Blurry" Vision: They also struggle to understand 3D space. If you only see one leg of a table, they might get confused and think it's a weird stick, not a table. They lack the "common sense" to guess the rest of the object.
Enter OnlineSI: The Smart Tour Guide
The paper introduces OnlineSI, a new framework designed to be a "Smart Tour Guide" for robots. Instead of hoarding every photo, it uses three clever tricks to understand the world in real-time.
1. The "Mental Sketchbook" (Finite Spatial Memory)
Imagine you are drawing a map of the house on a small notepad.
- Old Way: You keep adding new pages forever. Eventually, the notepad is too heavy to carry, and you can't find anything.
- OnlineSI Way: You have a notepad with a fixed number of pages. As you walk and see new things, you don't just add pages; you update the existing ones.
- If you see a table from the side, you draw it.
- Later, you walk around and see the front of the table. You don't add a new page; you erase the old sketch and redraw it to fit the new view.
- The Result: The robot's memory stays small and manageable, no matter how long the video is. It only keeps the "best version" of what it has seen so far.
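The "fixed notepad" idea above can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual data structure: the class name, the eviction rule (drop the entry updated longest ago), and the string "observations" are all assumptions made for clarity.

```python
# Toy sketch of a finite spatial memory: a fixed number of "pages" that get
# overwritten, never endlessly appended. (Hypothetical names; the real system
# stores learned 3D features, not strings.)

class SpatialMemory:
    def __init__(self, capacity):
        self.capacity = capacity          # fixed number of notepad pages
        self.entries = {}                 # object_id -> latest observation
        self.order = []                   # least-recently-updated first

    def observe(self, object_id, observation):
        if object_id in self.entries:
            # Seen before: erase the old sketch, redraw it with the new view.
            self.entries[object_id] = observation
            self.order.remove(object_id)
        elif len(self.entries) >= self.capacity:
            # Memory full: evict the entry that was updated longest ago.
            stale = self.order.pop(0)
            del self.entries[stale]
        self.entries[object_id] = observation
        self.order.append(object_id)

memory = SpatialMemory(capacity=2)
memory.observe("table", "side view")
memory.observe("chair", "front view")
memory.observe("table", "front view")   # updates the table in place
memory.observe("lamp", "top view")      # full: "chair" (stalest) is evicted
```

However long the walkthrough runs, `memory.entries` never exceeds `capacity` items, and each item is the most recent view of that object.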
2. The "Super-Helper" (3D + Semantic Fusion)
The robot has a powerful brain (a Large Language Model, or LLM) that is great at reading and talking, but it's bad at looking at raw 3D shapes (like a cloud of dots).
- The Analogy: Imagine trying to describe a "chair" to someone who has never seen one, using only a pile of sand. It's hard.
- The Fix: OnlineSI gives the robot a "label maker." Before showing the sand (3D points) to the brain, it sticks little tags on the sand that say "This is a chair," "This is a table."
- The Result: The brain can now easily understand, "Oh, this pile of sand with the 'chair' tag is a chair!" This helps the robot identify objects even when they are partially hidden or seen from weird angles.
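In code, the "label maker" idea amounts to running a detector over each cluster of 3D points and attaching its predicted class name before the language model ever sees the geometry. The sketch below is purely illustrative: the function names and the stand-in detector are invented here, and the real fusion operates on learned feature vectors, not text strings.

```python
# Hedged sketch of 3D + semantic fusion: tag each point cluster with a class
# label so a language model can reason about "a chair" instead of raw dots.

def fuse_semantics(point_clusters, detector):
    """Attach a predicted class label to each 3D point cluster."""
    fused = []
    for cluster in point_clusters:
        label = detector(cluster)                      # e.g. "chair", "table"
        fused.append({"points": cluster, "label": label})
    return fused

def toy_detector(cluster):
    # Stand-in classifier for the example; a real system would use a trained
    # 3D detector here.
    return "chair" if len(cluster) < 50 else "table"

clusters = [
    [(0.1, 0.2, 0.3)] * 10,    # small cluster -> "chair" under the toy rule
    [(1.0, 1.1, 0.9)] * 100,   # large cluster -> "table"
]
tagged = fuse_semantics(clusters, toy_detector)
```

The key point is the interface: the language model receives `{"points": ..., "label": "chair"}` rather than an unlabeled pile of coordinates, so a partially visible object still arrives with its identity attached.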
3. The "Fuzzy Score" (Handling Uncertainty)
How do you grade a robot that is learning as it goes?
- The Dilemma: If the robot sees a chair but only 20% of it is visible, should we say it "failed" for not describing the whole chair? Or should we say it "passed" because it couldn't see the rest?
- The Solution: The authors invented a new grading system called the Fuzzy F1-Score.
- Strict Rule: "You must detect the whole chair." (Too hard for a robot peeking around a corner).
- Lenient Rule: "You must detect anything you can see." (Too easy, leads to false alarms).
- The Fuzzy Rule: "If the chair is mostly hidden, we don't penalize you for missing it. But if it's clearly visible, you must find it."
- The Result: This gives a fair score that acknowledges the robot is working in a difficult, changing environment.
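One plausible way to implement the fuzzy rule is to weight each ground-truth object by how visible it is: near-zero weight when mostly hidden, full weight when clearly visible, interpolated in between. The sketch below is an assumption about the general shape of such a metric, not the paper's exact formula; the thresholds (0.25 and 0.75) are invented for illustration.

```python
# Hedged sketch of a fuzzy F1 score. Ground-truth objects that are mostly
# hidden carry little or no recall weight, so missing them costs nothing;
# clearly visible objects must be found. (Illustrative thresholds only.)

def fuzzy_weight(visibility, low=0.25, high=0.75):
    """0 below `low`, 1 above `high`, linear in between."""
    if visibility <= low:
        return 0.0
    if visibility >= high:
        return 1.0
    return (visibility - low) / (high - low)

def fuzzy_f1(ground_truth, detected):
    """ground_truth: {object_id: visibility in [0, 1]}; detected: set of ids."""
    weights = {obj: fuzzy_weight(v) for obj, v in ground_truth.items()}
    recall_total = sum(weights.values())
    recall_hit = sum(w for obj, w in weights.items() if obj in detected)
    recall = recall_hit / recall_total if recall_total else 1.0
    # Precision is standard: detecting objects that don't exist is penalized.
    true_pos = sum(1 for obj in detected if obj in ground_truth)
    precision = true_pos / len(detected) if detected else 1.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

gt = {"chair": 0.9, "table": 0.8, "lamp": 0.1}   # the lamp is almost hidden
score = fuzzy_f1(gt, detected={"chair", "table"})
```

With these weights, missing the nearly hidden lamp does not hurt the score, while missing the clearly visible chair would, which matches the fairness intuition described above.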
Why This Matters
This isn't just about better video games. This is the foundation for real-world robots (like delivery bots, home assistants, or search-and-rescue drones) that need to:
- Walk into a new building without crashing.
- Remember where the stairs are while forgetting the dust bunnies under the sofa.
- Update their map instantly if a chair is moved.
In a Nutshell:
OnlineSI is like giving a robot a smart, limited-size sketchbook and a label maker. It allows the robot to learn about a room as it walks through it, constantly refining its understanding without getting overwhelmed by too much data, making it ready to work in our messy, real world.