Articulated 3D Scene Graphs for Open-World Mobile Manipulation

This paper introduces MoMa-SG, a novel framework that constructs semantic-kinematic 3D scene graphs to enable long-horizon mobile manipulation. It robustly infers object articulation models from RGB-D sequences and is validated on a new dataset and in real-world experiments on diverse robotic platforms.

Martin Büchner, Adrian Röfer, Tim Engelbracht, Tim Welschehold, Zuria Bauer, Hermann Blum, Marc Pollefeys, Abhinav Valada

Published 2026-02-19

Imagine you are a robot trying to clean a messy kitchen. You see a refrigerator, a drawer, and a cabinet. To a human, it's obvious: you pull the fridge handle to open it, and the milk inside moves with the door. But to a robot, the world is often just a static collection of shapes. It doesn't know that the fridge door swings on a hinge or that the drawer slides on a track. Without this knowledge, the robot might try to push the fridge door straight forward (and fail) or grab the milk while the door is still closed.

This paper introduces MoMa-SG, a new "brain" for robots that helps them understand how things move in the real world, not just where they are.

Here is a simple breakdown of how it works, using some everyday analogies:

1. The Problem: The Robot is "Blind" to Motion

Traditional robots build a map like a photograph: "There is a fridge here, a drawer there." But they don't know the rules of the game. They don't know that a drawer is a "sliding" object or a door is a "swinging" object. If you ask a standard robot to "get the milk," it might get stuck because it doesn't understand that the fridge door needs to be opened first and that the milk will move with the door.

2. The Solution: The "Movie Director" Approach

Instead of just taking a snapshot, MoMa-SG acts like a movie director. It watches a video of a human (or another robot) interacting with objects.

  • Spotting the Action: It looks for the "scenes" where things are moving. It ignores the boring parts where nothing happens and focuses on the moments someone opens a drawer or swings a door.
  • Tracking the Dots: Imagine putting tiny, invisible stickers on the moving parts of the door. As the human opens the door, the robot tracks how those stickers move. Even if the human's hand blocks the view (occlusion), the robot keeps tracking the stickers, like a detective following a suspect through a crowd.
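The "spotting the action" step above can be sketched in a few lines. This is a simplified stand-in, not the paper's actual detector: it just thresholds the average per-frame displacement of the tracked points ("stickers") to find the segments where something is moving.

```python
import numpy as np

def find_motion_segments(tracks, threshold=0.01):
    """Given per-frame positions of tracked points (shape: frames x points x dims),
    return (start, end) frame-index pairs where average point motion exceeds `threshold`."""
    # Mean displacement of all tracked points between consecutive frames.
    step = np.linalg.norm(np.diff(tracks, axis=0), axis=-1).mean(axis=1)
    moving = step > threshold
    segments, start = [], None
    for i, m in enumerate(moving):
        if m and start is None:
            start = i                       # motion begins
        elif not m and start is not None:
            segments.append((start, i))     # motion ends
            start = None
    if start is not None:
        segments.append((start, len(moving)))
    return segments
```

For example, a sequence that is static, then slides for four frames, then is static again would yield a single segment covering the moving frames. A real system would smooth the signal and handle tracker dropouts, but the idea is the same.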

3. The "Magic Formula": Figuring Out the Hinge

Once the robot has tracked how the stickers moved, it uses a special math trick (called Twist Estimation) to figure out the "secret rule" of the object.

  • The Analogy: Think of it like watching a door swing. The robot asks: "Did these points move in a straight line (like a drawer) or in a circle (like a door)?"
  • The Innovation: Previous methods were easily confused by noise or bad camera angles. MoMa-SG uses a new "regularization" technique. Imagine trying to guess the shape of a coin by looking at it through a foggy window. Old methods might guess it's a square because of the fog. MoMa-SG has a "fog filter" that says, "Even if it looks a bit blurry, I know coins are round," ensuring it correctly identifies the type of movement (sliding vs. swinging) even in messy, real-world conditions.
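The paper's twist estimation works on full 3D rigid-body twists with regularization; as a rough illustrative proxy (not the authors' method), the "straight line or circle?" question can be posed as a model-selection problem. The sketch below fits both a line (PCA) and a circle (Kåsa algebraic fit) to a tracked point's 2D path and picks the model with the smaller residual; the tie-break tolerance plays a crude version of the regularizer's role.

```python
import numpy as np

def line_residual(pts):
    # Best-fit line through the centroid (PCA); residual = RMS distance to that line.
    c = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(c, full_matrices=False)
    proj = c @ vt[0]                                  # coordinates along the principal direction
    return np.sqrt(((c - np.outer(proj, vt[0])) ** 2).sum(axis=1).mean())

def circle_residual(pts):
    # Kasa algebraic circle fit: solve linearly for center (a, b) and radius r.
    x, y = pts[:, 0], pts[:, 1]
    A = np.column_stack([2 * x, 2 * y, np.ones(len(x))])
    rhs = x ** 2 + y ** 2
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    r = np.sqrt(c + a ** 2 + b ** 2)
    d = np.sqrt((x - a) ** 2 + (y - b) ** 2)          # distance of each point to the center
    return np.sqrt(((d - r) ** 2).mean())

def classify_joint(pts, tol=1e-3):
    # Prefer "prismatic" when the fits are close: a line is the simpler model.
    return "prismatic" if line_residual(pts) < circle_residual(pts) + tol else "revolute"
```

A quarter-circle arc of points classifies as "revolute", a straight track as "prismatic". Real data needs the paper's robust, regularized formulation because noise and shallow arcs make these two residuals nearly indistinguishable, which is exactly the "foggy window" problem described above.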

4. Building the "Family Tree" of the Room

Once the robot knows how the fridge door moves, it builds a 3D Scene Graph. Think of this as a family tree for the room's objects.

  • The Parent: The fridge door.
  • The Child: The milk carton inside.
  • The Relationship: The robot learns that the milk is "attached" to the door. If the door moves, the milk moves. If the door is closed, the milk is hidden.
  • The Discovery: The robot can now look inside the fridge (when open) and say, "Ah, that's a milk carton," and remember, "Okay, next time I need milk, I know I have to open the door first, and the milk will be right there."
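In graph terms, this "family tree" is a kinematic tree: each node stores a pose relative to its parent, and world poses compose down the tree, so moving the door node automatically carries the milk node along. A minimal sketch (the class names and structure are illustrative, not the paper's implementation):

```python
import numpy as np

class SceneNode:
    """A node in a simple kinematic scene graph.
    `pose` is a 4x4 homogeneous transform relative to the parent node."""
    def __init__(self, name, parent=None, pose=None):
        self.name, self.parent = name, parent
        self.pose = np.eye(4) if pose is None else pose

    def world_pose(self):
        # Compose transforms from the root down to this node.
        if self.parent is None:
            return self.pose
        return self.parent.world_pose() @ self.pose

def translation(x, y, z):
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

def rotation_z(theta):
    T = np.eye(4)
    T[0, 0] = T[1, 1] = np.cos(theta)
    T[0, 1], T[1, 0] = -np.sin(theta), np.sin(theta)
    return T

# Fridge body -> door -> milk carton attached to the door shelf.
fridge = SceneNode("fridge")
door = SceneNode("door", parent=fridge)
milk = SceneNode("milk", parent=door, pose=translation(0.3, 0.0, 0.5))

door.pose = rotation_z(np.pi / 2)   # swing the door open 90 degrees
# milk.world_pose() now reflects the opened door: the milk travels with it.
```

Updating a single joint value (the door's rotation) is enough to reposition everything attached below it, which is what lets the robot predict where the milk will be once the door is open.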

5. The "Open-World" Superpower

Most robots need a pre-programmed list of objects ("This is a fridge, this is a drawer"). MoMa-SG is different. It's like a curious child who learns by doing.

  • It doesn't need to know what a "fridge" is called beforehand.
  • It just needs to see something move.
  • It can learn about an unusual sliding cabinet, an oddly hinged door, or a new type of container just by watching it move once. This is called "One-Shot Learning."

6. Real-World Testing: The Robot Goes to Work

The researchers didn't just test this on a computer; they put it on real robots (a wheeled robot and a four-legged dog-like robot).

  • The Result: The robots could successfully navigate a house, find a fridge, open it, grab the milk, and close it again.
  • The "Retrial" Feature: If the robot misses the handle or drops the milk, the system is smart enough to realize, "That didn't work," and try again, adjusting its approach based on the map it built.

Summary

MoMa-SG is like giving a robot a pair of glasses that lets it see how things move, not just where they are. It turns a static map of a house into a dynamic, interactive playground where the robot understands that doors swing, drawers slide, and the things inside them travel along for the ride. This allows robots to finally perform complex, long-term tasks like "clean the kitchen" or "get me a snack" in a real, messy home without needing a manual for every single object.
