Articulated 3D Scene Graphs for Open-World Mobile Manipulation

This paper introduces MoMa-SG, a novel framework that constructs semantic-kinematic 3D scene graphs to enable long-horizon mobile manipulation. It robustly infers object articulation models from RGB-D sequences and is validated on a new dataset and in real-world experiments on diverse robotic platforms.

Martin Büchner, Adrian Röfer, Tim Engelbracht, Tim Welschehold, Zuria Bauer, Hermann Blum, Marc Pollefeys, Abhinav Valada

Published 2026-02-19

Imagine you are a robot trying to clean a messy kitchen. You see a refrigerator, a drawer, and a cabinet. To a human, it's obvious: you pull the fridge handle to open it, and the milk inside moves with the door. But to a robot, the world is often just a static collection of shapes. It doesn't know that the fridge door swings on a hinge or that the drawer slides on a track. Without this knowledge, the robot might try to push the fridge door straight forward (and fail) or grab the milk while the door is still closed.

This paper introduces MoMa-SG, a new "brain" for robots that helps them understand how things move in the real world, not just where they are.

Here is a simple breakdown of how it works, using some everyday analogies:

1. The Problem: The Robot is "Blind" to Motion

Traditional robots build a map like a photograph: "There is a fridge here, a drawer there." But they don't know the rules of the game. They don't know that a drawer is a "sliding" object or a door is a "swinging" object. If you ask a standard robot to "get the milk," it might get stuck because it doesn't understand that the fridge door needs to be opened first and that the milk will move with the door.

2. The Solution: The "Movie Director" Approach

Instead of just taking a snapshot, MoMa-SG acts like a movie director. It watches a video of a human (or another robot) interacting with objects.

  • Spotting the Action: It looks for the "scenes" where things are moving. It ignores the boring parts where nothing happens and focuses on the moments someone opens a drawer or swings a door.
  • Tracking the Dots: Imagine putting tiny, invisible stickers on the moving parts of the door. As the human opens the door, the robot tracks how those stickers move. Even if the human's hand blocks the view (occlusion), the robot keeps tracking the stickers, like a detective following a suspect through a crowd.
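The "spotting the action" step above can be sketched in a few lines. This is a simplified stand-in, not the paper's actual detector: it just thresholds the average per-frame displacement of the tracked points ("stickers") to find the segments where something is moving.

```python
import numpy as np

def find_motion_segments(tracks, threshold=0.01):
    """Given per-frame positions of tracked points (shape: frames x points x dims),
    return (start, end) frame-index pairs where average point motion exceeds `threshold`."""
    # Mean displacement of all tracked points between consecutive frames.
    step = np.linalg.norm(np.diff(tracks, axis=0), axis=-1).mean(axis=1)
    moving = step > threshold
    segments, start = [], None
    for i, m in enumerate(moving):
        if m and start is None:
            start = i                       # motion begins
        elif not m and start is not None:
            segments.append((start, i))     # motion ends
            start = None
    if start is not None:
        segments.append((start, len(moving)))
    return segments
```

For example, a sequence that is static, then slides for four frames, then is static again would yield a single segment covering the moving frames. A real system would smooth the signal and handle tracker dropouts, but the idea is the same.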

3. The "Magic Formula": Figuring Out the Hinge

Once the robot has tracked how the stickers moved, it uses a special math trick (called Twist Estimation) to figure out the "secret rule" of the object.

  • The Analogy: Think of it like watching a door swing. The robot asks: "Did these points move in a straight line (like a drawer) or in a circle (like a door)?"
  • The Innovation: Previous methods were easily confused by noise or bad camera angles. MoMa-SG uses a new "regularization" technique. Imagine trying to guess the shape of a coin by looking at it through a foggy window. Old methods might guess it's a square because of the fog. MoMa-SG has a "fog filter" that says, "Even if it looks a bit blurry, I know coins are round," ensuring it correctly identifies the type of movement (sliding vs. swinging) even in messy, real-world conditions.
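The paper's twist estimation works on full 3D rigid-body twists with regularization; as a rough illustrative proxy (not the authors' method), the "straight line or circle?" question can be posed as a model-selection problem. The sketch below fits both a line (PCA) and a circle (Kåsa algebraic fit) to a tracked point's 2D path and picks the model with the smaller residual; the tie-break tolerance plays a crude version of the regularizer's role.

```python
import numpy as np

def line_residual(pts):
    # Best-fit line through the centroid (PCA); residual = RMS distance to that line.
    c = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(c, full_matrices=False)
    proj = c @ vt[0]                                  # coordinates along the principal direction
    return np.sqrt(((c - np.outer(proj, vt[0])) ** 2).sum(axis=1).mean())

def circle_residual(pts):
    # Kasa algebraic circle fit: solve linearly for center (a, b) and radius r.
    x, y = pts[:, 0], pts[:, 1]
    A = np.column_stack([2 * x, 2 * y, np.ones(len(x))])
    rhs = x ** 2 + y ** 2
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    r = np.sqrt(c + a ** 2 + b ** 2)
    d = np.sqrt((x - a) ** 2 + (y - b) ** 2)          # distance of each point to the center
    return np.sqrt(((d - r) ** 2).mean())

def classify_joint(pts, tol=1e-3):
    # Prefer "prismatic" when the fits are close: a line is the simpler model.
    return "prismatic" if line_residual(pts) < circle_residual(pts) + tol else "revolute"
```

A quarter-circle arc of points classifies as "revolute", a straight track as "prismatic". Real data needs the paper's robust, regularized formulation because noise and shallow arcs make these two residuals nearly indistinguishable, which is exactly the "foggy window" problem described above.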

4. Building the "Family Tree" of the Room

Once the robot knows how the fridge door moves, it builds a 3D Scene Graph. Think of this as a family tree for the room's objects.

  • The Parent: The fridge door.
  • The Child: The milk carton inside.
  • The Relationship: The robot learns that the milk is "attached" to the door. If the door moves, the milk moves. If the door is closed, the milk is hidden.
  • The Discovery: The robot can now look inside the fridge (when open) and say, "Ah, that's a milk carton," and remember, "Okay, next time I need milk, I know I have to open the door first, and the milk will be right there."
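In graph terms, this "family tree" is a kinematic tree: each node stores a pose relative to its parent, and world poses compose down the tree, so moving the door node automatically carries the milk node along. A minimal sketch (the class names and structure are illustrative, not the paper's implementation):

```python
import numpy as np

class SceneNode:
    """A node in a simple kinematic scene graph.
    `pose` is a 4x4 homogeneous transform relative to the parent node."""
    def __init__(self, name, parent=None, pose=None):
        self.name, self.parent = name, parent
        self.pose = np.eye(4) if pose is None else pose

    def world_pose(self):
        # Compose transforms from the root down to this node.
        if self.parent is None:
            return self.pose
        return self.parent.world_pose() @ self.pose

def translation(x, y, z):
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

def rotation_z(theta):
    T = np.eye(4)
    T[0, 0] = T[1, 1] = np.cos(theta)
    T[0, 1], T[1, 0] = -np.sin(theta), np.sin(theta)
    return T

# Fridge body -> door -> milk carton attached to the door shelf.
fridge = SceneNode("fridge")
door = SceneNode("door", parent=fridge)
milk = SceneNode("milk", parent=door, pose=translation(0.3, 0.0, 0.5))

door.pose = rotation_z(np.pi / 2)   # swing the door open 90 degrees
# milk.world_pose() now reflects the opened door: the milk travels with it.
```

Updating a single joint value (the door's rotation) is enough to reposition everything attached below it, which is what lets the robot predict where the milk will be once the door is open.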

5. The "Open-World" Superpower

Most robots need a pre-programmed list of objects ("This is a fridge, this is a drawer"). MoMa-SG is different. It's like a curious child who learns by doing.

  • It doesn't need to know what a "fridge" is called beforehand.
  • It just needs to see something move.
  • It can learn about an unusual sliding cabinet, an oddly hinged door, or a new type of container just by watching it move once. This is called "One-Shot Learning."

6. Real-World Testing: The Robot Goes to Work

The researchers didn't just test this on a computer; they put it on real robots (a wheeled robot and a four-legged dog-like robot).

  • The Result: The robots could successfully navigate a house, find a fridge, open it, grab the milk, and close it again.
  • The "Retrial" Feature: If the robot misses the handle or drops the milk, the system is smart enough to realize, "That didn't work," and try again, adjusting its approach based on the map it built.

Summary

MoMa-SG is like giving a robot a pair of glasses that lets it see how things move, not just where they are. It turns a static map of a house into a dynamic, interactive playground where the robot understands that doors swing, drawers slide, and the things inside them travel along for the ride. This allows robots to finally perform complex, long-term tasks like "clean the kitchen" or "get me a snack" in a real, messy home without needing a manual for every single object.
