MoMaStage: Skill-State Graph Guided Planning and Closed-Loop Execution for Long-Horizon Indoor Mobile Manipulation

MoMaStage is a structured vision-language framework for robust long-horizon indoor mobile manipulation. It guides task planning through a topology-aware Skill-State Graph and keeps execution reliable with a closed-loop mechanism that triggers semantic replanning when it detects physical deviations, all without requiring explicit scene mapping.

Chenxu Li, Zixuan Chen, Yetao Li, Jiapeng Xu, Hongyu Ding, Jieqi Shi, Jing Huo, Yang Gao

Published Tue, 10 Ma

Imagine you are asking a very smart, well-read robot to perform a complex chore: "Go to the kitchen, find the grey plate on the counter, bring it to the dining table, and set it down."

Sounds simple, right? But for a robot, this is a nightmare. If the robot tries to plan the whole thing in one giant leap of logic, it often gets confused. It might try to pick up the plate before it has walked to the kitchen, or try to grab the plate while its gripper is still holding a cup. And in the real world, things go wrong: the plate slips, the robot bumps into a chair, or the path is blocked. Without a plan B, the whole mission fails.

This paper introduces MoMaStage, a new way to help robots handle these long, complicated tasks without getting lost or crashing. Here is how it works, explained through simple analogies:

1. The Problem: The "Hallucinating" Brain

Current robots often use a "Vision-Language Model" (VLM). Think of this as a super-smart but slightly absent-minded librarian.

  • The Good: The librarian knows the meaning of words and can break a big task into small steps (e.g., "Walk," "Pick," "Place").
  • The Bad: The librarian has never actually moved a body. They might suggest a plan that sounds perfect in words but is physically impossible (like trying to pick up a heavy box with a hand that is already full). They also don't remember what happened two steps ago, leading to "hallucinations" where the robot tries to do things that don't make sense in the current situation.

2. The Solution: The "Train Track" System (MoMaStage)

MoMaStage fixes this by giving the librarian a map of valid train tracks instead of letting them wander off-road.

A. The Skill-State Graph (The Train Tracks)

Imagine the robot's possible actions are like train stations.

  • The Map: MoMaStage builds a "Skill-State Graph." This isn't a map of the room (like a floor plan); it's a map of what the robot can do next based on what it is currently holding or where it is.
  • The Rule: You can only take a train if the tracks connect. If the robot is holding a cup, the "Pick up plate" track is broken and doesn't exist. The system forces the robot to only choose steps that are physically possible right now.
  • The Benefit: This stops the robot from making logical mistakes before it even starts moving.
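
To make the "train tracks" idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the state fields, the four skill names, and the preconditions are all hypothetical, and a plain breadth-first search stands in for the VLM that would normally choose among the valid options. The point is only to show how a graph of skills and states can hide impossible moves before any planning happens.

```python
from collections import deque
from typing import NamedTuple, Optional


class State(NamedTuple):
    """Hypothetical symbolic state; the field names are illustrative only."""
    robot_at: str
    holding: Optional[str]   # object in the gripper, or None
    plate_at: Optional[str]  # room containing the plate, or None while held


# Hypothetical skills: name -> (precondition, effect).
# An edge (a "track") exists only when the precondition holds in the current state.
SKILLS = {
    "go_to_kitchen": (lambda s: s.robot_at != "kitchen",
                      lambda s: s._replace(robot_at="kitchen")),
    "go_to_dining": (lambda s: s.robot_at != "dining",
                     lambda s: s._replace(robot_at="dining")),
    "pick_plate": (lambda s: s.holding is None and s.plate_at == s.robot_at,
                   lambda s: s._replace(holding="plate", plate_at=None)),
    "place_plate": (lambda s: s.holding == "plate",
                    lambda s: s._replace(holding=None, plate_at=s.robot_at)),
}


def valid_skills(state):
    """Return only the skills whose tracks connect to the current state."""
    return [name for name, (pre, _) in SKILLS.items() if pre(state)]


def plan(start, goal_test):
    """Breadth-first search over the graph (a stand-in for the VLM planner)."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if goal_test(state):
            return path
        for name in valid_skills(state):
            nxt = SKILLS[name][1](state)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None  # no valid sequence of skills reaches the goal


# A gripper that already holds a cup cannot pick up the plate.
busy = State(robot_at="kitchen", holding="cup", plate_at="kitchen")
print(valid_skills(busy))  # ['go_to_dining']

start = State(robot_at="dining", holding=None, plate_at="kitchen")
print(plan(start, lambda s: s.plate_at == "dining"))
# ['go_to_kitchen', 'pick_plate', 'go_to_dining', 'place_plate']
```

Notice that "pick_plate" never shows up as an option while the gripper is full, so the planner cannot pick it by mistake.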

B. The Closed-Loop Execution (The Safety Inspector)

Even with a perfect map, real life is messy. What if the robot drops the plate?

  • Old Way: The robot would keep trying to follow the original plan, fail, and crash.
  • MoMaStage Way: The robot has a Safety Inspector (a closed-loop mechanism).
    • The robot constantly checks its own body (like checking if your hand is still holding the cup).
    • If something goes wrong (e.g., "I dropped the plate!"), the Safety Inspector immediately stops the train.
    • It doesn't panic. It goes back to the Train Tracks, finds a new valid path from the current situation (e.g., "Go back to the kitchen, pick up a new plate"), and starts moving again.
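
The Safety Inspector loop can be sketched the same way. This is a toy illustration, not the authors' code: the two-field state, the hand-written `toy_plan` planner, the `predict` model, and the simulated "dropped plate" are all assumptions made for the example. Deviation detection is reduced here to comparing the state the robot expected with the state it actually observes after each skill.

```python
def predict(state, skill):
    """Expected result of a skill. State is (robot_at, plate_at); both hypothetical."""
    robot_at, plate_at = state
    if skill.startswith("go_to_"):
        return (skill[len("go_to_"):], plate_at)
    if skill == "pick_plate":
        return (robot_at, "gripper")
    if skill == "place_plate":
        return (robot_at, robot_at)
    raise ValueError(f"unknown skill: {skill}")


def toy_plan(state):
    """Hand-written planner for this toy domain (stands in for graph-guided planning)."""
    robot_at, plate_at = state
    if plate_at == "dining":
        return []
    steps = []
    if plate_at != "gripper":
        if robot_at != plate_at:
            steps.append("go_to_" + plate_at)
        steps.append("pick_plate")
        robot_at = plate_at
    if robot_at != "dining":
        steps.append("go_to_dining")
    steps.append("place_plate")
    return steps


def run_closed_loop(state, goal_test, plan_fn, predict_fn, execute_fn, max_replans=3):
    """Run a plan step by step; replan from the observed state on any deviation."""
    log, replans = [], 0
    steps = plan_fn(state)
    while not goal_test(state):
        if not steps:
            return False, log                 # no valid path from here
        skill = steps.pop(0)
        expected = predict_fn(state, skill)
        state = execute_fn(state, skill)      # what actually happened
        log.append(skill)
        if state != expected:                 # the inspector spots a deviation
            if replans >= max_replans:
                return False, log
            replans += 1
            log.append("REPLAN")
            steps = plan_fn(state)            # new path from the current situation
    return True, log


def make_flaky_execute():
    """Simulated robot that drops the plate the first time it carries it out."""
    dropped = {"done": False}

    def execute(state, skill):
        nxt = predict(state, skill)
        if skill == "go_to_dining" and state[1] == "gripper" and not dropped["done"]:
            dropped["done"] = True
            return (nxt[0], "kitchen")        # plate slipped and stayed behind
        return nxt

    return execute


ok, log = run_closed_loop(("dining", "kitchen"), lambda s: s[1] == "dining",
                          toy_plan, predict, make_flaky_execute())
print(ok, log)
```

In this run the robot loses the plate on its first trip to the dining room, logs a "REPLAN", goes back to the kitchen, and finishes the task on the second attempt. The `max_replans` cap is a design choice for the sketch so that a permanently broken skill cannot loop forever.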

3. Why This is a Big Deal

The researchers tested this on real robots and in complex simulations. Here is what they found:

  • No More "Daydreaming": Because the robot is forced to follow the "Train Tracks," it stops wasting time considering impossible moves. It plans faster and needs less computation.
  • Bouncing Back: When things go wrong (which happens often in the real world), MoMaStage doesn't give up. It works out a new valid path from wherever the robot has ended up.
  • Long-Horizon Success: Most robots fail after 3 or 4 steps. MoMaStage successfully completed tasks with 17+ steps (like navigating through a whole house and doing multiple chores) with a much higher success rate than previous methods.

The Bottom Line

Think of MoMaStage as giving a robot a smart GPS that knows the rules of physics.

  • Instead of just saying "Drive to the store," it says, "You can only drive to the store if your tank is full and the road isn't blocked."
  • If you get a flat tire, it doesn't just say "Error." It immediately reroutes you to the nearest tire shop.

By combining a "smart brain" (the VLM) with a "strict rulebook" (the Skill-State Graph) and a "safety net" (closed-loop replanning), MoMaStage makes robots much more reliable for doing the long, boring, and complex chores of daily life.