MoMaStage: Skill-State Graph Guided Planning and Closed-Loop Execution for Long-Horizon Indoor Mobile Manipulation

Imagine you are asking a very smart, well-read robot to perform a complex chore: "Go to the kitchen, find the grey plate on the counter, bring it to the dining table, and set it down."

Sounds simple, right? But for a robot, this is a nightmare. If the robot tries to do this in one giant leap of logic, it often gets confused. It might try to pick up the plate before it has walked to the kitchen, or it might try to put the plate down while it's still holding a cup. In the real world, things go wrong: the plate slips, the robot bumps into a chair, or the path is blocked. Without a plan B, the whole mission fails.

This paper introduces MoMaStage, a new way to help robots handle these long, complicated tasks without getting lost or crashing. Here is how it works, explained through simple analogies:

1. The Problem: The "Hallucinating" Brain

Current robots often use a "Vision-Language Model" (VLM). Think of this as a super-smart but slightly absent-minded librarian.

The Good: The librarian knows the meaning of words and can break a big task into small steps (e.g., "Walk," "Pick," "Place").
The Bad: The librarian has never actually moved a body. They might suggest a plan that sounds perfect in words but is physically impossible (like trying to pick up a heavy box with a hand that is already full). They also don't remember what happened two steps ago, leading to "hallucinations" where the robot tries to do things that don't make sense in the current situation.

2. The Solution: The "Train Track" System (MoMaStage)

MoMaStage fixes this by giving the librarian a map of valid train tracks instead of letting them wander off-road.

A. The Skill-State Graph (The Train Tracks)

Imagine the robot's possible actions are like train stations.

The Map: MoMaStage builds a "Skill-State Graph." This isn't a map of the room (like a floor plan); it's a map of what the robot can do next based on what it is currently holding or where it is.
The Rule: You can only take a train if the tracks connect. If the robot is holding a cup, the "Pick up plate" track is broken and doesn't exist. The system forces the robot to only choose steps that are physically possible right now.
The Benefit: This stops the robot from making logical mistakes before it even starts moving.

B. The Closed-Loop Execution (The Safety Inspector)

Even with a perfect map, real life is messy. What if the robot drops the plate?

Old Way: The robot would keep trying to follow the original plan, fail, and crash.
MoMaStage Way: The robot has a Safety Inspector (a closed-loop mechanism).
- The robot constantly checks its own body (like checking if your hand is still holding the cup).
- If something goes wrong (e.g., "I dropped the plate!"), the Safety Inspector immediately stops the train.
- It doesn't panic. It goes back to the Train Tracks, finds a new valid path from the current situation (e.g., "Go back to the kitchen, pick up a new plate"), and starts moving again.

3. Why This is a Big Deal

The researchers tested this on real robots and in complex simulations. Here is what they found:

No More "Daydreaming": Because the robot is forced to follow the "Train Tracks," it stops wasting time thinking about impossible moves. It plans faster and uses less computer power.
Bouncing Back: When things go wrong (which happens often in the real world), MoMaStage doesn't give up. It replans a new path starting from wherever the robot currently is.
Long-Horizon Success: The baseline methods in the paper's real-world tests broke down early in the sequence. MoMaStage successfully completed tasks with 17+ steps (like navigating through a whole house and doing multiple chores) with a much higher success rate than those previous methods.

The Bottom Line

Think of MoMaStage as giving a robot a smart GPS that knows the rules of physics.

Instead of just saying "Drive to the store," it says, "You can only drive to the store if your tank is full and the road isn't blocked."
If you get a flat tire, it doesn't just say "Error." It immediately reroutes you to the nearest gas station or tire shop.

By combining a "smart brain" (the VLM) with a "strict rulebook" (the Skill-State Graph) and a "safety net" (closed-loop replanning), MoMaStage makes robots much more reliable for doing the long, boring, and complex chores of daily life.

Here is a detailed technical summary of the paper "MoMaStage: Skill-State Graph Guided Planning and Closed-Loop Execution for Long-Horizon Indoor Mobile Manipulation."

1. Problem Statement

Indoor Long-Horizon Mobile Manipulation (MoMA) involves robots executing complex sequences of navigation and object interaction (e.g., "move a plate from the kitchen to the dining table") in dynamic environments. Current approaches face three critical bottlenecks:

Cascading Errors: Learning-based end-to-end policies struggle with logical consistency over long horizons; a single error early in the sequence often leads to total task failure.
Physical Hallucinations: Vision-Language Models (VLMs) are excellent at semantic reasoning but often generate "physically impossible" plans (e.g., picking up an object while the gripper is already full) because they lack explicit state tracking.
Rigid Representations: Traditional symbolic planners rely on explicit 3D scene maps, which are computationally expensive, brittle to dynamic changes, and difficult to scale.

The core challenge is to bridge high-level semantic reasoning with low-level physical constraints without relying on heavy explicit mapping, while ensuring robust recovery from execution failures.

2. Methodology: MoMaStage Framework

MoMaStage is a map-free, structured vision-language framework that replaces open-loop planning with a closed-loop, state-grounded execution pipeline. It consists of three core modules:

A. Structured Skill Grounding

Instead of a full world model, MoMaStage uses a lightweight abstraction to track the robot's physical state:

Hierarchical Skill Library:
- Action-Level Skills: Low-level primitives (e.g., joint control, basic motion) decoupled from semantics.
- Semantic-Level Skills: High-level tasks (e.g., Pick, Place, Navigate) with explicit preconditions and sequential relationships.
Skill-State Graph ($G$): A topology-aware graph whose nodes represent semantic skills. Crucially, each node is augmented with:
- Precondition State ($C$): A tuple describing the robot's location and gripper states (e.g., $L_{scene}$, $O_{left}$, $O_{right}$).
- State Variation ($\Delta$): A function defining how a skill changes the state (e.g., ADD object to gripper, SUB object, MOVE location).
- This graph acts as a "feasibility filter," ensuring that only transitions respecting the robot's current physical state are considered valid.
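To make the node structure concrete, here is a minimal Python sketch of a Skill-State Graph acting as a feasibility filter. It is an illustration under stated assumptions, not the paper's implementation: the names `State`, `Skill`, `SKILLS`, `GRAPH`, and `feasible_successors` are hypothetical, only the left gripper is used, and the edge set is made up for the plate example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class State:
    """Embodiment state: where the robot is and what each gripper holds."""
    location: str                # plays the role of L_scene
    left: Optional[str] = None   # O_left (None means the gripper is empty)
    right: Optional[str] = None  # O_right (unused in this toy example)


@dataclass(frozen=True)
class Skill:
    """One graph node: a semantic skill with precondition C and variation Delta."""
    name: str
    precondition: Callable[[State], bool]
    delta: Callable[[State], State]


def navigate(dest: str) -> Skill:
    # MOVE: always allowed in this toy model; only the location changes.
    return Skill(f"navigate({dest})", lambda s: True,
                 lambda s: replace(s, location=dest))


def pick_left(obj: str, where: str) -> Skill:
    # ADD: requires being at the object's location with an empty left gripper.
    return Skill(f"pick({obj})",
                 lambda s: s.location == where and s.left is None,
                 lambda s: replace(s, left=obj))


def place_left(obj: str, where: str) -> Skill:
    # SUB: requires holding the object at the target location.
    return Skill(f"place({obj})",
                 lambda s: s.location == where and s.left == obj,
                 lambda s: replace(s, left=None))


SKILLS: Dict[str, Skill] = {sk.name: sk for sk in [
    navigate("kitchen"), navigate("dining"),
    pick_left("plate", "kitchen"), place_left("plate", "dining"),
]}

# Directed edges: which skills may follow which ("start" is a virtual root).
GRAPH: Dict[str, List[str]] = {
    "start": ["navigate(kitchen)", "pick(plate)"],
    "navigate(kitchen)": ["pick(plate)", "navigate(dining)"],
    "pick(plate)": ["navigate(dining)", "place(plate)"],
    "navigate(dining)": ["place(plate)", "navigate(kitchen)"],
    "place(plate)": [],
}


def feasible_successors(current: str, state: State) -> List[str]:
    """Graph neighbours of `current` whose precondition holds in `state`."""
    return [n for n in GRAPH.get(current, []) if SKILLS[n].precondition(state)]


hallway = State(location="hallway")
print(feasible_successors("start", hallway))        # pick(plate) is filtered out
holding = SKILLS["pick(plate)"].delta(State(location="kitchen"))
print(feasible_successors("pick(plate)", holding))  # place(plate) is filtered out
```

From the hallway only `navigate(kitchen)` survives the filter, because `pick(plate)` requires the robot to be in the kitchen. Once the plate is held in the kitchen, `place(plate)` is excluded until the robot has navigated to the dining area.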

B. Graph-Constrained Planning and Verification

The planning process is a two-stage loop:

Topology-Aware Semantic Planning: The VLM decomposes natural language instructions into a candidate skill sequence. Unlike standard VLMs, it is constrained by the Skill-State Graph's topology, forcing it to select only adjacent, feasible skills.
State-Driven Feasibility Verification: A post-hoc check applies each skill's state variation ($\Delta$) in order, starting from the initial state and carrying the accumulated state forward. If a skill's precondition conflicts with the accumulated state (e.g., trying to Pick when the gripper is occupied), the plan is rejected, and the VLM is triggered to re-decompose the task.
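The verification stage described above can be sketched as a simple fold over the candidate plan. This is a hedged illustration rather than the authors' code: `State`, `Skill`, and `verify_plan` are hypothetical names, the skills are hand-written stand-ins for VLM output, and the re-decomposition step that follows a rejection is not modelled.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class State(NamedTuple):
    location: str
    left: Optional[str] = None   # object in the left gripper, if any
    right: Optional[str] = None  # object in the right gripper, if any


class Skill(NamedTuple):
    name: str
    precondition: Callable[[State], bool]
    delta: Callable[[State], State]


def verify_plan(initial: State, plan: List[Skill]) -> Tuple[Optional[int], State]:
    """Roll the state forward through `plan`.

    Returns (index of the first skill whose precondition fails, state at that
    point), or (None, final state) when every step is feasible.
    """
    state = initial
    for i, skill in enumerate(plan):
        if not skill.precondition(state):
            return i, state          # reject: caller would ask for a new plan
        state = skill.delta(state)   # accumulate the state variation
    return None, state


nav_kitchen = Skill("navigate(kitchen)", lambda s: True,
                    lambda s: s._replace(location="kitchen"))
nav_dining = Skill("navigate(dining)", lambda s: True,
                   lambda s: s._replace(location="dining"))
pick_plate = Skill("pick(plate)",
                   lambda s: s.location == "kitchen" and s.left is None,
                   lambda s: s._replace(left="plate"))
pick_cup = Skill("pick(cup)",
                 lambda s: s.location == "kitchen" and s.left is None,
                 lambda s: s._replace(left="cup"))
place_plate = Skill("place(plate)",
                    lambda s: s.location == "dining" and s.left == "plate",
                    lambda s: s._replace(left=None))

start = State("hallway")
good = [nav_kitchen, pick_plate, nav_dining, place_plate]
bad = [nav_kitchen, pick_plate, pick_cup, nav_dining, place_plate]

print(verify_plan(start, good))  # no failing index; gripper empty at the end
print(verify_plan(start, bad))   # fails at index 2: gripper already holds plate
```

Returning the failing index together with the accumulated state gives a caller enough context to tell the planner which step was infeasible and why, which is one plausible way to drive the re-decomposition request.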

C. Closed-Loop Execution and Replanning

To handle real-world dynamics, the system decouples physical safety from semantic reasoning:

Ego-State Monitoring: High-frequency proprioceptive sensors (joint encoders, tactile sensors) monitor the physical execution of primitives.
Semantic Verification: The VLM verifies if the observed scene state matches the expected outcome of the skill.
Graph-Grounded Replanning: Upon detecting a deviation (e.g., a failed grasp), the system does not restart. Instead, it treats the current physical state as the new initial state and queries the Skill-State Graph to find a corrective path to the remaining subtasks. This allows for autonomous recovery from out-of-distribution disruptions.
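The execute, monitor, and replan cycle above can be sketched as a small simulation. Several simplifications are assumptions of this sketch and not claims about the paper: a breadth-first search stands in for the graph-constrained VLM planner, a plain comparison of expected and observed state stands in for both proprioceptive monitoring and VLM verification, the robot has a single gripper, and the `delivered` flag, `plan`, `run`, and `make_flaky_executor` are hypothetical names.

```python
from collections import deque
from typing import Callable, List, NamedTuple, Optional, Tuple


class State(NamedTuple):
    location: str
    held: Optional[str] = None
    delivered: bool = False      # illustrative task-progress flag


class Skill(NamedTuple):
    name: str
    precondition: Callable[[State], bool]
    delta: Callable[[State], State]


SKILLS = [
    Skill("nav_kitchen", lambda s: True, lambda s: s._replace(location="kitchen")),
    Skill("nav_dining", lambda s: True, lambda s: s._replace(location="dining")),
    Skill("pick_plate",
          lambda s: s.location == "kitchen" and s.held is None and not s.delivered,
          lambda s: s._replace(held="plate")),
    Skill("place_plate",
          lambda s: s.location == "dining" and s.held == "plate",
          lambda s: s._replace(held=None, delivered=True)),
]


def plan(state: State, goal: Callable[[State], bool]) -> Optional[List[Skill]]:
    """Breadth-first search over feasible skills (stand-in for the VLM planner)."""
    queue, seen = deque([(state, [])]), {state}
    while queue:
        s, path = queue.popleft()
        if goal(s):
            return path
        for sk in SKILLS:
            if sk.precondition(s):
                nxt = sk.delta(s)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [sk]))
    return None


def make_flaky_executor(fail_once_on: str) -> Callable[[Skill, State], State]:
    """Simulated robot: the named skill has no effect the first time it runs."""
    failed = set()

    def execute(skill: Skill, state: State) -> State:
        if skill.name == fail_once_on and skill.name not in failed:
            failed.add(skill.name)
            return state             # e.g. the grasp slipped, nothing changed
        return skill.delta(state)
    return execute


def run(initial: State, goal, execute, max_replans: int = 3
        ) -> Tuple[State, List[str], int]:
    state, log, replans = initial, [], 0
    steps = plan(state, goal)
    while steps:
        skill = steps.pop(0)
        expected = skill.delta(state)
        state = execute(skill, state)      # observed state after execution
        log.append(skill.name)
        if state != expected:              # deviation detected
            replans += 1
            if replans > max_replans:
                break
            steps = plan(state, goal)      # observed state is the new start
    return state, log, replans


final, log, replans = run(State("hallway"), lambda s: s.delivered,
                          make_flaky_executor("pick_plate"))
print(log, replans, final)
```

In this run the first grasp fails, the loop notices that the gripper is still empty, and the new plan starts from the kitchen, so the robot retries the pick without repeating the navigation step.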

3. Key Contributions

MoMaStage Framework: A novel, map-free architecture that unifies instruction understanding, skill-chain generation, and feedback-driven refinement into a single closed-loop pipeline.
State-Grounded Skill Planning: The introduction of the Skill-State Graph, which explicitly models how skills alter the robot's embodiment state. This prevents logical hallucinations and enforces cumulative state feasibility without requiring heavy 3D scene reconstruction.
Robust Closed-Loop Execution: A mechanism that uses lightweight proprioceptive monitoring to trigger targeted semantic replanning only when necessary, balancing inference latency with execution robustness.
Empirical Validation: Extensive experiments demonstrating that explicitly modeling state transitions is more effective than relying solely on VLM reasoning or monolithic learning policies.

4. Experimental Results

The framework was evaluated in real-world environments (Agilex Cobot Magic platform with 4 arms) and simulations (mshab* benchmark).

Success Rates:
- Real-World: MoMaStage achieved a 60% cumulative final success rate on 17-step long-horizon tasks. In contrast, baselines like End-to-End ACT failed immediately (0%), and DeCo* (a VLM baseline) dropped to 0% by subtask 7 due to lack of recovery.
- Simulation: MoMaStage maintained success rates between 79% and 94% across diverse scenes (Tidy House, Prepare Groceries, Set Table), significantly outperforming DeCo* (20–47%) and Ground Truth (GT) open-loop execution, which degraded rapidly due to compounding physical errors.
Efficiency:
- MoMaStage reduced token overhead by ~28% and inference time by ~35% compared to an unpruned graph variant (MoMaStage*), which suggests that structurally pruning the search space lightens the reasoning load on the VLM.
- Planning success reached 100% for MoMaStage, whereas DeCo* failed to generate valid plans 100% of the time in complex scenarios.
Failure Analysis:
- The system successfully recovered from execution anomalies (e.g., grasping slips) via replanning.
- Failures were primarily attributed to low-level physical limitations (simulator physics bugs, mesh clipping) rather than high-level semantic errors, validating the robustness of the planning logic.
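The compounding of errors over long horizons noted in these results can be made concrete with a back-of-the-envelope calculation. Purely as an illustrative assumption (the paper is not reported to model it this way), suppose each of the $N$ subtasks succeeds independently with the same probability $p$, so that the whole task succeeds with probability $P_{\text{task}}$:

```latex
P_{\text{task}} = p^{N}
\quad\Longrightarrow\quad
p = P_{\text{task}}^{1/N}

N = 17,\; P_{\text{task}} = 0.60:\qquad p = 0.60^{1/17} \approx 0.970

N = 17,\; p = 0.90:\qquad P_{\text{task}} = 0.90^{17} \approx 0.167
```

Under this simplified model, a 60% final success rate on a 17-step task corresponds to roughly 97% reliability per step, while a seemingly respectable 90% per step would yield only about 17% overall. This is why recovery from individual failures matters so much at this horizon length.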

5. Significance

MoMaStage's central claim is that, for long-horizon mobile manipulation, explicit state grounding matters more than richer scene representations.

Bridging the Gap: It effectively bridges the gap between high-level semantic reasoning (VLMs) and low-level physical execution, solving the "physical hallucination" problem common in generative AI robotics.
Scalability: By avoiding explicit 3D mapping and relying on a lightweight state graph, the system is more adaptable to dynamic, unstructured environments.
Future Direction: The work suggests that future advances in MoMA should focus less on improving VLM reasoning and more on the physical dexterity and robustness of low-level action skills, since the remaining failures in these experiments stemmed mainly from low-level execution rather than from planning.