To Move or Not to Move: Constraint-based Planning Enables Zero-Shot Generalization for Interactive Navigation

Imagine you are a delivery driver in a brand-new city, but you don't have a map. Your job is to drop off 20 different packages (like a pillow, a vase, or a bottle) to specific houses (like a bed, a table, or a desk).

The Problem:
In most robot navigation demos, the roads are always clear. But in real life, imagine walking into a house where someone has piled up boxes, chairs, and laundry right in the middle of the hallway. You can't get to the kitchen because the path is completely blocked.

Old robots would just say, "I can't go there," and stop. Or, they would try to squeeze through, get stuck, and give up. They treat the environment like a static puzzle that must be solved once.

The New Idea: "Lifelong Interactive Navigation"
This paper introduces a robot that doesn't just drive; it thinks and moves things. It's like a smart mover who realizes that if they move a heavy sofa out of the way now, it might make the next 19 deliveries much easier.

The core question the robot asks itself is: "To Move or Not to Move?"

The Creative Analogy: The "Smart Librarian" vs. The "Brute Force Janitor"

To understand why this robot is special, let's compare it to two other characters:

The Brute Force Janitor (The "Clean Everything" Robot):
Imagine a janitor who, before delivering a single book, decides to move every single chair, table, and lamp in the entire library to the back room.
- Pros: The path is perfectly clear.
- Cons: It takes forever. The janitor is exhausted, and the library is now a mess of furniture in the back. If you need to deliver 20 books, this approach is too slow and inefficient.

The Passive Observer (The "Detour" Robot):
Imagine a person who sees a chair blocking the path and just walks around it, even if it means walking in a giant, confusing circle for 10 minutes.
- Pros: They don't touch anything.
- Cons: They waste a lot of time, and if the chair was blocking a door to a whole new room, they might never find the next package.

The Smart Librarian (This Paper's Robot):
This robot is like a brilliant librarian. It looks at the mess and asks:
- "Is this chair blocking the only door to the next room? Yes? Okay, I'll move it."
- "Is this chair just in the middle of a wide hallway? Yes? Then I'll just walk around it."
- "If I move this heavy box, could its new spot block the path for the next 10 deliveries? Yes? Okay, I'll put it somewhere out of the way, not just anywhere."

How Does It Think? (The "Brain" and the "Eyes")

The robot uses a special combination of tools to make these decisions:

The Eyes (Active Perception): The robot doesn't know the whole house at first. It has to look around. As it moves, it builds a mental map (a "scene graph") of what it sees: "There's a red bottle here, a desk there, and a paper towel roll blocking the door."
The Brain (The Large Language Model): This is the magic part. Instead of programming the robot with thousands of specific rules (e.g., "If you see a red bottle, do X"), the researchers use a Large Language Model (LLM), the same kind of AI that writes essays or chats with you.
- They don't ask the AI to "drive the robot."
- Instead, they ask the AI to be a Constraint Reasoner. They give it a list of facts: "The paper towel roll is blocking the path to the desk. Moving it takes 5 seconds. The desk is in the bedroom. We have 19 more tasks to do."
- The AI then reasons: "Moving the paper towel roll now will save us 10 minutes of walking later. Let's do it. But let's put it in the black box, not on the floor, so it doesn't block the next room."

The "Zero-Shot" Superpower

Usually, to teach a robot a new trick, you have to train it for weeks on that specific trick. This robot is Zero-Shot.

Think of it like a human who has never seen a specific messy room before. You walk in, look at the clutter, and instantly know, "I should move that box to get to the fridge." You didn't need to practice moving boxes in that specific room for 1,000 hours. You just used your common sense.

This robot does the same. It uses its "common sense" (the LLM) to figure out how to handle any new messy room it encounters, without needing to be re-trained.

The Results: Why It Matters

The researchers tested this in simulation, using cluttered houses from ProcTHOR-10k, a collection of 10,000 procedurally generated virtual homes.

The "Brute Force" robots moved too much stuff and took too long.
The "Passive" robots got stuck or took huge detours.
The "Smart Librarian" (this robot) moved just the right amount of stuff, at the right time, to the right place.

It completed the tasks faster than the robots that tried to clean everything, and it succeeded far more often than the robots that refused to move anything.

The Real-World Test

Finally, they put this brain on a real robot (a Boston Dynamics Spot, which looks like a robotic dog with an arm). They gave it a real task: "Bring the red bottle to the desk."
The robot looked around, saw a paper towel roll blocking the way, decided to move it, placed it neatly in a black box, and then successfully delivered the bottle. It did this without any human telling it exactly how to move the roll, proving that this "thinking" approach works in the real, messy world.

In short: This paper teaches robots to stop just driving around obstacles and start strategically rearranging their world to make their future jobs easier, using AI to make smart, long-term decisions just like a human would.

1. Problem Definition: Lifelong Interactive Navigation

The paper addresses a critical gap in visual navigation: the assumption that an obstacle-free path always exists between a start and a goal. In real-world environments (homes, warehouses), clutter often blocks all routes.

The authors introduce the Lifelong Interactive Navigation problem, characterized by:

Sequential Tasks: A mobile manipulator receives a stream of tasks (e.g., "Bring Object A to Receptacle B") in the same evolving environment.
Unknown & Cluttered Environments: The robot starts with no knowledge of object locations or clutter. It must explore and discover the scene.
Long-Horizon Consequences: Every decision (to move an obstacle or detour) has lasting effects. Moving a clutter item might clear a path for the current task but block a future one, or vice versa.
Partial Observability: The robot must actively perceive the environment to build a map while reasoning about future tasks.
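
The long-horizon point can be made concrete with a toy sketch. The layout, cell coordinates, and the `reachable` and `run_tasks` helpers below are hypothetical and are not taken from the paper; they only illustrate how the place where an obstacle ends up changes which later tasks remain feasible.

```python
from collections import deque

def reachable(free, start, goal):
    """Breadth-first search over 4-connected free cells."""
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in free and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def run_tasks(cells, obstacle_at, tasks):
    """Report, per task, whether its goal is reachable with the obstacle at obstacle_at."""
    free = cells - {obstacle_at}
    return [reachable(free, start, goal) for start, goal in tasks]

# Assumed layout: a corridor along row 0, a dead-end alcove at (1, 1),
# and a side branch (1, 4) -> (2, 4) that only the second task needs.
cells = {(0, c) for c in range(5)} | {(1, 1), (1, 4), (2, 4)}
tasks = [((0, 0), (0, 4)), ((0, 4), (2, 4))]

left_in_corridor = run_tasks(cells, (0, 2), tasks)   # obstacle never moved
dropped_in_alcove = run_tasks(cells, (1, 1), tasks)  # moved somewhere harmless
dropped_in_branch = run_tasks(cells, (1, 4), tasks)  # moved onto a future route
```

Leaving the obstacle in the corridor fails the first task, dropping it in the alcove keeps both tasks feasible, and dropping it in the side branch solves the first task while blocking the second.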

2. Methodology

The proposed framework decouples strategic long-horizon planning from tactical low-level control, utilizing a Large Language Model (LLM) as a high-level constraint reasoner rather than a sequence generator.

A. Structured Scene Graph & Perception

Grid Graph ($G$): The environment is represented as a grid graph where nodes are traversable cells.
Scene Graph ($E_t$): As the robot explores, it builds a directed graph where nodes represent discovered objects/rooms, and edges represent blocking relations (e.g., Object A blocks the shortest path to Object B).
Metrics: The system calculates Betweenness Centrality for obstacles to determine how critical they are to global connectivity. It also estimates path costs and manipulation costs.
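
To illustrate why betweenness centrality is a useful signal, here is a minimal sketch that scores cells of a small grid graph. It assumes a 4-connected grid with unit edge costs and scores the cell an obstacle would occupy; the summary does not say exactly how the paper attaches centrality to obstacles, so this is one plausible reading. The `grid_graph` and `betweenness` helpers and the two-room layout are hypothetical. The centrality routine is Brandes' algorithm for unweighted graphs.

```python
from collections import deque

def grid_graph(rows, cols, walls):
    """4-connected grid graph: each free cell maps to its free neighbours."""
    free = {(r, c) for r in range(rows) for c in range(cols)} - set(walls)
    return {
        (r, c): [n for n in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)) if n in free]
        for (r, c) in free
    }

def betweenness(graph):
    """Unnormalised betweenness centrality (Brandes, unweighted, undirected)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:  # back-propagate dependencies, farthest cells first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}

# Assumed layout: two 3x3 rooms joined by a single doorway cell at (1, 3).
graph = grid_graph(3, 7, walls={(0, 3), (2, 3)})
scores = betweenness(graph)
doorway, corner = (1, 3), (0, 0)
```

In this layout every one of the 9 x 9 cross-room pairs has to route through the doorway, so its unnormalised score is 81, while a corner cell scores close to zero. An obstacle sitting on the doorway is therefore far more critical to connectivity than one tucked into a corner.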

B. LLM as a Constraint Reasoner

Instead of asking the LLM to output a sequence of motor commands (which leads to hallucination and poor generalization), the LLM acts as a constraint solver:

Input: A structured text serialization of the scene graph, including object attributes, blocking relations, and cost estimates.
Reasoning: The LLM performs a cost-benefit analysis for each blocking object:
- Cost: Time to navigate to the object + manipulation effort (pick/place) + time to move it to a drop zone.
- Benefit: Improvement in global connectivity (measured by Betweenness Centrality) and reduction in future path lengths.
Decision: The LLM selects one of three high-level actions:
1. Move: Relocate a specific obstacle to a specific drop zone to clear a critical path.
2. Detour: Navigate around the obstacle if the cost of moving it outweighs the benefit.
3. Explore: Move to an unexplored room to find missing task-relevant objects.
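
The trade-off the LLM is asked to weigh can be sketched numerically. The code below is not the paper's method: in the paper the LLM does this reasoning over serialized text, whereas here a hand-written rule stands in for it. The `Blocker` fields, the text layout produced by `serialize`, the rule in `choose_action`, and all numbers are assumptions made for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Blocker:
    name: str
    blocks: str          # target whose shortest path this object blocks
    centrality: float    # normalised betweenness of the occupied cell, 0..1
    move_cost: float     # seconds: reach it + pick/place + carry to a drop zone
    detour_cost: float   # extra seconds per affected task; inf if no detour exists

def serialize(blockers, remaining_tasks):
    """Flatten blocking relations into structured text of the kind an LLM could read."""
    lines = [f"remaining_tasks={remaining_tasks}"]
    for b in blockers:
        lines.append(
            f"{b.name} blocks {b.blocks}; centrality={b.centrality:.2f}; "
            f"move_cost={b.move_cost:g}s; detour_cost={b.detour_cost:g}s"
        )
    return "\n".join(lines)

def choose_action(target, known_objects, blockers, remaining_tasks):
    """Hand-written stand-in for the LLM's decision among explore / move / detour."""
    if target not in known_objects:
        return ("explore", None)  # target not discovered yet
    for b in blockers:
        if b.blocks != target:
            continue
        # Assumption: centrality approximates the share of future tasks affected.
        affected = max(1.0, b.centrality * remaining_tasks)
        if b.move_cost < b.detour_cost * affected:
            return ("move", b.name)
    return ("detour", None)  # go around, or simply navigate if nothing blocks

roll = Blocker("paper_towel_roll", "desk", 0.60, 5.0, math.inf)
chair = Blocker("chair", "bed", 0.05, 40.0, 3.0)
known = {"desk", "bed"}

prompt_text = serialize([roll, chair], remaining_tasks=19)
desk_action = choose_action("desk", known, [roll, chair], 19)
bed_action = choose_action("bed", known, [roll, chair], 19)
fridge_action = choose_action("fridge", known, [roll, chair], 19)
```

With these made-up numbers, the roll that fully blocks the desk is moved, the chair that costs only a short detour is left alone, and the undiscovered fridge triggers exploration. Raising `move_cost` makes the rule more selective, which mirrors the behaviour reported in the manipulation-cost ablation.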

C. Low-Level Execution

Once the LLM outputs a high-level plan (e.g., "Move Paper Towel to Black Box"), a standard motion planner (Dijkstra-based) and the robot's control stack execute the specific navigation and pick-and-place primitives. This ensures reliable, physics-compliant execution.
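
As a rough sketch of the navigation primitive, the following runs Dijkstra's algorithm on a 4-connected grid. The summary only states that the planner is Dijkstra-based, so the unit step cost, the `shortest_path` helper, and the two-room layout are assumptions for illustration rather than the authors' implementation.

```python
import heapq

def shortest_path(free, start, goal):
    """Dijkstra over 4-connected free cells with unit step cost; None if unreachable."""
    dist = {start: 0}
    parent = {}
    frontier = [(0, start)]
    while frontier:
        d, cell = heapq.heappop(frontier)
        if d > dist[cell]:
            continue  # stale queue entry
        if cell == goal:
            path = [cell]
            while cell in parent:  # walk back to the start
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt not in free:
                continue
            nd = d + 1
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                parent[nxt] = cell
                heapq.heappush(frontier, (nd, nxt))
    return None

# Assumed layout: two 3x3 rooms joined by a doorway cell at (1, 3).
cells = {(r, c) for r in range(3) for c in range(7)} - {(0, 3), (2, 3)}
doorway = (1, 3)

blocked = shortest_path(cells - {doorway}, (0, 0), (0, 6))  # obstacle on the doorway
cleared = shortest_path(cells, (0, 0), (0, 6))              # after the obstacle is moved
```

While the obstacle occupies the doorway no path exists and the planner returns `None`; once the high-level plan has relocated it, the same call returns a route through the doorway.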

3. Key Contributions

Lifelong Interactive Navigation Problem: Formalizing the challenge of sequential navigation and manipulation in unknown, cluttered environments where decisions impact future feasibility.
Constraint-Based Planning Framework: A novel architecture that re-frames LLMs as constraint reasoners operating over structured scene graphs. This enables zero-shot generalization without task-specific fine-tuning.
Active Perception Coupling: The system tightly couples reasoning with perception, deciding where to look next based on task relevance and environmental constraints, rather than exhaustive mapping.
New Evaluation Metrics: Introduction of the Long-term Efficiency Score (LES), which balances Success Rate (SR), Time Steps (TS), and the Price of Clutter (PoC) (a metric quantifying how much the agent's actions degrade or improve future navigability).

4. Experimental Results

The approach was evaluated in simulation on ProcTHOR-10k environments and on a Boston Dynamics Spot robot.

Baselines: The method was compared against a learning-based method (InterNav) and heuristic baselines (Always Detour, Always Interact, Clean + Shortest Path).
Performance:
- Success Rate (SR): The proposed method achieved high success rates (e.g., ~94% in small rooms and ~62% in large rooms for the unknown variant), significantly outperforming learning-based baselines, which failed in complex, multi-room settings.
- Long-term Efficiency (LES): The method achieved the highest LES, outperforming the strongest non-learned baseline by 20–50% and prior interactive methods by 3–6× in larger environments.
- Scalability: While baselines degraded rapidly as room count increased (1–10 rooms), the proposed method maintained stability by selectively manipulating only high-impact obstacles.
Ablation Studies:
- Manipulation Cost: The system dynamically adjusted its strategy; as the cost of moving objects increased, it became more selective, avoiding unnecessary manipulations while maintaining high success.
- LLM Choice: Different LLMs (Gemini, GPT-5, DeepSeek) were tested. Gemini performed best, highlighting that reasoning capabilities (not just language generation) are crucial for embodied tasks.
Real-World Deployment: The system was successfully deployed on a Boston Dynamics Spot robot, demonstrating effective sim-to-real transfer in a partially observed, cluttered indoor environment without hardware-specific fine-tuning.

5. Significance

This work shifts embodied AI navigation in three ways:

From Reactive to Proactive: It moves beyond reactive detours or blind cleaning, enabling agents to strategically reshape their environment for long-term goals.
Zero-Shot Generalization: By using LLMs as constraint reasoners rather than action generators, the system generalizes to unseen environments and task sequences without retraining.
Sustainable Navigation: The introduction of the "Price of Clutter" metric and the LES score emphasizes that successful navigation in lifelong settings requires preserving the environment's structure, not just solving the immediate task.

In summary, the paper demonstrates that combining structured scene representation with LLM-driven constraint reasoning allows mobile manipulators to solve complex, long-horizon navigation tasks in cluttered, unknown environments with human-like strategic foresight.