PM-Nav: Priori-Map Guided Embodied Navigation in Functional Buildings

The paper introduces PM-Nav, a novel framework that leverages priori-semantic maps and hierarchical chain-of-thought prompting to overcome the challenges of language-driven navigation in functional buildings with highly similar features, achieving substantial performance improvements over existing methods in both simulation and real-world environments.

Jiang Gao, Xiangyu Dong, Haozhou Li, Haoran Zhao, Yaoming Zhou, Xiaoguang Ma

Published Wed, 11 Ma

Imagine you are trying to find the "One-Stop Service Centre" in a massive, brand-new government office building. The building is huge, the hallways are all exactly the same, the doors look identical, and there are no unique decorations. If you asked a standard robot to "go find the service centre," it would likely get lost immediately, spinning in circles because every corner looks like every other corner.

This paper introduces PM-Nav, a new way to help robots navigate these confusing "Functional Buildings" (like schools, hospitals, and offices). Think of PM-Nav as giving the robot a super-powered GPS and a human-like thinking process combined.

Here is how it works, broken down into simple analogies:

1. The Problem: The Robot is "Blind" to the Big Picture

Current robots are like tourists who only look at their feet. They see a wall, then another wall, and try to guess where they are. In a house with unique furniture (a red sofa, a blue rug), this works fine. But in a functional building where every room is a white box with a white door, the robot gets confused. It lacks the "big picture" knowledge that humans use.

2. The Solution: The "Annotated Map" (The Priori-Map)

Instead of just showing the robot a raw photo of the building, the researchers convert the building's blueprint into a special, easy-to-read map.

  • The Analogy: Imagine you have a complex subway map. A normal map just shows lines. The PM-Nav map is like a subway map where every station is labeled with a number, and the tracks between them are named "Segment 13" or "Turn 2."
  • What it does: It turns a confusing visual mess into a clear, step-by-step list of instructions (e.g., "Go from Room 14 to Room 7 via Segment 13"). This helps the robot understand the logic of the building, not just the pictures.
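The map-as-instructions idea can be sketched as a labeled graph plus a shortest-path search. The room names, segment labels, and dictionary layout below are hypothetical illustrations, not the paper's actual map format:

```python
from collections import deque

# Hypothetical priori-map: nodes are rooms, edges are named segments.
# (Illustrative structure only; the paper's real map encoding may differ.)
PRIORI_MAP = {
    "Room 14": {"Room 7": "Segment 13", "Room 2": "Segment 4"},
    "Room 7":  {"Room 14": "Segment 13", "Room 3": "Segment 9"},
    "Room 2":  {"Room 14": "Segment 4"},
    "Room 3":  {"Room 7": "Segment 9"},
}

def plan_route(graph, start, goal):
    """Breadth-first search that returns step-by-step text instructions."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        room, steps = queue.popleft()
        if room == goal:
            return steps
        for neighbor, segment in graph[room].items():
            if neighbor not in visited:
                visited.add(neighbor)
                step = f"Go from {room} to {neighbor} via {segment}"
                queue.append((neighbor, steps + [step]))
    return None  # no path found

print(plan_route(PRIORI_MAP, "Room 2", "Room 3"))
```

The point of the sketch: once the blueprint is a labeled graph, "navigate" reduces to reading off a short list of named hops instead of interpreting visually identical corridors.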

3. The Brain: The "Chain of Thought" (H-CoT)

Once the robot has this special map, it doesn't just guess. It uses a step-by-step thinking process (called a Hierarchical Chain-of-Thought).

  • The Analogy: Think of a human giving directions to a friend. You don't just say "Go there." You say: "First, walk straight until you see the big red door. Then, turn left at the fork. Then, walk past the elevator."
  • What it does: The robot's AI (a Large Language Model) looks at the special map and breaks the journey down into tiny, logical steps. It plans the whole route before it even starts moving, ensuring it knows exactly which "fork" in the road to take.
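A minimal sketch of what "planning the whole route before moving" might look like as a prompt. The template, wording, and sub-questions below are invented for illustration; the paper's actual H-CoT prompt is not reproduced here:

```python
# Hypothetical H-CoT prompt builder (illustrative only).
route = [
    "Go from Room 14 to Room 7 via Segment 13",
    "Go from Room 7 to Room 3 via Segment 9",
]

def build_hcot_prompt(route, target):
    """Assemble a hierarchical prompt: high-level plan first, then
    per-step reasoning questions for the LLM to answer in order."""
    steps = "\n".join(f"  Step {i + 1}: {s}" for i, s in enumerate(route))
    return (
        f"You are navigating a building to reach: {target}.\n"
        "High-level plan read from the priori-map:\n"
        f"{steps}\n"
        "For the current step, reason in order:\n"
        "  1. Which landmarks should be visible now?\n"
        "  2. Which direction leads to the next node?\n"
        "  3. What low-level action (turn/move) follows?\n"
    )

print(build_hcot_prompt(route, "One-Stop Service Centre"))
```

The hierarchy is the key design choice: the route from the map fixes the coarse skeleton, so the LLM only reasons about one small, local decision at a time instead of the whole building at once.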

4. The Hands and Eyes: The "Teamwork" System

This is where the magic happens. The robot doesn't rely on just one brain or one camera. It uses a three-person team to make decisions:

  1. The Strategist (VLM): This is the big brain. It looks at the panoramic view (a 360-degree photo) and says, "I think the target is roughly 30 degrees to the left." It's good at the big picture but a bit fuzzy on details.
  2. The Spotter (GroundingDINO & SAM): These are the sharp-eyed assistants. They look at the Strategist's guess and say, "Wait, I see a specific sign or door frame right there." They pinpoint the exact location.
  3. The Driver (PixelNav): This is the muscle. It takes the "rough guess" from the Strategist and the "exact pin" from the Spotter and calculates the perfect steering angle to move the robot forward without hitting a wall.

  • The Analogy: It's like a treasure hunt where one person holds the map (Strategist), another person spots the X on the ground (Spotter), and a third person actually digs the hole (Driver). They talk to each other constantly to make sure they don't dig in the wrong spot.
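The handoff between the three modules can be sketched as a simple pipeline. The real system uses a VLM, GroundingDINO+SAM, and PixelNav; here each is replaced by a hypothetical stub so the data flow from coarse heading to pixel to steering angle is runnable:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    panorama_width_px: int  # width of the 360-degree panorama image

def strategist_heading(obs):
    """VLM stand-in: returns a coarse target heading in degrees (stubbed)."""
    return 30.0  # "the target is roughly 30 degrees to the left"

def spotter_pixel(obs, coarse_heading_deg):
    """Detector stand-in: refines the coarse heading to a pixel column,
    assuming headings in [-180, 180) map linearly across the panorama."""
    frac = (coarse_heading_deg + 180.0) / 360.0
    return int(frac * obs.panorama_width_px)

def driver_steering(obs, target_px):
    """Controller stand-in: converts the pixel goal back to a steering angle
    relative to the panorama's center column."""
    center = obs.panorama_width_px / 2
    return (target_px - center) / obs.panorama_width_px * 360.0

obs = Observation(panorama_width_px=1440)
coarse = strategist_heading(obs)          # big-picture guess
pixel = spotter_pixel(obs, coarse)        # exact pin in the image
angle = driver_steering(obs, pixel)       # actuation command
print(pixel, round(angle, 1))
```

Note how the Spotter's pixel round-trips back to the Strategist's 30-degree guess: each module consumes the previous one's output in its own coordinate frame, which is the "constant conversation" the analogy describes.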

5. The Results: From "Lost" to "Master Navigator"

The researchers tested this in both computer simulations and a real university building.

  • The Old Way: Other robots (like SG-Nav) were like lost puppies. In difficult tasks, they succeeded 0% of the time. They just couldn't tell one room from another.
  • The PM-Nav Way: With the special map and the teamwork system, the robot's success rate skyrocketed. In some tests, it was 6 to 12 times better than the old methods. It could navigate complex, confusing buildings where other robots gave up.

Summary

PM-Nav is like giving a robot a human's ability to read a map and think ahead, combined with a team of specialists to spot landmarks and steer precisely. It solves the problem of "looking the same" in big buildings by turning the building's layout into a clear, logical story that the robot can follow step-by-step.