SysNav: Multi-Level Systematic Cooperation Enables Real-World, Cross-Embodiment Object Navigation

Imagine you are dropped into a massive, unfamiliar office building with a mission: "Find the blue coffee mug sitting on the desk in the breakroom."

You have no map. You don't know where the breakroom is. You might be a robot on wheels, on four legs, or on two. How do you solve this without getting lost, wasting time, or crashing into walls?

This is the problem SysNav solves. The researchers at Carnegie Mellon University built a "brain" for robots that treats navigation not as a single, giant task, but as a three-level team effort. Think of it like a high-tech expedition with a Commander, a Navigator, and a Driver.

Here is how SysNav works, broken down into simple concepts:

1. The Problem: Why is this so hard?

Most robots try to learn navigation by just "watching and doing" (like a baby learning to walk). A single learned policy takes in raw sensor data and has to decide every individual step, while also making sense of the whole building.

The Flaw: In a complex real-world building, this is like trying to solve a giant jigsaw puzzle while blindfolded. It's too slow, and if the robot makes one wrong turn, it gets stuck.
The AI Trap: We have powerful AI models (Vision-Language Models) that are great at understanding language and logic, but they are weak at precise 3D spatial reasoning. If you ask one to "walk to the chair," it might get confused by a pile of boxes or a weirdly shaped table.

2. The Solution: The Three-Level Team

SysNav splits the job into three distinct roles, so each part can do what it's best at.

🧠 Level 1: The Commander (High-Level Semantic Reasoning)

Role: The Big Picture Thinker.
How it works: Instead of looking at every single brick, the Commander builds a structured map of the building's "rooms." It knows, "This is a kitchen," "That is a bedroom," and "The fridge is usually in the kitchen."

The Analogy: Imagine you are looking at a city map. You don't care about the color of every car; you care about the neighborhoods. The Commander uses a super-smart AI (a Vision-Language Model) to look at the rooms and say, "The target is a 'chair'. Chairs are usually in living rooms or offices, not in the bathroom. Let's skip the bathroom."
The Magic: It only uses the AI's brain for big decisions (which room to enter next), not for tiny steps. This saves time and prevents the AI from getting confused by small details.

🗺️ Level 2: The Navigator (Mid-Level Room-Based Planning)

Role: The Route Planner.
How it works: Once the Commander says, "Go to the Bedroom," the Navigator takes over. It treats the room as the smallest unit of decision-making.

Inside the Room: The Navigator uses classic, fast, math-based algorithms to sweep the room like a vacuum cleaner, making sure no corner is missed. It doesn't need a super-smart AI for this; it just needs to be efficient.
Between Rooms: If the robot finishes a room and hasn't found the object, the Navigator asks the Commander again: "Okay, I checked the bedroom. Where should I go next?"
The "Early Stop" Trick: If the robot is in the Living Room and suddenly sees a chair that looks exactly like the target, the Navigator can say, "Wait! Stop looking at the sofa. We found it!" and switch tasks immediately.
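
To make the hand-off between the Commander and the Navigator concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (`choose_next_room`, `sweep_room`, `find_object`), the toy building, and the `ROOM_PRIOR` lookup table, which stands in for the Vision-Language Model. The real system asks a VLM and sweeps real rooms; this toy only shows who decides what.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical common-sense table standing in for the Commander's VLM:
# for each target, rooms ordered from most to least likely.
ROOM_PRIOR: Dict[str, List[str]] = {
    "mug": ["breakroom", "office", "hallway", "bathroom"],
}


def choose_next_room(target: str, unexplored: List[str]) -> str:
    """Commander: pick the most promising room that has not been explored yet."""
    ranking = ROOM_PRIOR.get(target, [])
    # Rooms missing from the ranking sort last; ties keep their original order.
    return min(
        unexplored,
        key=lambda room: ranking.index(room) if room in ranking else len(ranking),
    )


def sweep_room(room: str, contents: Dict[str, List[str]], target: str) -> bool:
    """Navigator: 'sweep' one room and report whether the target was seen."""
    return target in contents.get(room, [])


def find_object(
    target: str, contents: Dict[str, List[str]]
) -> Tuple[Optional[str], List[str]]:
    """Alternate between Commander and Navigator until the target is found.

    Returns (room containing the target or None, rooms visited in order).
    """
    unexplored = list(contents)
    visited: List[str] = []
    while unexplored:
        room = choose_next_room(target, unexplored)  # big decision
        unexplored.remove(room)
        visited.append(room)
        if sweep_room(room, contents, target):       # efficient local search
            return room, visited
    return None, visited
```

With a toy building such as `{"hallway": [], "bathroom": ["towel"], "office": ["chair"], "breakroom": ["mug"]}`, calling `find_object("mug", building)` heads straight for the breakroom, because the Commander's prior ranks it first. The "Early Stop" trick is left out of this sketch to keep it short.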

🦶 Level 3: The Driver (Low-Level Motion Control)

Role: The Muscle.
How it works: This part just follows the orders. It takes the "Go to the door" command from the Navigator and figures out how to actually move the robot.

The Magic: Because the Commander and Navigator don't care how the robot moves, the same system can run on different robot bodies. Whether it's a wheeled robot, a dog-like robot (Unitree Go2), or a human-like robot (Unitree G1), the "Driver" just adapts to the specific body type.

3. The Real-World Test: 190 Missions

The researchers didn't just test this in a video game. They built a real system and sent it out into the real world 190 times.

The Robots: They tested it on a wheeled robot, a quadruped (four-legged robot), and a humanoid robot.
The Scale: They navigated entire buildings, not just small rooms.
The Result: It was 4 to 5 times faster than previous methods and much more successful. It was the first system to reliably find objects in large, complex buildings across different types of robots.

Summary: Why is this a big deal?

Before SysNav, trying to navigate a real building with a robot was like trying to drive a car by looking at every single pebble on the road. You would crash or get tired.

SysNav is like giving the robot a GPS and a local guide:

The GPS (Commander) tells it which neighborhood to go to based on common sense.
The Local Guide (Navigator) sweeps the neighborhood efficiently.
The Car (Driver) just drives the vehicle.

By separating the "thinking" from the "moving," SysNav lets robots navigate the messy, complex real world far more reliably than before. It's the difference between a robot that gets lost in a hallway and a robot that can find your lost keys in a multi-story office building.

1. Problem Definition

The paper addresses Real-World Object Navigation (ObjectNav), a task where an autonomous agent must locate and reach a specific target object (or an object satisfying specific semantic constraints) in an unknown indoor environment.

Key Challenges Identified:

Complexity: Real-world navigation involves complex spatial structures, long-horizon planning, and deep semantic understanding.
Limitations of End-to-End Learning: Existing approaches often rely on single-policy end-to-end models (mapping sensors directly to actions). These struggle with the diversity of real-world challenges and suffer from a scarcity of real-robot training data.
VLM Limitations: While Vision-Language Models (VLMs) offer strong semantic reasoning, they lack precise 3D spatial grounding and long-term spatial consistency. Over-relying on VLMs for fine-grained exploration leads to inefficiency (e.g., frequent backtracking).
Cross-Embodiment Gap: Most systems are tailored to specific robot morphologies, lacking generalizability across different platforms (e.g., wheeled, quadruped, humanoid).

The authors argue that ObjectNav should be treated as a system-level problem requiring a decoupled, multi-level architecture rather than a single learning policy.

2. Methodology: SysNav Architecture

SysNav is a three-level hierarchical system designed to decouple semantic reasoning, navigation planning, and motion control. This structure allows each module to specialize, ensuring robustness and generalizability.

A. High-Level: Semantic Reasoning

This layer constructs a structured representation of the environment and leverages VLMs for high-level guidance.

Structured Scene Representation: The environment is modeled as a graph with three layers of nodes:
1. Room Nodes ($v_r$): Top-level units representing distinct rooms (identified via wall detection and point cloud analysis). Attributes include room masks, categories, and representative images.
2. Viewpoint Nodes ($v_v$): Mid-level nodes representing visited locations with coverage regions. They store panoramic images and geometric data.
3. Object Nodes ($v_o$): Bottom-level nodes for detected object instances (category, 3D point cloud, bounding box, attributes).
- Edges: The graph connects rooms, viewpoints, and objects based on connectivity, containment, visibility, and spatial relationships (e.g., "on top of").
VLM Reasoning: The VLM queries this structured graph to provide semantic-grounded guidance. It does not control low-level movement but answers high-level questions like "Which room should be explored next?" or "Should we stop exploring the current room?"
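
The three-layer graph and the room-level query can be sketched as plain Python dataclasses. This is an illustrative data layout only, not the paper's implementation: the class names, field names, and the `summarize_for_vlm` helper are assumptions, the node attributes are reduced to a few fields, and the text summary stands in for whatever prompt format (including images) the actual system passes to the VLM.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectNode:
    """Bottom level (v_o): one detected object instance."""
    category: str
    center: Tuple[float, float, float]  # simplified stand-in for the 3D bounding box
    attributes: List[str] = field(default_factory=list)


@dataclass
class ViewpointNode:
    """Mid level (v_v): a visited location and the objects visible from it."""
    position: Tuple[float, float]
    visible_objects: List[ObjectNode] = field(default_factory=list)


@dataclass
class RoomNode:
    """Top level (v_r): a room, its viewpoints, and its connectivity edges."""
    room_id: int
    category: str
    explored: bool = False
    viewpoints: List[ViewpointNode] = field(default_factory=list)
    neighbors: List[int] = field(default_factory=list)  # ids of connected rooms


def summarize_for_vlm(rooms: List[RoomNode], target: str) -> str:
    """Flatten the graph into a short room-level text query (hypothetical format)."""
    lines = [f"Target: {target}"]
    for room in rooms:
        seen = sorted(
            {obj.category for vp in room.viewpoints for obj in vp.visible_objects}
        )
        status = "explored" if room.explored else "unexplored"
        lines.append(
            f"Room {room.room_id} ({room.category}, {status}): "
            f"objects={seen}; connects_to={room.neighbors}"
        )
    lines.append("Which unexplored room should be explored next?")
    return "\n".join(lines)
```

Keeping the query at room granularity reflects the design choice described above: the VLM sees a compact, structured summary rather than raw sensor data, and it is only asked room-level questions.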

B. Mid-Level: Room-Based Navigation

This layer acts as the bridge, using the high-level guidance to plan global strategies while using classical algorithms for local efficiency.

Hierarchical Strategy: Rooms are treated as the minimal decision-making units for the VLM.
- Cross-Room Navigation: The VLM selects the next room to explore based on semantic priors (e.g., "fridges are in kitchens"). It also handles Early-Stop Navigation, deciding to switch rooms immediately if a target is spotted in a new room.
- In-Room Exploration: Once inside a room, the system switches to efficient, classical exploration algorithms (local and global planning with TSP solvers) to cover the space. This avoids the inefficiency of using VLMs for fine-grained, in-room path planning.
Modes:
- Room-query: Selects the best candidate room from unexplored areas.
- Early-stop: Interrupts current exploration if a new room contains a better candidate.
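
In-room exploration amounts to choosing an order in which to visit candidate viewpoints, which is why TSP solvers appear at this level. The sketch below uses a greedy nearest-neighbor ordering as a simple stand-in; it is a well-known TSP heuristic, not the solver used in the paper, and the function names and the 2D point representation are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def tour_length(start: Point, order: List[Point]) -> float:
    """Total straight-line distance when visiting `order` from `start`."""
    total, current = 0.0, start
    for point in order:
        total += math.dist(current, point)
        current = point
    return total


def nearest_neighbor_order(start: Point, viewpoints: List[Point]) -> List[Point]:
    """Greedy TSP heuristic: always move to the closest unvisited viewpoint."""
    remaining = list(viewpoints)
    order: List[Point] = []
    current = start
    while remaining:
        nearest = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(nearest)
        order.append(nearest)
        current = nearest
    return order
```

For a robot at `(0, 0)` and viewpoints `[(5, 0), (1, 0), (3, 0)]`, the greedy order is `[(1, 0), (3, 0), (5, 0)]` with length 5.0, against 11.0 for visiting them in the listed order. No VLM call is involved, which is the point of confining the VLM to room-level decisions.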

C. Low-Level: Base Autonomy

This layer executes the planned waypoints using embodiment-specific motion control.

Cross-Embodiment Design: It translates waypoints into specific motion commands (linear/angular velocity) for different robot types.
Features: Includes waypoint following, collision avoidance, and terrain traversability analysis.
Deployment: Successfully adapted for a wheeled robot, a quadruped (Unitree Go2), and a humanoid (Unitree G1).
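
A minimal sketch of the waypoint-to-velocity step follows, assuming a simple proportional controller and per-robot speed limits. The `Embodiment` class, the gains `k_lin` and `k_ang`, and any limit values passed in are hypothetical and not taken from the paper; collision avoidance and terrain traversability analysis are deliberately omitted.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Embodiment:
    """Per-robot limits (hypothetical values are supplied by the caller)."""
    name: str
    max_linear: float   # m/s
    max_angular: float  # rad/s


def waypoint_to_cmd(
    pose: Tuple[float, float, float],   # (x, y, yaw) in the map frame
    waypoint: Tuple[float, float],
    robot: Embodiment,
    k_lin: float = 0.8,                 # assumed proportional gains
    k_ang: float = 1.5,
) -> Tuple[float, float]:
    """Return (linear, angular) velocity that steers the robot toward the waypoint."""
    x, y, yaw = pose
    dx, dy = waypoint[0] - x, waypoint[1] - y
    distance = math.hypot(dx, dy)
    heading_error = math.atan2(dy, dx) - yaw
    # Wrap the error into [-pi, pi] so the robot turns the short way round.
    heading_error = math.atan2(math.sin(heading_error), math.cos(heading_error))
    # Drive forward only to the extent that the robot already faces the waypoint.
    linear = k_lin * distance * max(0.0, math.cos(heading_error))
    angular = k_ang * heading_error
    # Clamp to the limits of this particular embodiment.
    linear = min(linear, robot.max_linear)
    angular = max(-robot.max_angular, min(angular, robot.max_angular))
    return linear, angular
```

Only the `Embodiment` instance changes between platforms in this sketch; the planner above it keeps emitting the same waypoints, which is the decoupling the cross-embodiment design relies on.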

3. Key Contributions

System-Level Formulation: Proposes a novel three-level architecture that explicitly decouples semantic reasoning, planning, and control, addressing the limitations of end-to-end policies in real-world settings.
Structured Scene Representation: Introduces a multi-granularity graph (Room-Viewpoint-Object) that organizes environmental data, enabling VLMs to reason effectively without being overwhelmed by raw sensor noise.
Hierarchical Room-Based Strategy: A critical innovation where VLMs are restricted to room-level decisions, while classical algorithms handle in-room exploration. This balances the semantic strengths of VLMs with the spatial efficiency of traditional methods.
Cross-Embodiment Generalization: Demonstrates a single system architecture working seamlessly across three distinct robot morphologies (wheeled, quadruped, humanoid).
Real-World Scale: Achieves the first reliable, building-scale long-range ObjectNav in complex real-world environments.

4. Experimental Results

Real-World Experiments

Scale: 190 real-world episodes conducted on three robot platforms (Mecanum wheeled, Unitree Go2, Unitree G1) across diverse building-scale environments.
Performance:
- Success Rate (SR): Achieved 100% on "Easy" tasks, 97.5% on "Medium," and 98.3% on "Hard" (multi-room) tasks.
- Comparison with Baselines: Significantly outperformed the baselines (VLFM, InstructNav). In hard settings, SR improved by 61.1% and Success Penalized by Time (SPT) by 51.1%, while Average Time (AT) dropped by about 30 seconds.
- Generalization: Successfully handled complex constraints (e.g., "find the white chair in the bedroom," "microwave near the refrigerator") across different robot bodies.

Simulation Benchmarks

Datasets: Evaluated on HM3D-v1, HM3D-v2, MP3D, and HM3D-OVON (8,195 episodes total).
State-of-the-Art (SOTA): SysNav achieved the best performance across all benchmarks.
- On HM3D-OVON, it improved Success Rate by 14.1% and SPL by 6.5% over the previous best (ApexNav).
- On HM3D-v2, it achieved an SR of 80.8% (vs. 76.2% for ApexNav).
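
For readers unfamiliar with the metrics, the sketch below computes SR and SPL over a set of episodes, assuming SPL here is the standard "Success weighted by Path Length" metric used in ObjectNav benchmarks. The `Episode` class and its field names are illustrative assumptions, and the numbers in the usage note are made up rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    success: bool
    shortest_path: float  # length of the optimal path to the target (m)
    agent_path: float     # length of the path the agent actually travelled (m)


def success_rate(episodes: List[Episode]) -> float:
    """SR: fraction of episodes in which the target was reached."""
    return sum(ep.success for ep in episodes) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """SPL: success weighted by the ratio of shortest to actual path length."""
    total = 0.0
    for ep in episodes:
        if ep.success:
            # The max() keeps each term at or below 1.
            total += ep.shortest_path / max(ep.agent_path, ep.shortest_path)
    return total / len(episodes)
```

For three made-up episodes (one optimal success, one success with a path twice the optimal length, one failure), SR is 2/3 and SPL is 0.5: a detour lowers SPL even when the episode succeeds, so SPL captures efficiency as well as success.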

5. Significance

First of Its Kind: SysNav is the first system to reliably and efficiently complete building-scale long-range object navigation in complex real-world environments.
Practical Viability: By decoupling VLM reasoning from low-level control, the system sidesteps the latency and spatial grounding limitations of current VLMs, making it viable for real-time deployment.
Scalability: The modular design allows the system to be deployed on various robot embodiments without retraining the core navigation logic, paving the way for general-purpose service robots in complex indoor environments (e.g., hospitals, large offices).
Methodological Shift: The paper advocates for a shift from "single-policy" learning to "multi-level systematic cooperation," offering a blueprint for future real-world robotic navigation systems.