WildOS: Open-Vocabulary Object Search in the Wild

WildOS is a unified system for robust, long-range autonomous navigation in unstructured environments. It integrates a sparse geometric navigation graph with a foundation-model-based vision module (ExploRFM) and a particle-filter localization method, enabling efficient, open-vocabulary object search through semantically informed and geometrically safe exploration.

Hardik Shah, Erica Tevere, Deegan Atha, Marcel Kaufmann, Shehryar Khattak, Manthan Patel, Marco Hutter, Jonas Frey, Patrick Spieler

Published 2026-02-24

Imagine you are sending a robot on a mission to find a specific object, like a "red fire hydrant" or a "blue house," in a massive, wild forest or a messy city street. You don't have a map, and the robot can only see a few meters ahead with its "eyes" (sensors). Everything beyond that is a foggy mystery.

This is the problem WildOS solves. It's a new system that teaches robots how to explore the wild world not just by feeling their way around obstacles, but by thinking about what they see, much like a human would.

Here is how it works, broken down into simple concepts and analogies:

1. The Problem: The "Tunnel Vision" Robot

Most robots today are like people wearing thick foggy glasses. They can see the ground right in front of them perfectly (geometric sensing), but once they look past a few meters, they are blind.

  • The Old Way: If a robot sees a fence blocking the direct path to a goal, it just turns around and tries to go around the fence blindly. It doesn't know if there's a nice open gate 50 meters away that it can't see yet. It's "myopic" (short-sighted).
  • The Vision Problem: Some robots try to use cameras to see far away, but they have no memory. They might see a path, take a step, forget they saw it, and then walk in circles, going back and forth over the same ground.

2. The Solution: WildOS (The "Smart Explorer")

WildOS combines two superpowers: Geometric Memory (a map of where it's been) and Visual Reasoning (a brain that understands images).

Think of WildOS as a hiker with two tools:

  1. A Sketchbook (The Navigation Graph): Instead of drawing a detailed, heavy map of every tree and rock (which takes too much memory), the robot draws a simple connect-the-dots map. It marks safe spots and the edges of what it knows. This is light, fast, and remembers where it has already been so it doesn't get lost.
  2. A Crystal Ball (ExploRFM): This is the robot's "brain" based on advanced AI. It looks at the camera image and predicts three things far beyond what the robot can physically touch:
    • Is it safe to walk there? (e.g., "That looks like water, not grass.")
    • Is there a path ahead? (e.g., "I see a gap between those trees.")
    • Is that the object I'm looking for? (e.g., "That blurry shape in the distance looks like a house.")
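To make the "sketchbook + crystal ball" pairing concrete, here is a minimal sketch of what such a sparse navigation graph might look like: lightweight nodes that store a position, a frontier flag, and the three hypothetical ExploRFM scores described above. The class and field names are illustrative assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class GraphNode:
    """One 'dot' in the connect-the-dots map (hypothetical structure)."""
    position: tuple                 # (x, y) in the world frame
    is_frontier: bool = False       # edge of known space, still unexplored
    traversability: float = 0.0     # "is it safe to walk there?" (0..1)
    path_likelihood: float = 0.0    # "is there a path ahead?" (0..1)
    target_similarity: float = 0.0  # "does it look like the goal?" (0..1)
    neighbors: list = field(default_factory=list)  # indices of linked nodes

class NavigationGraph:
    """Sparse connect-the-dots map: safe spots plus frontier candidates."""

    def __init__(self):
        self.nodes = []

    def add_node(self, node, connect_to=None):
        idx = len(self.nodes)
        self.nodes.append(node)
        if connect_to is not None:  # link back to the node we came from
            self.nodes[connect_to].neighbors.append(idx)
            node.neighbors.append(connect_to)
        return idx

    def frontiers(self):
        """Candidates the robot can still expand toward."""
        return [n for n in self.nodes if n.is_frontier]
```

Because the graph stores only a handful of scored waypoints rather than a dense map of every tree and rock, it stays small in memory while still remembering everywhere the robot has been.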

3. How They Work Together: The "Scored Map"

The magic happens when the robot combines its sketchbook with its crystal ball.

  • The Scenario: The robot is at a fork in the road. One path goes straight toward the goal but hits a wall. The other path curves away but looks like it leads through a beautiful, open meadow.
  • The Decision:
    • A dumb robot would just go straight because the goal is in that direction.
    • A WildOS robot looks at the "meadow" path. Its AI says, "That path is safe, it's open, and it looks promising." It gives that path a high score.
    • It then updates its sketchbook, marking that path as the best way to go, even though it's not the straight line to the goal.
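As a rough sketch of that decision, a frontier's score can blend progress toward the goal with the vision module's safety, openness, and semantic scores. The weights and score names below are hypothetical placeholders, not the paper's actual objective; the point is only that a promising-looking detour can outscore the blocked straight-line path.

```python
import math

def score_frontier(frontier_pos, goal_pos, safety, openness, semantic_sim,
                   w_goal=1.0, w_safe=1.0, w_open=1.0, w_sem=2.0):
    """Blend goal progress with visual scores (hypothetical weighting)."""
    # Goal progress: closer frontiers to the goal earn a larger term.
    dist = math.hypot(goal_pos[0] - frontier_pos[0],
                      goal_pos[1] - frontier_pos[1])
    goal_term = 1.0 / (1.0 + dist)
    return (w_goal * goal_term + w_safe * safety
            + w_open * openness + w_sem * semantic_sim)

# The fork-in-the-road scenario: (position, safety, openness, semantic_sim)
frontiers = {
    "straight_to_wall": ((10, 0), 0.2, 0.1, 0.0),  # close, but blocked
    "open_meadow":      ((8, 6),  0.9, 0.8, 0.4),  # curves away, promising
}
goal = (12, 0)
best = max(frontiers,
           key=lambda k: score_frontier(frontiers[k][0], goal,
                                        *frontiers[k][1:]))
# → "open_meadow": the safe, open detour beats the direct dead end
```

The straight path wins on distance alone, but its low safety and openness scores drag it down, so the meadow frontier gets written into the sketchbook as the best next step.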

4. Finding the "Invisible" Target

What if the target (like a "NASA sign") is 200 meters away, far beyond the robot's sensors?

  • The Trick: The robot takes a picture, sees the sign, and then uses a technique called Triangulation. Imagine holding your thumb up and closing one eye, then the other; your thumb seems to jump. The robot does this with its cameras from different spots.
  • Even though it can't measure the exact distance with lasers (because it's too far), it uses geometry to estimate: "Okay, based on where I saw it from here and there, the sign is probably over there." It keeps a cloud of guesses about the target's location (the particle filter) and starts walking toward the most likely one.
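The geometry behind the thumb trick can be written down directly. This is a simplified 2D bearing-only triangulation, assuming two viewpoints and a ray direction toward the target from each; the real system fuses many noisy sightings, but the core idea of intersecting sight lines is the same.

```python
import numpy as np

def triangulate(p1, bearing1, p2, bearing2):
    """Intersect two sight lines to estimate a distant target's position.

    p1, p2:             robot positions (x, y) at the two viewpoints
    bearing1, bearing2: direction to the target at each viewpoint (radians)
    """
    d1 = np.array([np.cos(bearing1), np.sin(bearing1)])
    d2 = np.array([np.cos(bearing2), np.sin(bearing2)])
    # Solve p1 + t1*d1 == p2 + t2*d2 for the ray lengths t1, t2.
    A = np.column_stack([d1, -d2])
    t = np.linalg.solve(A, np.array(p2, float) - np.array(p1, float))
    return np.array(p1, float) + t[0] * d1

# Seen at 45° from the origin, then at 135° after moving 10 m to the right:
estimate = triangulate((0, 0), np.pi / 4, (10, 0), 3 * np.pi / 4)
# → approximately (5, 5): the sight lines cross 5 m ahead and 5 m left
```

If the two bearings are nearly parallel (the robot barely moved sideways), the system `A` becomes ill-conditioned and the estimate blows up, which is why accumulating sightings from well-separated viewpoints matters.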

5. Real-World Results

The researchers tested this on a real robot (a Boston Dynamics Spot dog) in messy off-road areas and cities.

  • The Test: They asked the robot to find things like a "garbage can," a "golf cart," or a "NASA logo."
  • The Result: WildOS was much faster and smarter than robots that only used maps or only used cameras.
    • When it hit a dead end, it remembered it had been there and turned back to try a different path (unlike the camera-only robot, which kept walking in circles).
    • It found shortcuts through gaps in fences that other robots missed because they were too focused on the straight line to the goal.

The Big Picture

WildOS is like giving a robot a human-like intuition. It doesn't just react to what's touching its feet; it looks ahead, understands the scene, remembers where it's been, and makes smart guesses about where to go next. It's a giant step toward robots that can truly explore the wild world on their own, finding lost items or inspecting dangerous areas without needing a human to hold their hand.
