FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment

Imagine you are sending a tiny, autonomous drone (like a high-tech bee) into a burning building to save people. The drone needs to fly through smoke and darkness, build a mental map of the room, and find specific things like a fire extinguisher or an exit without anyone telling it exactly what those things look like in advance.

This is the problem the paper "FindAnything" tackles. Here is a simple breakdown of how the authors did it, using some everyday analogies.

The Problem: The "Heavy Backpack"

Previous robots had two main ways of understanding the world:

The "Strict List" Robot: It only knew a fixed list of 100 things (chair, table, dog). If you asked it to find a "fire extinguisher," it would say, "I don't know what that is."
The "Super-Brain" Robot: It could understand any word you gave it (thanks to AI models like CLIP), but it was so heavy and slow that it needed a massive supercomputer to run. You couldn't put it on a small drone; the drone would crash from the weight.

FindAnything is the solution that gives the robot a "Super-Brain" but packs it into a "Lightweight Backpack" that fits on a tiny drone.

The Solution: The "Object-Centric" Map

Instead of trying to remember every single pixel of the room (which is like trying to memorize every grain of sand on a beach), the system uses a clever trick called Object-Centric Mapping.

Think of the robot's memory like a digital filing cabinet:

Old Way: The robot tries to write a detailed description of the entire room on every single page of a notebook. This fills up the notebook instantly.
FindAnything Way: The robot looks at the room and says, "That's a chair, that's a door, that's a fire extinguisher." It creates a single file card for each object.
- On the card for the "fire extinguisher," it stores a compact "fingerprint" (a mathematical feature vector) that represents what a fire extinguisher looks like.
- It ignores the empty space between the objects.

This is like organizing a library by book titles instead of trying to memorize every single letter in every book. It saves massive amounts of space.

The Secret Sauce: "Over-Segmentation"

How does the robot know where one object ends and another begins? It uses a tool called eSAM (Efficient Segment Anything Model), a lightweight segmentation model that outlines candidate objects in each camera image.

Imagine the robot is looking at a car.

A normal AI might just see "Car."
FindAnything sees "Car," but also "Wheel," "Door," and "Headlight" as separate pieces.

Why do this? Because if a firefighter asks, "Where is the wheel?" the robot can find the specific wheel. If they ask, "Where is the car?" the robot can group all the wheel and door pieces together to find the whole car. It's like having a Lego set where you can build a specific piece or the whole castle, depending on what you need.
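For readers who like to see the idea in code, here is a minimal sketch of one simple way to prefer smaller pieces when segment proposals overlap. It is an illustration only, not the paper's tracking procedure: the function name `paint_finest_segments` and the toy "car" and "wheel" masks are made up for this example.

```python
import numpy as np


def paint_finest_segments(masks: list[np.ndarray]) -> np.ndarray:
    """Build a label image in which smaller masks overwrite larger ones.

    Returns an integer image: 0 means unlabeled, i + 1 means masks[i] won.
    """
    labels = np.zeros(masks[0].shape, dtype=np.int32)
    # Paint from largest to smallest so that fine parts end up on top.
    order = sorted(range(len(masks)), key=lambda i: int(masks[i].sum()), reverse=True)
    for i in order:
        labels[masks[i]] = i + 1
    return labels


# Toy 4x6 image: a large "car" mask and a small "wheel" mask inside it.
car = np.zeros((4, 6), dtype=bool)
car[1:4, 0:6] = True
wheel = np.zeros((4, 6), dtype=bool)
wheel[3, 1:3] = True

labels = paint_finest_segments([car, wheel])

# The whole car can still be recovered by grouping both labels together.
whole_car = np.isin(labels, [1, 2])
```

In this toy case the wheel pixels keep their own label, while the union of both labels still covers the full car.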

How It Works in Real Life

Flying and Scanning: The drone flies around, taking photos.
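The "Glue": It uses a visual-inertial SLAM (Simultaneous Localization and Mapping) system, which combines camera images with motion-sensor readings and needs no GPS, to know where it is in 3D space.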
The "Labeling": As it sees objects, it breaks them into pieces (segments) and attaches a "language tag" to them. It doesn't just store the word "Fire Extinguisher"; it stores a mathematical fingerprint of what a fire extinguisher looks like.
The Query: When a human operator types "Find the exit," the robot compares that word to the fingerprints in its map. It lights up the 3D map where the "exit" fingerprint matches, guiding the drone there.

Why This is a Big Deal

Speed: It runs in real time. The drone doesn't stop to think; it just flies and maps.
Memory: It uses roughly 60% less memory than a comparable earlier method (RayFronts). This means you can put it on a small, cheap drone (like a Micro Aerial Vehicle) instead of a giant robot.
Flexibility: You can ask it to find anything in English. "Find the red chair," "Find the broken window," or "Find the cat." It doesn't need to be pre-trained on those specific items.

The Bottom Line

FindAnything is like giving a tiny drone a human-like ability to understand language and objects, but stripping away the heavy baggage so it can fly fast and far. It turns a chaotic, unknown room into a searchable 3D database where you can ask, "Where is X?" and get an instant answer, making it perfect for dangerous jobs like search-and-rescue in fires or earthquakes.

Here is a detailed technical summary of the paper "FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment."

1. Problem Statement

Robots, particularly Micro Aerial Vehicles (MAVs), are critical for Search and Rescue (S&R) and disaster response in unknown, hazardous environments. Effective operation in these scenarios requires:

Real-time, large-scale mapping: The ability to build maps of complex environments (e.g., multi-story buildings) online.
Open-vocabulary semantic understanding: The map must allow operators to query for specific objects or concepts (e.g., "fire extinguisher," "exit") using natural language, without being limited to a pre-defined set of classes.
Resource constraints: The system must run on-board resource-constrained hardware (like MAVs with limited GPU and RAM) while maintaining geometric accuracy for navigation.

Current Limitations:

Class-based semantic maps: Traditional methods use a fixed set of classes, limiting expressiveness.
Vision-Language (VL) integration: While models like CLIP offer open-vocabulary capabilities, storing high-dimensional feature embeddings (hundreds of floats) at the voxel level in a 3D map leads to prohibitive memory usage and computational costs, making them unsuitable for large-scale, real-time deployment on edge devices.
Scalability: Existing open-vocabulary 3D mapping approaches often fail to scale to large environments or require offline training (e.g., NeRFs), preventing online robot deployment.
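
To make the memory argument concrete, the back-of-the-envelope sketch below compares storing one embedding per voxel with storing one embedding per segment. Only the 768-dimensional embedding size comes from this summary; the room size, voxel resolution, and segment count are hypothetical numbers chosen for illustration. The calculation also counts every voxel of a dense grid, so it overstates what a map that only allocates voxels near surfaces would need.

```python
BYTES_PER_FLOAT32 = 4
FEATURE_DIM = 768  # CLIP ViT-L/14 embedding size mentioned in this summary


def feature_memory_bytes(num_entries: int, dim: int = FEATURE_DIM) -> int:
    """Bytes needed to store one float32 feature vector per entry."""
    return num_entries * dim * BYTES_PER_FLOAT32


# Hypothetical scene: a 10 m x 10 m x 3 m room at 0.1 m voxel resolution.
VOXELS_PER_METER = 10
num_voxels = (10 * VOXELS_PER_METER) * (10 * VOXELS_PER_METER) * (3 * VOXELS_PER_METER)
num_segments = 500  # assumed number of tracked segments in that room

voxel_level = feature_memory_bytes(num_voxels)
segment_level = feature_memory_bytes(num_segments)

print(f"voxel-level:   {voxel_level / 1e6:.1f} MB")
print(f"segment-level: {segment_level / 1e6:.1f} MB")
```

Under these assumptions the voxel-level store needs about 921.6 MB, against about 1.5 MB for the segment-level store. A segment-level map still has to record which segment each voxel belongs to, but a single integer ID per voxel is far cheaper than a 768-float vector.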

2. Methodology: FindAnything

The authors propose FindAnything, a framework that integrates vision-language information into dense volumetric submaps using an object-centric approach.

Core Architecture

The system consists of three main modules (see Fig. 2 in the paper):

VI-SLAM (State Estimation): Uses the OKVIS2-X system to estimate robot poses and track the environment. It employs a submap-based strategy (using the supereight2 framework) to handle large-scale environments and correct drift via loop closures.
Vision-Language Feature Extraction:
- Segmentation: Uses eSAM (Efficient Segment Anything Model), a lightweight foundation model, to generate object proposals (binary masks) from RGB images.
- Feature Encoding: Uses CLIP (ViT-L/14) to extract 768-dimensional feature embeddings for the image.
Object-Centric Volumetric Mapping:
- Instead of aggregating features at the voxel level, FindAnything aggregates them at the object/segment level.
- Segment Tracking: As the robot moves, segments from eSAM are tracked against the current map by projecting map objects into the image plane.
- Oversegmentation Strategy: The system employs an "as-fine-as-possible" strategy. It fuses new eSAM proposals with existing map segments, prioritizing smaller segments to allow fine-grained queries (e.g., distinguishing a "wheel" from a "car") while maintaining the ability to group them for broader concepts.
- Feature Fusion: For each segment ID $k$, the system maintains a weighted average of the CLIP features ($\bar{f}_k$) and a pixel count ($N_k$). When a new frame is processed, features are updated using a weighted average formula (Eq. 1 & 2) based on the number of pixels associated with that segment in the current view. This decouples voxel resolution from language representation, significantly reducing memory usage.
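
The feature fusion step can be sketched as a pixel-count-weighted running mean. The paper's Eq. 1 & 2 are not reproduced in this summary, so the update below is an assumed standard form, not a transcription of the authors' equations. The names `SegmentRecord` and `fuse_observation` are hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SegmentRecord:
    """Hypothetical per-segment record: mean feature and pixel count."""
    mean_feature: np.ndarray  # running weighted mean, playing the role of \bar{f}_k
    pixel_count: int          # accumulated pixel count, playing the role of N_k


def fuse_observation(segments: dict[int, SegmentRecord], segment_id: int,
                     feature: np.ndarray, num_pixels: int) -> None:
    """Fold one frame's feature for a segment into its running mean.

    Assumed update: mean <- (N * mean + n * f) / (N + n), then N <- N + n.
    """
    if num_pixels <= 0:
        return  # segment not visible in this frame, nothing to fuse
    feature = np.asarray(feature, dtype=np.float32)
    record = segments.get(segment_id)
    if record is None:
        # First sighting: the observation becomes the initial mean.
        segments[segment_id] = SegmentRecord(feature.copy(), num_pixels)
        return
    total = record.pixel_count + num_pixels
    record.mean_feature = (
        record.pixel_count * record.mean_feature + num_pixels * feature
    ) / total
    record.pixel_count = total


# Toy usage with 2-D stand-ins for the 768-D CLIP features.
segments: dict[int, SegmentRecord] = {}
fuse_observation(segments, 7, np.array([1.0, 0.0]), 100)
fuse_observation(segments, 7, np.array([0.0, 1.0]), 300)
```

After the two toy observations, segment 7 holds the mean `[0.25, 0.75]` with a pixel count of 400: the second view covered three times as many pixels, so it carries three times the weight.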

Downstream Task Integration

The system integrates with an autonomous exploration planner. The planner uses the semantic map to guide the MAV:

Query Processing: Natural language queries are converted to CLIP embeddings.
Target Selection: The system calculates cosine similarity between the query embedding and the stored segment embeddings.
Sampling: The planner samples candidate next views inside 3D cubes centered on high-similarity segments, prioritizing exploration of areas of interest (e.g., searching for a "fire extinguisher").
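
The three steps above can be sketched as follows. This is a minimal illustration, not the authors' planner: the toy 3-D vectors stand in for real CLIP embeddings, the segment centroids and cube half-size are invented values, and the function names are hypothetical. A real planner would also have to check candidates for collisions and choose viewing directions, which this sketch omits.

```python
import numpy as np


def cosine_similarities(query: np.ndarray, segment_features: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query (D,) and K segment features (K, D)."""
    q = query / np.linalg.norm(query)
    f = segment_features / np.linalg.norm(segment_features, axis=1, keepdims=True)
    return f @ q


def sample_views_in_cube(center: np.ndarray, half_size: float, num_samples: int,
                         rng: np.random.Generator) -> np.ndarray:
    """Uniformly sample candidate view positions in an axis-aligned cube."""
    offsets = rng.uniform(-half_size, half_size, size=(num_samples, 3))
    return center + offsets


# Toy stand-ins for stored segment embeddings and their 3D centroids (metres).
segment_features = np.array([[1.0, 0.0, 0.0],
                             [0.0, 1.0, 0.0],
                             [0.7, 0.7, 0.0]])
segment_centroids = np.array([[0.0, 0.0, 1.0],
                              [4.0, 2.0, 1.0],
                              [8.0, 5.0, 1.5]])
# Toy stand-in for the embedding of a text query.
query_embedding = np.array([0.1, 0.9, 0.0])

scores = cosine_similarities(query_embedding, segment_features)
best = int(np.argmax(scores))  # index of the most similar segment

rng = np.random.default_rng(seed=0)
candidates = sample_views_in_cube(segment_centroids[best], half_size=1.0,
                                  num_samples=5, rng=rng)
```

Here the second segment scores highest, so the five candidate positions are drawn from a 2 m cube centred on its centroid.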

3. Key Contributions

Memory-Efficient VL Aggregation: A novel method to aggregate high-dimensional VL features into a volumetric map by associating them with object segments rather than voxels. This allows for open-vocabulary queries without the massive memory overhead of voxel-wise feature storage.
Submap-Based Online Mapping: Integration of the object-centric VL approach with a submap-based Visual-Inertial SLAM system, enabling large-scale, drift-corrected mapping on resource-constrained platforms.
Oversegmentation Strategy: A dynamic tracking mechanism that splits and merges segments to balance fine-grained object part recognition with holistic object understanding.
Real-World Deployment: Successful demonstration of the system running completely online on a custom-built MAV (NVIDIA Jetson Orin NX) in a simulated Search and Rescue scenario.

4. Experimental Results

The system was evaluated on standard datasets (Replica, SemanticKITTI) and in real-world simulations.

Semantic Accuracy:
- On the Replica dataset, FindAnything achieved results competitive with the state of the art (e.g., 62.71% f-mIoU with SLAM poses), outperforming existing methods such as ConceptFusion and RayFronts (NACLIP).
- On SemanticKITTI (large-scale outdoor), it achieved 53.90% f-mIoU at 0.1 m resolution, whereas RayFronts could not run at this resolution because it exceeded GPU memory limits.
Efficiency & Scalability:
- Memory: FindAnything uses ~60% less memory than RayFronts (e.g., 16.23 GB vs. >24.5 GB on KITTI) by aggregating features at the segment level.
- Speed: It processes sequences significantly faster than competitors (e.g., 5m 24s for a Replica sequence vs. 9m 19s for RayFronts and 11h 12m for HOV-SG).
- Real-time Performance: On the onboard Jetson Orin NX, the system maintains real-time operation, with CLIP and eSAM inference taking ~340ms and ~171ms respectively, allowing for continuous exploration.
Downstream Task (Exploration):
- In autonomous exploration tasks (searching for "bed" or "bathroom"), FindAnything demonstrated higher mesh completeness and lower reconstruction RMSE compared to a baseline geometric-only explorer. It successfully guided the robot to specific areas of interest using natural language queries.

5. Significance

FindAnything represents a significant step forward in robotic autonomy for emergency response:

Bridging the Gap: It successfully bridges the gap between the flexibility of large foundation models (CLIP, SAM) and the strict computational constraints of onboard robotics.
Practical Deployment: It is the first system of its kind to demonstrate online, open-vocabulary 3D mapping on a resource-constrained MAV, proving that complex semantic reasoning can be performed in real-time without cloud dependency.
S&R Applicability: By enabling robots to understand and search for specific objects (like exits or tools) in unknown environments using natural language, it directly enhances the capabilities of robots in Search and Rescue missions, potentially saving lives by providing critical information to first responders faster.