CuriousBot: Interactive Mobile Exploration via Actionable 3D Relational Object Graph

Imagine you are walking into a messy, cluttered room with a friend who has never been there before. Your goal is to find four hidden toys.

The Old Way (Traditional Robots):
Most robots today act like a person with a flashlight who can only look. They walk around, shining their light to see what's in front of them. If a toy is hidden inside a closed cabinet or under a pile of clothes, the robot just sees a wall or a shirt. It says, "I don't know what's there," and moves on. It's like trying to solve a puzzle by only looking at the box cover.

The New Way (CuriousBot):
The paper introduces CuriousBot, a robot that doesn't just look; it interacts. It acts like a curious child who knows that if something is blocking a view, they should move it, open it, or lift it to see what's behind.

Here is how CuriousBot works, broken down into simple concepts:

1. The "Mental Map" (The Actionable 3D Relational Graph)

Instead of just taking a picture, CuriousBot builds a special 3D mental map of the room. But this isn't just a drawing of where things are; it's a map of relationships.

Think of this map like a family tree for objects, but with instructions on how to interact with them:

The Nodes (The People): It knows "This is a cabinet," "This is a box," and "This is a toy."
The Connections (The Relationships): It understands that the toy is inside the cabinet and the cabinet is behind the chair.
The "Actionable" Part: This is the magic. The map doesn't just say "Chair." It says, "Chair: Push me to see what's behind." Or "Cabinet: Open me to see what's inside."

It's like having a treasure map where the X doesn't just mark the spot; it tells you, "Dig here," or "Move this rock first."

2. The Team of Four (How it Works)

CuriousBot is powered by four distinct components working together:

The Eyes (SLAM): This is the robot's sense of balance and sight. It uses cameras to build a 3D model of the room as it walks, keeping track of where it is.
The Architect (Graph Constructor): This part takes the 3D model and turns it into that special "Family Tree" map. It figures out, "Oh, that box is under the table," and "That cloth is covering the bottle."
The Brain (Task Planner): This is the robot's logic center (powered by a Large Language Model, like a super-smart AI chatbot). It looks at the map and thinks: "I need to find the toys. The map says a toy is inside the cabinet, but the cabinet is behind the chair. Therefore, I must push the chair first, then open the cabinet."
The Hands (Low-Level Skills): These are the physical actions. Once the Brain decides what to do, the Hands execute it: pushing, lifting, opening, or even sitting down to look under a table.

3. The "Aha!" Moments (What it can do)

The paper shows the robot doing things that previous robots couldn't:

The "Push": It sees a chair blocking a hidden space. Instead of walking around it, it pushes the chair aside to reveal a toy behind it.
The "Lift": It sees a cloth on the floor. It lifts the cloth to check if a bottle is hiding underneath.
The "Flip": It finds a box and flips it over to see if something is inside.
The "Sit": It can even sit down (using a Spot robot) to get a lower angle and see under a table.

4. Why is this better than just using a "Smart Camera"?

The researchers tested CuriousBot against other AI systems that just look at images (like GPT-4V or LLaVA).

The "Smart Camera" approach: The AI looks at a picture of a closed cabinet and guesses, "Maybe there's a toy inside." It has to guess based on memory.
CuriousBot's approach: It has a structured map. It knows the relationship is "Inside." It doesn't guess; it plans a specific sequence of actions to verify.

The Result: CuriousBot was much more successful at finding hidden items (82% success rate) compared to the other methods, which often got stuck or gave up because they couldn't figure out the "hidden" parts of the room.

The Bottom Line

CuriousBot is a robot that treats the world like a puzzle to be solved by touching and moving things, not just looking. It builds a map that understands not just what objects are, but how they relate to each other and how to move them to find what's hidden. It's the difference between a robot that is afraid to touch a messy room and a robot that dives in, moves the furniture, and finds the lost toys.

Here is a detailed technical summary of the paper "CuriousBot: Interactive Mobile Exploration via Actionable 3D Relational Object Graph."

1. Problem Statement

Mobile robot exploration in complex household environments faces significant challenges due to occlusions (objects hidden inside cabinets, under furniture, or behind obstacles).

Limitations of Current Methods: Traditional approaches focus on active perception (moving the camera to minimize unknown space) but neglect active interaction (physically manipulating objects to reveal hidden spaces).
Limitations of Existing Interaction Methods: Prior works like RoboEXP focus on tabletop scenarios and do not address the unique challenges of mobile exploration, which include:
- Expanded exploration space: Larger areas requiring complex navigation.
- Complex occlusion relationships: Intricate spatial relations (e.g., inside, behind, under, of) that require reasoning beyond simple 2D views.
- Larger action space: The need to combine navigation with diverse manipulation skills (pushing, lifting, opening, flipping).
Goal: To develop a system that can autonomously explore, reason about occluded spaces, and interact with the environment to uncover unknown objects and spaces.

2. Methodology

The authors propose CuriousBot, a system built around a novel Actionable 3D Relational Object Graph. The framework consists of four core modules:

A. System Architecture

SLAM Module:
- Inputs: RGB-D observations and robot odometry.
- Function: Estimates camera poses using RTAB-Map to localize the robot and map the environment.
Graph Constructor:
- Object Detection & Segmentation: Uses open-vocabulary detectors (YOLO-World) and segmentation models (Segment Anything Model - SAM) to identify objects and generate 3D point clouds.
- Node Association: Associates new detections with existing graph nodes based on label consistency and Intersection over Union (IoU) of point clouds.
- Relation Construction: Establishes edges between nodes using two signals:
  - Interaction-driven: Infers relations based on the last action (e.g., "open" $\rightarrow$ "inside", "push" $\rightarrow$ "behind").
  - Geometry-driven: Uses 3D bounding box tests to infer static relations (e.g., "on").
- Voxel Map: Maintains a 3D voxel grid (labeled as unexplored, free, unknown, outside) to identify occluded regions. If a visible object blocks a large "unknown" region, it is marked as an obstruction in the graph.
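The node association and relation construction steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the voxel size, the IoU threshold, the dictionary-based node format, the axis-aligned box test for "on", and every function name are assumptions made for the example. Only the two action-to-relation pairs mentioned in the text are included.

```python
import math

VOXEL_SIZE = 0.05    # metres; assumed value for illustration
IOU_THRESHOLD = 0.3  # assumed value for illustration

def voxelize(points, voxel_size=VOXEL_SIZE):
    """Map an iterable of (x, y, z) points to the set of voxel indices they occupy."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}

def point_cloud_iou(points_a, points_b, voxel_size=VOXEL_SIZE):
    """Approximate point-cloud IoU by comparing occupied voxel sets."""
    a, b = voxelize(points_a, voxel_size), voxelize(points_b, voxel_size)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def associate(detection, nodes, threshold=IOU_THRESHOLD):
    """Return the existing node with the same label and highest IoU, or None."""
    best, best_iou = None, threshold
    for node in nodes:
        if node["label"] != detection["label"]:
            continue  # label consistency check
        iou = point_cloud_iou(node["points"], detection["points"])
        if iou >= best_iou:
            best, best_iou = node, iou
    return best

# Interaction-driven relations: only the two pairs named in the text.
ACTION_TO_RELATION = {"open": "inside", "push": "behind"}

def relation_from_action(last_action):
    """Relation implied by the last executed action, or None if there is none."""
    return ACTION_TO_RELATION.get(last_action)

def is_on(upper_box, lower_box, tol=0.03):
    """Geometry-driven test: boxes are ((xmin, ymin, zmin), (xmax, ymax, zmax)).

    'upper' is on 'lower' if their footprints overlap in x and y and the
    bottom of 'upper' is within tol metres of the top of 'lower'.
    """
    (umin, umax), (lmin, lmax) = upper_box, lower_box
    overlap_xy = all(umin[i] < lmax[i] and lmin[i] < umax[i] for i in range(2))
    touching = abs(umin[2] - lmax[2]) <= tol
    return overlap_xy and touching
```

A detection that matches no node under `associate` would become a new node; a match would be merged into the existing one.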
Task Planner:
- Serialization: Converts the 3D graph into a text format via Depth-First Search (DFS), appending [obstruction] tags to nodes that block unknown spaces.
- Decision Making: Uses a Large Language Model (LLM, specifically GPT-4o) to analyze the serialized graph and generate high-level action plans (e.g., "Push chair," "Open cabinet").
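The serialization step can be illustrated with a short sketch. The `Node` class, the indentation-based text layout, and the `relation: label` line format are assumptions for this example; the source states only that the graph is traversed depth-first and that `[obstruction]` tags are appended to nodes blocking unknown space.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical graph node: a label, an obstruction flag, and outgoing edges."""
    label: str
    obstruction: bool = False
    # Each child is a (relation, Node) pair; relation may be None at the top level.
    children: list = field(default_factory=list)

def serialize(node, relation=None, depth=0):
    """Depth-first traversal that returns one text line per node."""
    tag = " [obstruction]" if node.obstruction else ""
    prefix = f"{relation}: " if relation else ""
    lines = [f"{'  ' * depth}{prefix}{node.label}{tag}"]
    for child_relation, child in node.children:
        lines.extend(serialize(child, child_relation, depth + 1))
    return lines

# Example scene: a chair blocking unknown space, a toy inside a cabinet,
# and a cloth on a table that hides something.
room = Node("room", children=[
    (None, Node("chair", obstruction=True)),
    (None, Node("cabinet", children=[("inside", Node("toy"))])),
    (None, Node("table", children=[("on", Node("cloth", obstruction=True))])),
])

prompt_text = "\n".join(serialize(room))
print(prompt_text)
```

The resulting text would then be placed in the LLM prompt alongside the task description and few-shot examples.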
Low-Level Skills:
- Executes specific manipulation primitives based on the planner's output: Open (drawers/cabinets), Flip (boxes), Lift (cloths), Push (large objects), Sit (to check under tables), and Collect.
- Skills utilize heuristics and impedance control to handle real-world uncertainties (e.g., retrying grasps).
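One way to picture the hand-off from planner to skills is a small dispatcher with retries. This is a sketch under assumptions: the plan-step format ("Push chair"), the `primitives` mapping, the retry limit, and all function names are hypothetical, and the real skills involve perception and impedance control that are not modeled here.

```python
# Skill names taken from the list above; everything else is illustrative.
SKILLS = ("open", "flip", "lift", "push", "sit", "collect")

def parse_plan_step(step):
    """Split a step such as 'Push chair' into ('push', 'chair')."""
    skill, _, target = step.strip().partition(" ")
    skill = skill.lower()
    if skill not in SKILLS:
        raise ValueError(f"unknown skill: {skill!r}")
    return skill, target

def execute(step, primitives, max_attempts=3):
    """Run the primitive for a plan step, retrying on failure.

    `primitives` maps a skill name to a callable that takes the target
    and returns True on success. Returns the number of attempts used,
    or None if every attempt failed.
    """
    skill, target = parse_plan_step(step)
    for attempt in range(1, max_attempts + 1):
        if primitives[skill](target):
            return attempt
    return None
```

For instance, a `lift` primitive whose first grasp slips would be called a second time before the planner is told the step failed.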

B. Key Innovation: Actionable 3D Relational Object Graph

Unlike previous scene graphs that are static or lack occlusion reasoning, this graph is action-conditioned. It explicitly encodes:

Semantic & Geometric Info: Object labels and 3D structure.
Dynamic Relations: Relationships like inside, behind, under, of, on.
Obstruction Logic: Identifies which objects are blocking the view of unknown spaces, enabling the robot to decide how to interact (e.g., "Push the chair to see behind it").
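The obstruction logic can be sketched on a toy voxel grid. This is a simplified illustration, not the method from the paper: it assumes an axis-aligned viewing direction, a dictionary-based grid, and a hypothetical threshold `min_unknown` for what counts as a "large" unknown region.

```python
def unknown_behind(object_voxels, grid, view_step, max_depth=10):
    """Collect unknown voxels lying directly behind an object.

    object_voxels: set of (i, j, k) indices occupied by the object.
    grid: dict mapping voxel index to a label such as "free" or "unknown".
    view_step: integer step along the viewing direction, e.g. (1, 0, 0).
    """
    hidden = set()
    for voxel in object_voxels:
        current = voxel
        for _ in range(max_depth):
            current = tuple(c + s for c, s in zip(current, view_step))
            if current in object_voxels:
                continue  # still inside the object itself
            if grid.get(current) != "unknown":
                break  # reached observed space or the edge of the map
            hidden.add(current)
    return hidden

def is_obstruction(object_voxels, grid, view_step, min_unknown=4):
    """Flag the object if it hides at least min_unknown unknown voxels."""
    return len(unknown_behind(object_voxels, grid, view_step)) >= min_unknown

# Toy example: a two-voxel chair with four unknown voxels behind it.
chair = {(0, 0, 0), (0, 1, 0)}
grid = {
    (1, 0, 0): "unknown", (2, 0, 0): "unknown", (3, 0, 0): "free",
    (1, 1, 0): "unknown", (2, 1, 0): "unknown", (3, 1, 0): "free",
    (1, 5, 0): "free",
}
print(is_obstruction(chair, grid, view_step=(1, 0, 0)))  # True
```

An object flagged this way would carry the `[obstruction]` tag when the graph is handed to the planner.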

3. Key Contributions

Actionable 3D Relational Object Graph: A novel representation that encodes diverse object relations and occlusion logic, enabling mobile robots to reason about hidden spaces.
CuriousBot System: A complete pipeline integrating SLAM, graph construction, LLM-based planning, and low-level skills to perform interactive mobile exploration. It is the first system to simultaneously achieve Interactivity (active manipulation), Mobility (large-scale navigation), and Exploratory reasoning (discovering unknown spaces).
Comprehensive Evaluation: Extensive experiments demonstrating generalization across object instances (rigid, deformable, articulated), relations, and scene layouts.

4. Experimental Results

The system was evaluated in a 3m $\times$ 4m room with 12 object categories and 6 unique layouts.

Tasks: Five distinct exploration tasks were tested: Flipping Boxes, Opening Drawers, Checking Underneath, Pushing Boxes, and Lifting Cloth.
Performance Metrics:
- Success Rate: CuriousBot achieved an average of 82%, well above the 12–32% average success of the VLM baselines.
- Object Recovery (OR): 81.6% (percentage of ground truth objects found).
- Graph Editing Distance (GED): 1.28 (lower is better), indicating high accuracy in the constructed graph structure.
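As a worked example of the Object Recovery metric, the sketch below computes the fraction of ground-truth objects that were found. The object identifiers are made up, and treating false positives as having no effect on the score is an assumption; the source defines OR only as the percentage of ground-truth objects found.

```python
def object_recovery(found_ids, ground_truth_ids):
    """Fraction of ground-truth objects that appear among the found objects."""
    ground_truth = set(ground_truth_ids)
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return len(ground_truth & set(found_ids)) / len(ground_truth)

# Hypothetical trial: four of five hidden objects found, plus one false positive.
ground_truth = {"toy_1", "toy_2", "toy_3", "toy_4", "bottle_1"}
found = {"toy_1", "toy_2", "toy_4", "bottle_1", "false_positive_1"}
print(f"{object_recovery(found, ground_truth):.0%}")  # 80%
```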
Baseline Comparisons:
- Compared against VLMs (LLaVA, Gemini, GPT-4o) fed RGB images directly, and against a heuristic baseline.
- Result: VLMs relying on 2D observations failed to reason about occlusions effectively (Average Success: 12–32%). CuriousBot's explicit 3D graph reasoning proved superior for task planning.
Failure Analysis:
- Perception Failure (3/50): Caused by SLAM inaccuracies or detector errors.
- Decision Failure (3/50): LLM choosing incorrect skills despite correct graph data.
- Action Failure (3/50): Physical execution issues (loose grasp, early release, interference).
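The failure counts are consistent with the reported success rate, as a quick check shows. The check assumes the three categories are disjoint and together account for every failed trial out of 50.

```python
TRIALS = 50
failures = {"perception": 3, "decision": 3, "action": 3}

total_failures = sum(failures.values())               # 9 failed trials
success_rate = (TRIALS - total_failures) / TRIALS     # 41 / 50
print(f"{total_failures} failures, {success_rate:.0%} success")  # 9 failures, 82% success
```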
Ablation Study: Reducing the number of few-shot examples provided to the LLM significantly degraded performance, confirming the necessity of the provided context.

5. Significance and Future Work

Significance: This work bridges the gap between active perception and active interaction in mobile robotics. It demonstrates that explicit 3D relational reasoning is more effective than relying solely on VLMs for complex exploration tasks involving occlusions.
Limitations:
- Skill acquisition currently relies on manual heuristics written by roboticists.
- The system does not update the scene graph dynamically after every interaction (static memory update).
- Complex relations (e.g., "next to") are not fully captured.
Future Directions: Developing scalable skill acquisition processes, implementing dynamic scene memory for long-term tracking, and using foundation models to automatically capture more complex object relations.

In conclusion, CuriousBot represents a significant step forward in autonomous mobile manipulation, showing that robots can reason about hidden spaces and physically interact with their environment to explore the unknown.
