Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction

This paper presents a finetuned Vision-Language Model that leverages monocular RGB images, natural language, and robot states to estimate 3D object positions for human-robot interaction, achieving a median error of 13 mm and significantly outperforming non-finetuned baselines.

Ari Wahl, Dorian Gawlinski, David Przewozny, Paul Chojecki, Felix Bießmann, Sebastian Bosse

Published 2026-03-03

Imagine you are teaching a robot arm to pick up a specific toy from a messy table. You want to tell the robot, "Grab the red block," and have it know exactly where to reach in 3D space (left, right, up, down, forward, backward).

This paper is about giving a robot a "super-brain" that can do exactly that, using a single camera and a simple voice command.

Here is the breakdown of their work using some everyday analogies:

1. The Problem: The "Flat-Earth" Robot

Most robots are like people who have only ever looked at a flat map. They are great at seeing what something is (a cup, a shoe) and where it is on a 2D picture (top-left corner). But they struggle with the third dimension: How far away is it? How tall is it?

Standard "Vision-Language Models" (VLMs) are like incredibly well-read librarians. They know everything about the world because they've read the entire internet. They can tell you what a "glue stick" is. But if you ask them, "Where is the glue stick in 3D space so I can grab it?", they usually just guess or say, "I don't know."

2. The Solution: The "Specialized Intern"

The researchers took a smart, pre-trained AI (a VLM) and gave it a specific internship: 3D Position Estimation.

  • The Training: They didn't just talk to the AI; they showed it over 100,000 photos taken from a camera mounted on a robot's wrist. They taught it to look at an object and say, "That's a glue stick, and it is 13 millimeters to the left and 27 millimeters away."
  • The Trick (QLoRA & Routing): Imagine the AI is a Swiss Army Knife. The researchers didn't want to melt down the whole knife to add a new screwdriver. Instead, they used a technique called QLoRA to fine-tune only tiny, low-rank "adapter" pieces on top of a compressed (quantized) model, and bolted on a small, specialized attachment (a "regression head") that handles the math for 3D coordinates.
  • Conditional Routing: This is the clever part. They programmed the AI with a "traffic cop."
    • If you ask, "What is the weather?" the traffic cop sends the question to the original, general brain (which knows about weather).
    • If you ask, "Where is the object?" the traffic cop sends it to the new 3D attachment.
    • Result: The robot stays smart about the world and gets good at grabbing things, without losing its general knowledge.
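The "traffic cop" idea can be sketched in a few lines. This is a minimal, illustrative mock, assuming a simple keyword-based router and a stand-in linear regression head; the paper's actual routing logic and model internals are not reproduced here, and every name below is hypothetical.

```python
import re

def route(prompt: str) -> str:
    """Toy 'traffic cop': pick which head should handle the prompt."""
    is_position_query = re.search(r"\bwhere\b|\bposition\b|\blocat", prompt.lower())
    return "regression_head" if is_position_query else "language_head"

def regression_head(features):
    """Stand-in for the finetuned head: a fixed linear map to (x, y, z) in mm.
    Here the weights are just the identity, so the 'features' pass through."""
    weights = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    return tuple(sum(w * f for w, f in zip(row, features)) for row in weights)

def answer(prompt: str, features=(13.0, 27.0, 5.0)):
    if route(prompt) == "regression_head":
        return regression_head(features)   # numeric 3D output for grasping
    return "general VLM text answer"       # normal text-generation path

print(answer("Where is the glue stick?"))  # -> (13.0, 27.0, 5.0)
print(answer("What is the weather?"))      # -> general VLM text answer
```

The point of the design is in the `if`: position queries never touch the original language head, and general questions never touch the regression head, so neither skill degrades the other.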

3. The Dataset: The "Robot Gym"

To train this, they built a custom gym. They used a robot arm with a camera on its wrist and moved it around a table with 750 different objects (from ice cream molds to sunglasses).

  • They took pictures from different angles and lighting conditions.
  • They made sure the robot saw objects from above, just like a human hand would reach down to grab them.
  • They were careful to keep the "height" of the objects secret during training, so the AI had to learn to guess the depth from the picture alone, rather than just memorizing the answer.
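One training example from such a setup might look like the record below. This is an assumed shape, not the paper's actual dataset schema: the field names, values, and robot-state encoding are all illustrative. The key detail from the text is what is *missing*: no object-height field, so depth must come from the pixels.

```python
# Hypothetical structure of a single training sample (all fields illustrative).
sample = {
    "image": "wrist_cam/frame_0001.png",           # monocular RGB from the wrist camera
    "instruction": "Where is the glue stick?",     # natural-language query
    "robot_state": [0.12, -0.45, 1.03],            # e.g. end-effector pose inputs
    "target_mm": {"x": -13.0, "y": 27.0, "z": 41.0},  # ground-truth 3D position
}

# The object's height is deliberately absent, so the model cannot shortcut
# depth estimation by reading it off a metadata field.
assert "object_height" not in sample
print(sorted(sample.keys()))
```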

4. The Results: "Good Enough to Grab"

How well did it work?

  • The Score: The AI was off by a median of 13 millimeters (about half an inch).
  • The Comparison: This was 5 times better than a simpler model that hadn't been trained on this specific task.
  • The Real-World Test: In about 25% of cases, the error was small enough (within 10 mm) that the robot could successfully grab or push the object without dropping it.
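The two headline numbers are easy to compute given predictions and ground truth: the median 3D (Euclidean) error, and the fraction of predictions inside a "graspable" threshold. A minimal sketch, with made-up data rather than the paper's results:

```python
from math import dist        # Euclidean distance between two points (Python 3.8+)
from statistics import median

def metrics(preds, targets, threshold_mm=10.0):
    """Median Euclidean error and fraction of predictions within threshold_mm."""
    errors = [dist(p, t) for p, t in zip(preds, targets)]
    within = sum(e <= threshold_mm for e in errors) / len(errors)
    return median(errors), within

# Illustrative predicted vs. true positions in millimeters (not real data).
preds   = [(0, 0, 0), (5, 5, 5), (20, 0, 0), (0, 30, 0)]
targets = [(0, 0, 3), (0, 5, 5), (0, 0, 0),  (0, 0, 0)]

med, frac = metrics(preds, targets)
print(med, frac)  # -> 12.5 0.5  (errors are 3, 5, 20, 30 mm)
```

A median of 13 mm on the real benchmark means half of all predictions landed within 13 mm of the true position.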

5. Where It Stumbles: The "Glue Stick" Problem

The researchers analyzed where the AI failed, and it makes perfect sense:

  • Tall, thin objects: Things like glue sticks or soda bottles are hard to judge from a top-down camera view because they look like small dots. It's like trying to guess the height of a pencil just by looking at its eraser end.
  • Weird shapes: Unusual items (like a weirdly shaped toy or a pair of sunglasses) confused the AI because it had mostly seen "normal" things on the internet.
  • The Z-Coordinate (Depth): The AI was best at estimating position across the table plane (X and Y) but struggled more with the Z-axis, the depth direction along the camera's line of sight. Recovering that dimension is the hardest part of 3D vision with a single camera, much like trying to judge how far away a car is just by looking at a photo.

The Bottom Line

This paper proves that we can take a general-purpose AI (which knows everything) and give it a specific "robot arm" skill without breaking its general knowledge. It's like taking a brilliant professor and giving them a specialized tool to fix a car engine. They can still talk about history, but now they can also tighten a bolt with surprising accuracy.

While it's not perfect yet (it still struggles with weird angles and tall thin objects), it's a massive step toward robots that can understand human instructions and interact with our physical world intuitively.