Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction

This paper presents a finetuned Vision-Language Model that leverages monocular RGB images, natural language, and robot states to estimate 3D object positions for human-robot interaction, achieving a median error of 13 mm and significantly outperforming non-finetuned baselines.

Ari Wahl, Dorian Gawlinski, David Przewozny, Paul Chojecki, Felix Bießmann, Sebastian Bosse

Published 2026-03-03

Imagine you are teaching a robot arm to pick up a specific toy from a messy table. You want to tell the robot, "Grab the red block," and have it know exactly where to reach in 3D space (left, right, up, down, forward, backward).

This paper is about giving a robot a "super-brain" that can do exactly that, using a single camera and a simple voice command.

Here is the breakdown of their work using some everyday analogies:

1. The Problem: The "Flat-Earth" Robot

Most robots are like people who have only ever looked at a flat map. They are great at seeing what something is (a cup, a shoe) and where it is on a 2D picture (top-left corner). But they struggle with the third dimension: How far away is it? How tall is it?

Standard "Vision-Language Models" (VLMs) are like incredibly well-read librarians. They know everything about the world because they've read the entire internet. They can tell you what a "glue stick" is. But if you ask them, "Where is the glue stick in 3D space so I can grab it?", they usually just guess or say, "I don't know."

2. The Solution: The "Specialized Intern"

The researchers took a smart, pre-trained AI (a VLM) and gave it a specific internship: 3D Position Estimation.

  • The Training: They didn't just talk to the AI; they showed it over 100,000 photos taken from a camera mounted on a robot's wrist. They taught it to look at an object and say, "That's a glue stick, and it is 13 millimeters to the left and 27 millimeters away."
  • The Trick (QLoRA & Routing): Imagine the AI is a Swiss Army Knife. The researchers didn't want to melt down the whole knife to add a new screwdriver. Instead, they used a technique called QLoRA to fine-tune only tiny, low-rank "adapter" pieces on top of a compressed (quantized) model, and bolted on a small, specialized attachment (a "regression head") that handles the math for 3D coordinates.
  • Conditional Routing: This is the clever part. They programmed the AI with a "traffic cop."
    • If you ask, "What is the weather?" the traffic cop sends the question to the original, general brain (which knows about weather).
    • If you ask, "Where is the object?" the traffic cop sends it to the new 3D attachment.
    • Result: The robot stays smart about the world and gets good at grabbing things, without losing its general knowledge.
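The "traffic cop" idea can be sketched in a few lines. This is a minimal, illustrative mock, assuming a simple keyword-based router and a stand-in linear regression head; the paper's actual routing logic and model internals are not reproduced here, and every name below is hypothetical.

```python
import re

def route(prompt: str) -> str:
    """Toy 'traffic cop': pick which head should handle the prompt."""
    is_position_query = re.search(r"\bwhere\b|\bposition\b|\blocat", prompt.lower())
    return "regression_head" if is_position_query else "language_head"

def regression_head(features):
    """Stand-in for the finetuned head: a fixed linear map to (x, y, z) in mm.
    Here the weights are just the identity, so the 'features' pass through."""
    weights = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    return tuple(sum(w * f for w, f in zip(row, features)) for row in weights)

def answer(prompt: str, features=(13.0, 27.0, 5.0)):
    if route(prompt) == "regression_head":
        return regression_head(features)   # numeric 3D output for grasping
    return "general VLM text answer"       # normal text-generation path

print(answer("Where is the glue stick?"))  # -> (13.0, 27.0, 5.0)
print(answer("What is the weather?"))      # -> general VLM text answer
```

The point of the design is in the `if`: position queries never touch the original language head, and general questions never touch the regression head, so neither skill degrades the other.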

3. The Dataset: The "Robot Gym"

To train this, they built a custom gym. They used a robot arm with a camera on its wrist and moved it around a table with 750 different objects (from ice cream molds to sunglasses).

  • They took pictures from different angles and lighting conditions.
  • They made sure the robot saw objects from above, just like a human hand would reach down to grab them.
  • They were careful to keep the "height" of the objects secret during training, so the AI had to learn to guess the depth from the picture alone, rather than just memorizing the answer.
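One training example from such a setup might look like the record below. This is an assumed shape, not the paper's actual dataset schema: the field names, values, and robot-state encoding are all illustrative. The key detail from the text is what is *missing*: no object-height field, so depth must come from the pixels.

```python
# Hypothetical structure of a single training sample (all fields illustrative).
sample = {
    "image": "wrist_cam/frame_0001.png",           # monocular RGB from the wrist camera
    "instruction": "Where is the glue stick?",     # natural-language query
    "robot_state": [0.12, -0.45, 1.03],            # e.g. end-effector pose inputs
    "target_mm": {"x": -13.0, "y": 27.0, "z": 41.0},  # ground-truth 3D position
}

# The object's height is deliberately absent, so the model cannot shortcut
# depth estimation by reading it off a metadata field.
assert "object_height" not in sample
print(sorted(sample.keys()))
```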

4. The Results: "Good Enough to Grab"

How well did it work?

  • The Score: The AI was off by a median of 13 millimeters (about half an inch).
  • The Comparison: This was 5 times better than a simpler model that hadn't been trained on this specific task.
  • The Real-World Test: In about 25% of cases, the error was small enough (within 10 mm) that the robot could successfully grab or push the object without dropping it.
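The two headline numbers are easy to compute given predictions and ground truth: the median 3D (Euclidean) error, and the fraction of predictions inside a "graspable" threshold. A minimal sketch, with made-up data rather than the paper's results:

```python
from math import dist        # Euclidean distance between two points (Python 3.8+)
from statistics import median

def metrics(preds, targets, threshold_mm=10.0):
    """Median Euclidean error and fraction of predictions within threshold_mm."""
    errors = [dist(p, t) for p, t in zip(preds, targets)]
    within = sum(e <= threshold_mm for e in errors) / len(errors)
    return median(errors), within

# Illustrative predicted vs. true positions in millimeters (not real data).
preds   = [(0, 0, 0), (5, 5, 5), (20, 0, 0), (0, 30, 0)]
targets = [(0, 0, 3), (0, 5, 5), (0, 0, 0),  (0, 0, 0)]

med, frac = metrics(preds, targets)
print(med, frac)  # -> 12.5 0.5  (errors are 3, 5, 20, 30 mm)
```

A median of 13 mm on the real benchmark means half of all predictions landed within 13 mm of the true position.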

5. Where It Stumbles: The "Glue Stick" Problem

The researchers analyzed where the AI failed, and it makes perfect sense:

  • Tall, thin objects: Things like glue sticks or soda bottles are hard to judge from a top-down camera view because they look like small dots. It's like trying to guess the height of a pencil just by looking at its eraser end.
  • Weird shapes: Unusual items (like a weirdly shaped toy or a pair of sunglasses) confused the AI because it had mostly seen "normal" things on the internet.
  • The Z-Coordinate (Depth): The AI was best at estimating position across the table plane (X and Y) but struggled more with the Z-axis, the depth direction along the camera's line of sight. Recovering that dimension is the hardest part of 3D vision with a single camera, much like trying to judge how far away a car is just by looking at a photo.

The Bottom Line

This paper proves that we can take a general-purpose AI (which knows everything) and give it a specific "robot arm" skill without breaking its general knowledge. It's like taking a brilliant professor and giving them a specialized tool to fix a car engine. They can still talk about history, but now they can also tighten a bolt with surprising accuracy.

While it's not perfect yet (it still struggles with weird angles and tall thin objects), it's a massive step toward robots that can understand human instructions and interact with our physical world intuitively.