Viewpoint-Agnostic Grasp Pipeline using VLM and Partial Observations

This paper presents an end-to-end, viewpoint-agnostic grasping pipeline for mobile legged manipulators. It combines vision-language models with partial-observation compensation to achieve robust, language-guided object selection and safe execution in cluttered environments, outperforming a view-dependent baseline with a 90% success rate.

Dilermando Almeida, Juliano Negri, Guilherme Lazzarini, Thiago H. Segreto, Ranulfo Bezerra, Ricardo V. Godoy, Marcelo Becker

Published Tue, 10 Ma

Imagine you are a robot dog with a robotic arm, tasked with picking up a specific item from a messy, cluttered table. The table is covered in boxes, tools, and other objects. Some items are hidden behind others, and you can only see part of them from where you are standing.

This paper describes a new "brain" for that robot dog that makes it much better at this job, even when the view is blocked. Here is how it works, broken down into simple steps with some creative analogies.

The Problem: The "Blindfolded" Robot

Traditionally, robots try to grab things based only on what they can see right now.

  • The Analogy: Imagine trying to pick up a specific book from a messy shelf, but you are only allowed to look at it from one angle. If the book is half-hidden behind a lamp, you might guess where to grab it, but you could end up knocking the lamp over, or your hand might hit the shelf because you couldn't see the space around the book.
  • The Result: In the experiments, a standard robot (the "baseline") only succeeded 30% of the time. It kept crashing into things or couldn't reach the object because it didn't "know" what was hidden behind the clutter.

The Solution: The "Super-Imagination" Pipeline

The authors built a system that lets the robot understand the world in three dimensions, even when parts of it are missing. Here is the step-by-step process:

1. Listening and Finding (The "Smart Search")

Instead of the robot needing to know the exact coordinates of an object, a human just says, "Pick up the blue bottle."

  • How it works: The robot uses a special AI (a Vision-Language Model) that acts like a super-smart librarian. It looks at the picture and finds the "blue bottle" even if it's mixed in with other junk. It draws a digital box around it and then cuts out a precise "stencil" (mask) of just that object.
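The grounding step above can be sketched in a few lines. This is a toy stand-in, not the paper's code: the `Detection` records would come from a real vision-language model, and the box-to-mask step is a crude placeholder for per-pixel segmentation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    box: tuple   # (x0, y0, x1, y1) in pixels
    score: float

def select_target(detections, query):
    """Pick the highest-confidence detection whose label appears in the
    language query (stand-in for VLM-based grounding)."""
    matches = [d for d in detections if d.label in query]
    if not matches:
        return None
    return max(matches, key=lambda d: d.score)

def box_to_mask(box, width, height):
    """Rasterize the box into a binary 'stencil'; the real pipeline
    refines this into a per-pixel segmentation mask."""
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0
             for x in range(width)] for y in range(height)]

# Hypothetical detections for the scene described in the text.
detections = [Detection("red box", (0, 0, 3, 3), 0.8),
              Detection("blue bottle", (4, 1, 7, 5), 0.9)]
target = select_target(detections, "Pick up the blue bottle")
mask = box_to_mask(target.box, width=8, height=6)
```

The point of the mask is that everything downstream (3D completion, grasp sampling) operates only on the pixels belonging to the requested object, not the whole cluttered scene.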

2. Filling in the Blanks (The "3D Puzzle Solver")

This is the magic part. Since the robot can only see the front of the bottle, the back is a mystery.

  • The Analogy: Imagine you are looking at a snowman from the front. You can see the nose and the eyes, but the back is hidden. A normal robot would try to grab the invisible back. This new system uses AI to imagine the rest of the snowman. It asks, "If I see this front, what does the back probably look like?"
  • How it works: The system takes the partial 3D data it has and uses two AI models to "hallucinate" (predict) the missing parts of the object. It fills in the holes, creating a complete, solid 3D model of the object, even though the robot never actually saw the back.
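To make the idea concrete, here is a deliberately simple completion heuristic: reflect the visible points across the object's centroid along the viewing axis. The paper uses learned models for this prediction; mirroring is only an assumption-laden sketch that shows what "filling in the back" means on a point cloud.

```python
def complete_by_mirroring(points, view_axis=0):
    """Toy shape completion: reflect visible 3D points across the plane
    through the centroid, perpendicular to the viewing axis. A crude
    stand-in for the learned completion networks in the paper."""
    n = len(points)
    centroid = sum(p[view_axis] for p in points) / n
    mirrored = []
    for p in points:
        q = list(p)
        q[view_axis] = 2 * centroid - p[view_axis]  # reflect this coordinate
        mirrored.append(tuple(q))
    return points + mirrored

# A few visible points on the "front" of an object (camera looks along x).
partial = [(0.0, 0.0, 0.0), (0.2, 1.0, 0.0), (0.4, 0.0, 1.0)]
completed = complete_by_mirroring(partial)
```

Mirroring only works for roughly symmetric objects; the learned models handle arbitrary shapes, but the output plays the same role: a watertight 3D model the grasp planner can reason about.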

3. The Safety Check (The "Dance Rehearsal")

Now that the robot has a complete picture of the object, it needs to figure out how to grab it without crashing.

  • The Analogy: Before a dancer performs a complex move on a crowded stage, they rehearse the whole routine in their head to make sure they won't trip over props or hit the audience.
  • How it works: The robot simulates thousands of different ways to grab the object. For each candidate it checks:
    • "If I grab it here, will my arm hit the box next to it?"
    • "Can my body actually reach that spot?"
    • "Is this a stable grip?"
  It picks the safest and most reachable option, discarding any that would cause a collision.
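The filtering loop above can be sketched as follows. The collision check here is a simple axis-aligned-box test and the reachability check a distance threshold; the real system uses full arm kinematics and scene geometry, so treat every name and number below as illustrative.

```python
import math

def aabb_contains(box, p):
    """True if point p lies inside the axis-aligned box ((min), (max))."""
    (x0, y0, z0), (x1, y1, z1) = box
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1 and z0 <= p[2] <= z1

def filter_grasps(candidates, obstacles, base, reach):
    """Keep grasps that are collision-free and reachable, then return
    the most stable one (None if nothing survives)."""
    feasible = []
    for g in candidates:
        pos = g["position"]
        if any(aabb_contains(b, pos) for b in obstacles):
            continue  # gripper would hit the clutter
        if math.dist(base, pos) > reach:
            continue  # arm cannot reach from the current stance
        feasible.append(g)
    return max(feasible, key=lambda g: g["stability"], default=None)

# Hypothetical candidates: one collides, one is out of reach, one is fine.
candidates = [
    {"position": (0.5, 0.0, 0.2), "stability": 0.9},
    {"position": (3.0, 0.0, 0.2), "stability": 0.8},
    {"position": (1.0, 0.2, 0.3), "stability": 0.7},
]
obstacles = [((0.4, -0.1, 0.0), (0.7, 0.1, 0.5))]  # the box next to the object
best = filter_grasps(candidates, obstacles, base=(0.0, 0.0, 0.0), reach=1.5)
```

Note that the winner is not the most "stable" candidate in isolation: the two higher-scoring grasps are rejected by the safety checks first, which is exactly the rehearsal behavior the analogy describes.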

4. Moving the Body (The "Dance Step")

Sometimes, even with a perfect plan, the robot is standing in the wrong spot.

  • The Analogy: If you are trying to reach a high shelf but your feet are stuck, you can't get it. You have to take a step closer or move to the side.
  • How it works: If the robot realizes it can't reach the object from its current spot, it doesn't just give up. It moves its legs (repositions its base) to get a better angle, then extends its arm to grab the object.
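A minimal sketch of that fallback, under the assumption that the base moves in a 2D plane and "reachable" just means "within arm's length" (the real controller plans full-body motion):

```python
import math

def reposition_base(base, target, reach, step=0.25, max_steps=20):
    """If the target is outside the arm's reach, walk the base toward it
    in small steps until it is within reach. Returns the final base
    position and whether the target became reachable."""
    bx, by = base
    tx, ty = target
    for _ in range(max_steps):
        d = math.hypot(tx - bx, ty - by)
        if d <= reach:
            return (bx, by), True   # close enough: extend the arm
        bx += step * (tx - bx) / d  # take one step toward the target
        by += step * (ty - by) / d
    return (bx, by), False          # gave up after max_steps

# Object 2 m away, arm reach only 0.9 m: the robot must walk first.
pose, reached = reposition_base((0.0, 0.0), (2.0, 0.0), reach=0.9)
```

The key behavior is the feedback loop: reachability is re-checked after every step, so the robot walks only as far as it needs to before switching back to the arm.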

The Results: A Big Win

The researchers tested this on a real robot dog (Boston Dynamics' Spot) in two messy scenarios:

  1. The Drill: A power drill hidden behind boxes.
  2. The Blue Bottle: A bottle tucked behind other items.
  • Old Way (View-Dependent): The robot failed 70% of the time. It either crashed into the clutter or couldn't reach the object because it didn't account for the hidden parts.
  • New Way (View-Agnostic): The robot succeeded 90% of the time.

Why This Matters

This paper shows that for robots to work in the real world (which is messy and full of hidden things), they can't just rely on what their cameras see right now. They need to:

  1. Understand language to know what to pick up.
  2. Use imagination to fill in the parts they can't see.
  3. Plan carefully to avoid crashing before they even move.

It's the difference between a robot that blindly reaches out and knocks everything over, and a robot that thinks, plans, and successfully picks up the item it was asked for.