Viewpoint-Agnostic Grasp Pipeline using VLM and Partial Observations

This paper presents an end-to-end, viewpoint-agnostic grasping pipeline for mobile legged manipulators. It combines vision-language models with partial-observation compensation to achieve robust, language-guided object selection and safe execution in cluttered environments, outperforming a view-dependent baseline with a 90% success rate.

Dilermando Almeida, Juliano Negri, Guilherme Lazzarini, Thiago H. Segreto, Ranulfo Bezerra, Ricardo V. Godoy, Marcelo Becker

Published Tue, 10 Ma

Imagine you are a robot dog with a robotic arm, tasked with picking up a specific item from a messy, cluttered table. The table is covered in boxes, tools, and other objects. Some items are hidden behind others, and you can only see part of them from where you are standing.

This paper describes a new "brain" for that robot dog that makes it much better at this job, even when the view is blocked. Here is how it works, broken down into simple steps with some creative analogies.

The Problem: The "Blindfolded" Robot

Traditionally, robots try to grab things based only on what they can see right now.

  • The Analogy: Imagine trying to pick up a specific book from a messy shelf, but you are only allowed to look at it from one angle. If the book is half-hidden behind a lamp, you might guess where to grab it, but you could end up knocking the lamp over, or your hand might hit the shelf because you couldn't see the space around the book.
  • The Result: In the experiments, a standard robot (the "baseline") only succeeded 30% of the time. It kept crashing into things or couldn't reach the object because it didn't "know" what was hidden behind the clutter.

The Solution: The "Super-Imagination" Pipeline

The authors built a system that lets the robot understand the world in three dimensions, even when parts of it are missing. Here is the step-by-step process:

1. Listening and Finding (The "Smart Search")

Instead of the robot needing to know the exact coordinates of an object, a human just says, "Pick up the blue bottle."

  • How it works: The robot uses a special AI (a Vision-Language Model) that acts like a super-smart librarian. It looks at the picture and finds the "blue bottle" even if it's mixed in with other junk. It draws a digital box around it and then cuts out a precise "stencil" (mask) of just that object.
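The grounding step above can be sketched in a few lines. This is a toy stand-in, not the paper's code: the `Detection` records would come from a real vision-language model, and the box-to-mask step is a crude placeholder for per-pixel segmentation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    box: tuple   # (x0, y0, x1, y1) in pixels
    score: float

def select_target(detections, query):
    """Pick the highest-confidence detection whose label appears in the
    language query (stand-in for VLM-based grounding)."""
    matches = [d for d in detections if d.label in query]
    if not matches:
        return None
    return max(matches, key=lambda d: d.score)

def box_to_mask(box, width, height):
    """Rasterize the box into a binary 'stencil'; the real pipeline
    refines this into a per-pixel segmentation mask."""
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0
             for x in range(width)] for y in range(height)]

# Hypothetical detections for the scene described in the text.
detections = [Detection("red box", (0, 0, 3, 3), 0.8),
              Detection("blue bottle", (4, 1, 7, 5), 0.9)]
target = select_target(detections, "Pick up the blue bottle")
mask = box_to_mask(target.box, width=8, height=6)
```

The point of the mask is that everything downstream (3D completion, grasp sampling) operates only on the pixels belonging to the requested object, not the whole cluttered scene.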

2. Filling in the Blanks (The "3D Puzzle Solver")

This is the magic part. Since the robot can only see the front of the bottle, the back is a mystery.

  • The Analogy: Imagine you are looking at a snowman from the front. You can see the nose and the eyes, but the back is hidden. A normal robot would try to grab the invisible back. This new system uses AI to imagine the rest of the snowman. It asks, "If I see this front, what does the back probably look like?"
  • How it works: The system takes the partial 3D data it has and uses two AI models to "hallucinate" (predict) the missing parts of the object. It fills in the holes, creating a complete, solid 3D model of the object, even though the robot never actually saw the back.
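To make the idea concrete, here is a deliberately simple completion heuristic: reflect the visible points across the object's centroid along the viewing axis. The paper uses learned models for this prediction; mirroring is only an assumption-laden sketch that shows what "filling in the back" means on a point cloud.

```python
def complete_by_mirroring(points, view_axis=0):
    """Toy shape completion: reflect visible 3D points across the plane
    through the centroid, perpendicular to the viewing axis. A crude
    stand-in for the learned completion networks in the paper."""
    n = len(points)
    centroid = sum(p[view_axis] for p in points) / n
    mirrored = []
    for p in points:
        q = list(p)
        q[view_axis] = 2 * centroid - p[view_axis]  # reflect this coordinate
        mirrored.append(tuple(q))
    return points + mirrored

# A few visible points on the "front" of an object (camera looks along x).
partial = [(0.0, 0.0, 0.0), (0.2, 1.0, 0.0), (0.4, 0.0, 1.0)]
completed = complete_by_mirroring(partial)
```

Mirroring only works for roughly symmetric objects; the learned models handle arbitrary shapes, but the output plays the same role: a watertight 3D model the grasp planner can reason about.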

3. The Safety Check (The "Dance Rehearsal")

Now that the robot has a complete picture of the object, it needs to figure out how to grab it without crashing.

  • The Analogy: Before a dancer performs a complex move on a crowded stage, they rehearse the whole routine in their head to make sure they won't trip over props or hit the audience.
  • How it works: The robot simulates thousands of different ways to grab the object. For each candidate it checks:
    • "If I grab it here, will my arm hit the box next to it?"
    • "Can my body actually reach that spot?"
    • "Is this a stable grip?"
  It picks the safest and most reachable option, discarding any that would cause a collision.
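The filtering loop above can be sketched as follows. The collision check here is a simple axis-aligned-box test and the reachability check a distance threshold; the real system uses full arm kinematics and scene geometry, so treat every name and number below as illustrative.

```python
import math

def aabb_contains(box, p):
    """True if point p lies inside the axis-aligned box ((min), (max))."""
    (x0, y0, z0), (x1, y1, z1) = box
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1 and z0 <= p[2] <= z1

def filter_grasps(candidates, obstacles, base, reach):
    """Keep grasps that are collision-free and reachable, then return
    the most stable one (None if nothing survives)."""
    feasible = []
    for g in candidates:
        pos = g["position"]
        if any(aabb_contains(b, pos) for b in obstacles):
            continue  # gripper would hit the clutter
        if math.dist(base, pos) > reach:
            continue  # arm cannot reach from the current stance
        feasible.append(g)
    return max(feasible, key=lambda g: g["stability"], default=None)

# Hypothetical candidates: one collides, one is out of reach, one is fine.
candidates = [
    {"position": (0.5, 0.0, 0.2), "stability": 0.9},
    {"position": (3.0, 0.0, 0.2), "stability": 0.8},
    {"position": (1.0, 0.2, 0.3), "stability": 0.7},
]
obstacles = [((0.4, -0.1, 0.0), (0.7, 0.1, 0.5))]  # the box next to the object
best = filter_grasps(candidates, obstacles, base=(0.0, 0.0, 0.0), reach=1.5)
```

Note that the winner is not the most "stable" candidate in isolation: the two higher-scoring grasps are rejected by the safety checks first, which is exactly the rehearsal behavior the analogy describes.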

4. Moving the Body (The "Dance Step")

Sometimes, even with a perfect plan, the robot is standing in the wrong spot.

  • The Analogy: If you are trying to reach a high shelf but your feet are stuck, you can't get it. You have to take a step closer or move to the side.
  • How it works: If the robot realizes it can't reach the object from its current spot, it doesn't just give up. It moves its legs (repositions its base) to get a better angle, then extends its arm to grab the object.
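A minimal sketch of that fallback, under the assumption that the base moves in a 2D plane and "reachable" just means "within arm's length" (the real controller plans full-body motion):

```python
import math

def reposition_base(base, target, reach, step=0.25, max_steps=20):
    """If the target is outside the arm's reach, walk the base toward it
    in small steps until it is within reach. Returns the final base
    position and whether the target became reachable."""
    bx, by = base
    tx, ty = target
    for _ in range(max_steps):
        d = math.hypot(tx - bx, ty - by)
        if d <= reach:
            return (bx, by), True   # close enough: extend the arm
        bx += step * (tx - bx) / d  # take one step toward the target
        by += step * (ty - by) / d
    return (bx, by), False          # gave up after max_steps

# Object 2 m away, arm reach only 0.9 m: the robot must walk first.
pose, reached = reposition_base((0.0, 0.0), (2.0, 0.0), reach=0.9)
```

The key behavior is the feedback loop: reachability is re-checked after every step, so the robot walks only as far as it needs to before switching back to the arm.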

The Results: A Big Win

The researchers tested this on a real robot dog (Boston Dynamics' Spot) in two messy scenarios:

  1. The Drill: A power drill hidden behind boxes.
  2. The Blue Bottle: A bottle tucked behind other items.
  • Old Way (View-Dependent): The robot failed 70% of the time. It either crashed into the clutter or couldn't reach the object because it didn't account for the hidden parts.
  • New Way (View-Agnostic): The robot succeeded 90% of the time.

Why This Matters

This paper shows that for robots to work in the real world (which is messy and full of hidden things), they can't just rely on what their cameras see right now. They need to:

  1. Understand language to know what to pick up.
  2. Use imagination to fill in the parts they can't see.
  3. Plan carefully to avoid crashing before they even move.

It's the difference between a robot that blindly reaches out and knocks everything over, and a robot that thinks, plans, and successfully picks up the item it was asked for.