AdaClearGrasp: Learning Adaptive Clearing for Zero-Shot Robust Dexterous Grasping in Densely Cluttered Environments

Imagine you are at a crowded dinner party, and you need to grab a specific red cup from the center of a table. The problem? The table is a disaster zone. It's piled high with plates, napkins, a bowl of fruit, and other cups, all jumbled together. The red cup is buried underneath.

If you just reach in blindly, you'll likely knock everything over, spill the punch, and fail to get the cup. If you try to move everything off the table just to get that one cup, you'll waste time and risk breaking the expensive china.

AdaClearGrasp is a new robotic "brain" designed to solve exactly this kind of messy problem. It teaches a robot how to be a smart, patient, and dexterous waiter who knows exactly when to move things aside and when to just grab the target.

Here is how it works, broken down into simple concepts:

1. The "Smart Manager" (The VLM Planner)

Think of the robot's high-level brain as a Smart Manager who can see the whole table and understand language.

The Job: When you tell the robot, "Get me the red cup," the Manager doesn't just rush in. It looks at the mess and asks: "Is the cup blocked? If so, by what? Do I need to move the orange, or just the napkin?"
The Analogy: It's like a human looking at a cluttered desk. You don't just grab the pen; you might first slide a stack of papers to the left or push a coffee mug out of the way. The Manager decides which objects to move and how to move them before the robot's hand even touches anything.

2. The "Toolbox" (Atomic Skills)

Once the Manager makes a plan, it doesn't try to invent a new way to move things every time. Instead, it uses a pre-made Toolbox of simple, reliable moves.

The Moves: These are basic actions like "Push left," "Pull right," "Lift up," or "Reset hand."
The Analogy: Imagine the Manager is a conductor, and the robot's arm is an orchestra. The conductor doesn't tell the violinist how to hold the bow; they just say, "Play a C-sharp." Similarly, the Manager says, "Push the orange to the left," and the robot's low-level system knows exactly how to execute that push safely.

3. The "Intuitive Grabber" (GeoGrasp)

Once the path is clear, the robot needs to actually grab the object. This is where GeoGrasp comes in.

The Magic: Most robots need to be taught specifically how to grab a cup, then separately how to grab a ball, then a cube. GeoGrasp is different. It doesn't care what the object is (a cup or a shoe); it only cares about the shape and geometry.
The Analogy: Think of it like a human hand. You don't need to study a specific apple to know how to grab it; your brain just recognizes the curve and the size. GeoGrasp is trained to read the "shape" of an object in the same way. Because of this, after training on just a cube, a cup, and an apple, it can often grab a can, a ball, or a Lego brick it has never seen before. This is zero-shot generalization: it handles the new object on the fly without needing a new lesson.

4. The "Safety Net" (Closed-Loop Feedback)

Robots aren't perfect. Sometimes they slip, or the object moves unexpectedly.

The System: AdaClearGrasp is a closed-loop system. This means it constantly checks its own work.
The Analogy: Imagine you are trying to pick up a slippery bar of soap. If you miss, you don't just keep trying the exact same motion until you break your hand. You stop, look at the soap, adjust your grip, and try again.
If the robot tries to push an object and it gets stuck, the "Manager" sees the failure, says, "Okay, that didn't work. Let's try pulling instead," and replans immediately. This prevents the robot from getting stuck in a loop of failure.

5. The "Training Ground" (Clutter-Bench)

To prove this works, the researchers built a special test called Clutter-Bench.

The Test: They created a video game-like simulation with three levels of messiness:
- Level 1: A few scattered items (Easy).
- Level 2: A medium pile (Medium).
- Level 3: A mountain of objects (Hard).
They tested the robot on 210 different scenarios. A baseline without any clearing logic almost never succeeded, and failed every time in the densest clutter. AdaClearGrasp grabbed the target in 89% of Level 1 scenes, 84% of Level 2 scenes, and 76% of Level 3 scenes by intelligently clearing the path first.

Why This Matters

Before this, robots were either too clumsy to handle messy rooms or too rigid to adapt when things went wrong.

Old Way: "I see a cup. I will try to grab it." (Result: Crash).
AdaClearGrasp Way: "I see a cup buried under a pear. I will push the pear aside, check if the cup is free, and then grab it. If I slip, I'll try again."

In short, AdaClearGrasp gives robots the common sense to clean up their workspace before doing the job, the intuition to grab anything based on its shape, and the patience to fix mistakes when things go wrong. It's a huge step toward robots that can actually help us in our messy, real-world kitchens and living rooms.

Here is a detailed technical summary of the paper "AdaClearGrasp: Learning Adaptive Clearing for Zero-Shot Robust Dexterous Grasping in Densely Cluttered Environments."

1. Problem Statement

Robotic manipulation in densely cluttered environments faces three primary challenges:

Physical Interference: Surrounding objects block the target, making direct grasping impossible.
Visual Occlusions: Targets are hidden, preventing accurate pose estimation.
Unstable Contacts: Aggressive clearing strategies (e.g., pushing everything aside) risk damaging objects or the robot, while doing nothing leads to failure.

Existing methods often fail because they either rely on end-to-end Reinforcement Learning (RL) that struggles with long-horizon reasoning in dense scenes, or they use open-loop Vision-Language Models (VLMs) that lack the ability to recover from execution failures or handle complex physical interactions. The core problem is how to adaptively decide whether to clear obstacles or grasp directly, and how to execute this decision robustly in a closed-loop manner.

2. Methodology: AdaClearGrasp Framework

The authors propose AdaClearGrasp, a closed-loop, hierarchical framework that integrates high-level semantic reasoning with low-level geometric control.

A. Hierarchical Architecture

The system runs as a loop over four components:

VLM-Based Semantic Planner:
- Input: RGB images, language instructions, and execution feedback (e.g., "stuck," "failed").
- Model: Uses Qwen3-VL-32B-Instruct.
- Function: Analyzes the scene to determine if the target is occluded. It generates a high-level plan (JSON format) deciding whether to clear (push/pull/move obstacles) or grasp directly.
- Safety: Includes structured constraints (bounded action parameters) and a reasoning field for interpretability.
Model Context Protocol (MCP) Server:
- Acts as a bridge between the VLM and the robot.
- Translates high-level semantic commands into parameterized atomic skills (e.g., push(side="left", dist=0.1)).
- Decouples the reasoning logic from hardware-specific implementations, allowing modular skill expansion.
Atomic Skill Library:
- Clearing/Recovery Primitives: Deterministic geometric motion planners for pushing, pulling, lifting, and resetting the arm/hand to safe states upon failure.
- GeoGrasp (RL Policy): A specialized dexterous grasping policy for the final target acquisition.
Closed-Loop Execution:
- Visual feedback monitors the outcome of every action.
- If a step fails (e.g., gripper slip, object stuck), the system triggers replanning. The VLM receives the failure feedback and generates a new strategy (e.g., changing the clearing direction or resetting the arm).
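The four components above can be read as a plan, validate, execute, replan cycle. The following Python sketch illustrates that cycle under stated assumptions; it is not the authors' implementation. The plan schema (`reasoning`, `steps`, `skill`, `target`, `side`, `dist`), the 0.15 m distance bound, the retry budget, and the `planner` and `execute` callables are all hypothetical stand-ins for the VLM, the MCP server, and the skill library.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# All names and limits below are illustrative assumptions, not values from the paper.
ALLOWED_SKILLS = {"push", "pull", "lift", "reset", "grasp"}
ALLOWED_SIDES = {"left", "right"}
MAX_DIST_M = 0.15  # hypothetical bound on push/pull distance


@dataclass
class Step:
    skill: str
    target: str
    side: Optional[str] = None
    dist: float = 0.0


def validate_plan(plan: dict) -> List[Step]:
    """Convert a JSON-like plan into typed steps with bounded parameters."""
    if not plan.get("reasoning"):
        raise ValueError("plan is missing its reasoning field")
    steps = []
    for raw in plan.get("steps", []):
        if raw["skill"] not in ALLOWED_SKILLS:
            raise ValueError(f"unknown skill: {raw['skill']!r}")
        side = raw.get("side")
        if side is not None and side not in ALLOWED_SIDES:
            raise ValueError(f"unknown side: {side!r}")
        # Clamp the distance into [0, MAX_DIST_M] instead of trusting the planner.
        dist = min(max(float(raw.get("dist", 0.0)), 0.0), MAX_DIST_M)
        steps.append(Step(raw["skill"], raw["target"], side, dist))
    return steps


Planner = Callable[[Optional[str]], dict]      # feedback -> plan
Executor = Callable[[Step], Tuple[bool, str]]  # step -> (ok, feedback)


def run_closed_loop(planner: Planner, execute: Executor, max_replans: int = 3) -> bool:
    """Plan, execute step by step, and replan with feedback after any failure."""
    feedback: Optional[str] = None
    for _ in range(max_replans + 1):
        steps = validate_plan(planner(feedback))
        all_ok = True
        for step in steps:
            ok, feedback = execute(step)
            if not ok:
                all_ok = False
                break  # drop the rest of this plan; the planner sees the feedback
        # The task counts as done only when a plan ends in a successful grasp.
        if all_ok and steps and steps[-1].skill == "grasp":
            return True
    return False
```

Clamping parameters in `validate_plan` rather than rejecting the plan is one possible design choice: it keeps the loop moving when the planner proposes an out-of-range distance, at the cost of silently altering the plan.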

B. GeoGrasp: Geometry-Aware RL Policy

To achieve zero-shot generalization across diverse objects, the authors designed GeoGrasp, an RL-based policy with a unique observation space:

Observation Space ($S \in \mathbb{R}^{59}$): Instead of relying on object categories or textures, it uses relative hand-object geometry:
- 18 unit-normalized nearest-neighbor vectors from hand keypoints to the object point cloud (18 × 3 = 54 dimensions).
- Target height and end-effector (tool center point, TCP) state, which account for the remaining 5 dimensions.
Reward Function: A dense reward combining lifting height, a success term (lifting the object more than 15 cm), contact stability, and nearest-neighbor distance reduction.
Training: Trained with Proximal Policy Optimization (PPO) on three objects (Cube, Cup, Apple) in a clutter-free environment.
Key Advantage: By grounding the policy in local geometric relations rather than appearance, it generalizes to unseen object shapes and physical properties without fine-tuning.
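To make the observation and reward descriptions concrete, here is a minimal dependency-free Python sketch. It is an illustration under assumptions, not the paper's code: the split of the last 5 dimensions into 1 target-height value and a 4-value TCP state is a guess that merely makes the total come to 59, and the reward weights are hypothetical placeholders. Only the 18 keypoints, the 59-dimensional total, and the 15 cm success threshold come from the summary above.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]

NUM_KEYPOINTS = 18       # from the text
OBS_DIM = 59             # from the text
TCP_DIM = 4              # assumption: 59 - 18 * 3 - 1 (target height)
SUCCESS_HEIGHT_M = 0.15  # from the text: success means lifting above 15 cm


def nearest_unit_vector(keypoint: Vec3, cloud: Sequence[Vec3]) -> Tuple[Vec3, float]:
    """Unit vector from a hand keypoint to its nearest cloud point, plus the distance."""
    nearest = min(cloud, key=lambda p: math.dist(keypoint, p))
    d = math.dist(keypoint, nearest)
    if d < 1e-9:  # keypoint is touching the object; direction is undefined
        return (0.0, 0.0, 0.0), 0.0
    unit = tuple((n - k) / d for n, k in zip(nearest, keypoint))
    return unit, d


def build_observation(keypoints: Sequence[Vec3], cloud: Sequence[Vec3],
                      target_height: float,
                      tcp_state: Sequence[float]) -> Tuple[List[float], float]:
    """Return the 59-dim observation and the mean keypoint-to-object distance."""
    if len(keypoints) != NUM_KEYPOINTS or len(tcp_state) != TCP_DIM:
        raise ValueError("unexpected number of keypoints or TCP values")
    obs: List[float] = []
    total_dist = 0.0
    for kp in keypoints:
        unit, d = nearest_unit_vector(kp, cloud)
        obs.extend(unit)          # 18 * 3 = 54 values
        total_dist += d
    obs.append(target_height)     # 1 value
    obs.extend(tcp_state)         # 4 values (assumed layout)
    assert len(obs) == OBS_DIM
    return obs, total_dist / NUM_KEYPOINTS


def dense_reward(lift_height: float, prev_nn_dist: float, nn_dist: float,
                 contact_fraction: float) -> Tuple[float, bool]:
    """Weighted sum of the four reward terms; all weights are hypothetical."""
    success = lift_height > SUCCESS_HEIGHT_M
    reward = (
        5.0 * lift_height                  # lifting height
        + 1.0 * (prev_nn_dist - nn_dist)   # nearest-neighbor distance reduction
        + 0.5 * contact_fraction           # contact stability proxy in [0, 1]
        + (10.0 if success else 0.0)       # success bonus
    )
    return reward, success
```

Because the observation contains only directions toward the nearest surface points, a cup and a can of similar size produce similar inputs, which is one way to read the object-agnostic claim.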

3. Key Contributions

AdaClearGrasp Framework: A novel closed-loop system that treats clutter clearing as an adaptive high-level planning problem, integrating VLM reasoning with structured atomic skills and failure-driven replanning.
GeoGrasp Policy: An object-agnostic, geometry-aware RL policy that enables robust zero-shot dexterous grasping across diverse object geometries by focusing on local hand-object relations.
Clutter-Bench: A standardized simulation benchmark for language-conditioned dexterous grasping in clutter, presented by the authors as the first of its kind.
- Scope: 7 target objects (YCB dataset) across 3 difficulty levels (2, 4, and 6 obstacles).
- Scale: 210 simulated scenarios and 18 real-world scenarios.
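The simulated scale can be sanity-checked with a little arithmetic. The sketch below assumes the 210 scenarios are split evenly over every (target, level) pair, which the summary does not state; under that assumption each pair would have 10 layouts. Target objects are referred to by index because the summary does not name all seven.

```python
from itertools import product

NUM_TARGETS = 7                          # from the text
OBSTACLES_BY_LEVEL = {1: 2, 2: 4, 3: 6}  # from the text
TOTAL_SIM_SCENARIOS = 210                # from the text

num_pairs = NUM_TARGETS * len(OBSTACLES_BY_LEVEL)
# Assumption: scenarios are distributed evenly across (target, level) pairs.
layouts_per_pair, remainder = divmod(TOTAL_SIM_SCENARIOS, num_pairs)


def enumerate_scenarios():
    """Yield (target_id, level, num_obstacles, layout_id) under the even-split assumption."""
    for target_id, (level, n_obs) in product(range(NUM_TARGETS), OBSTACLES_BY_LEVEL.items()):
        for layout_id in range(layouts_per_pair):
            yield target_id, level, n_obs, layout_id


scenarios = list(enumerate_scenarios())
print(num_pairs, layouts_per_pair, remainder, len(scenarios))
```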

4. Experimental Results

Simulation Performance (Clutter-Bench)

Success Rates: AdaClearGrasp achieved average success rates of 89% (Level-1), 84% (Level-2), and 76% (Level-3).
Baseline Comparison: The VLM Scaffolding baseline (no clearing logic) achieved only 6% average success, dropping to 0% in dense clutter.
Ablation Studies:
- Removing the adaptive clearing strategy (Direct GeoGrasp) caused success rates to drop to 27% in Level-3.
- Removing the closed-loop replanning mechanism cut Level-3 success from 76% to 41%, which indicates that feedback-driven recovery accounts for much of the performance in dense clutter.
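As a quick worked check of the numbers above, the following Python snippet computes the unweighted mean of the three per-level success rates and the Level-3 drop for each ablation, in percentage points. The variable names are illustrative, and the mean assumes each level carries equal weight.

```python
# Success rates in percent, copied from the results above.
full_system = {"Level-1": 89, "Level-2": 84, "Level-3": 76}
level3_ablations = {
    "Direct GeoGrasp (no adaptive clearing)": 27,
    "No closed-loop replanning": 41,
}

# Unweighted mean across levels (assumes equal weight per level).
mean_full = sum(full_system.values()) / len(full_system)

# Absolute drop at Level-3, in percentage points.
drops = {name: full_system["Level-3"] - rate for name, rate in level3_ablations.items()}

print(f"mean success: {mean_full:.1f}%")
for name, drop in drops.items():
    print(f"{name}: -{drop} points at Level-3")
```

Read this way, removing adaptive clearing costs 49 points at Level-3 and removing replanning costs 35 points.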

Zero-Shot Generalization

GeoGrasp was trained on 3 objects but tested on 7.
It achieved high success on seen objects (e.g., Apple: 100%, Cup: 88.9%) and maintained strong performance on unseen objects (e.g., Can: 83.3%, Ball: 61.1%, Lego: 47.2%), validating the geometry-centric approach.

Sim-to-Real Transfer

Setup: Tested on an xArm7 robot with an XHand gripper and FoundationPose for 6D pose estimation.
Results: Achieved an overall 70% success rate across 90 real-world trials (3 objects × 3 clutter levels).
Performance: 90% success in sparse clutter, dropping to 50% in the densest scenarios (6 obstacles).
Significance: The system transferred directly from simulation to reality without fine-tuning, demonstrating the robustness of the geometry-aware formulation against the reality gap (sensor noise, friction, dynamics).
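The real-world figures can be cross-checked with integer arithmetic. The sketch below assumes the 90 trials are split evenly across the three clutter levels (30 each), which the summary does not state; under that assumption the reported overall, sparse, and densest rates imply a middle-level rate of 70%.

```python
TOTAL_TRIALS = 90   # from the text
OVERALL_PCT = 70    # from the text
SPARSE_PCT = 90     # from the text
DENSE_PCT = 50      # from the text
NUM_LEVELS = 3      # from the text

total_successes = TOTAL_TRIALS * OVERALL_PCT // 100

# Assumption: trials are divided evenly among the clutter levels.
trials_per_level = TOTAL_TRIALS // NUM_LEVELS
sparse_successes = trials_per_level * SPARSE_PCT // 100
dense_successes = trials_per_level * DENSE_PCT // 100

# Whatever remains must come from the middle level.
middle_successes = total_successes - sparse_successes - dense_successes
middle_pct = 100 * middle_successes // trials_per_level

print(total_successes, sparse_successes, middle_successes, dense_successes, middle_pct)
```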

5. Significance

Paradigm Shift: Moves beyond "end-to-end" or "open-loop" approaches by explicitly modeling the decision to clear as a high-level reasoning task coupled with low-level execution.
Robustness: The closed-loop feedback mechanism allows the robot to recover from real-world uncertainties (slippage, collisions) that typically cause open-loop systems to fail.
Generalization: Demonstrates that focusing on geometric relationships rather than semantic object classes is a viable path for zero-shot dexterous manipulation in complex, unstructured environments.
Benchmarking: Clutter-Bench provides a necessary standard for evaluating future research in robotic manipulation under clutter, addressing the lack of graded difficulty in previous benchmarks.

In conclusion, AdaClearGrasp represents a significant step toward deploying dexterous robots in real-world, messy environments by combining the reasoning capabilities of modern VLMs with the robustness of geometry-aware reinforcement learning.