GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion

GraspLDP enhances the precision and generalization of imitation-learned robotic grasping by integrating grasp pose priors and a self-supervised reconstruction objective into a latent diffusion policy framework.

Enda Xiang, Haoxiang Ma, Xinzhu Ma, Zicheng Liu, Di Huang

Published 2026-02-27

Imagine you are teaching a robot to pick up a coffee mug from a messy table. This seems simple to us, but for a robot, it's a nightmare of math and physics. It has to figure out where the mug is, how to hold it without crushing it, and how to move its arm smoothly to get there.

The paper "GraspLDP" introduces a new way to teach robots this skill. Think of it as upgrading the robot's brain from a "guess-and-check" learner to a "visionary expert" with a built-in safety net.

Here is the breakdown using simple analogies:

1. The Problem: The Robot's "Blind" Struggle

Previous methods tried to teach robots in two ways, both of which had flaws:

  • The "Map Reader" (Grasp Detectors): These are like GPS systems that tell the robot, "The handle is there." They are great at finding the spot, but they don't know how to move the arm to get there. They just point and say "Go!"
  • The "Dancer" (Diffusion Policies): These are like dancers who learned by watching thousands of videos. They can move their arms beautifully and adapt to new situations. But, they sometimes get the details wrong. They might reach for the mug but miss the handle, or grab it at a weird angle and drop it. They are good at the dance, but bad at the grip.

The Issue: When you combine them, the "Map Reader" just gives the "Dancer" a piece of paper with a coordinate on it. The dancer ignores the paper because it's too abstract, or the paper doesn't match the messy reality of the room.

2. The Solution: GraspLDP (The "Architect & The Builder")

The authors created GraspLDP, which acts like a perfect partnership between an Architect and a Builder.

  • The Architect (The Grasp Detector): This is the expert who looks at the object and says, "To pick this up, you need to hold it exactly here, at this specific angle." It creates a "blueprint" of the perfect grip.
  • The Builder (The Latent Diffusion Policy): This is the robot's arm, learning how to move.

The Magic Trick: The "Latent Space" (The Secret Language)
Instead of the Architect just shouting coordinates to the Builder (which the Builder often ignores), they speak a secret language called "Latent Space."

  • Imagine the Architect draws the blueprint on a special piece of transparent film.
  • The Builder doesn't just look at the film; they wear it. The blueprint is fused directly into the Builder's mind.
  • As the Builder moves, the blueprint guides every muscle twitch, ensuring the arm naturally flows toward the perfect grip without the Builder having to "think" about the math.
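In code terms, the idea is that the grasp detector's pose prior is encoded into the same latent space that conditions the diffusion policy's denoising steps, rather than being handed over as raw coordinates. Here is a minimal toy sketch of that conditioning pattern; the dimensions, the linear "encoders", the additive fusion, and the simplified denoising update are all illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not taken from the paper.
OBS_DIM, POSE_DIM, LATENT_DIM = 32, 7, 16  # pose: xyz + quaternion

# Toy linear projections standing in for learned encoder networks.
W_obs = rng.standard_normal((OBS_DIM, LATENT_DIM)) * 0.1
W_pose = rng.standard_normal((POSE_DIM, LATENT_DIM)) * 0.1

def fuse_condition(obs, grasp_pose):
    """Project the observation and the grasp-pose prior into a shared
    latent space and fuse them (here: simple addition) so the prior
    conditions every denoising step, not just the final target."""
    return obs @ W_obs + grasp_pose @ W_pose

def denoise_step(action_latent, cond, t):
    """One toy denoising step: nudge the noisy action latent toward the
    conditioned target, with the step size growing as t approaches 0."""
    return action_latent + (cond - action_latent) * (1.0 / (t + 1))

obs = rng.standard_normal(OBS_DIM)
grasp_pose = np.array([0.3, -0.1, 0.25, 0.0, 0.0, 0.0, 1.0])  # xyz + quat

cond = fuse_condition(obs, grasp_pose)
z = rng.standard_normal(LATENT_DIM)   # start from pure noise
for t in reversed(range(10)):         # iterative refinement
    z = denoise_step(z, cond, t)
```

Because the pose prior lives inside the conditioning latent, every intermediate denoising step is pulled toward the expert grasp, which is the "wearing the blueprint" intuition from the analogy above.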

3. The "Flashlight" (Visual Graspness Cue)

Sometimes, the room is dark, or the table is cluttered. The robot might get confused.

  • GraspLDP adds a Flashlight (called a "Graspness Cue").
  • This isn't just a regular light; it's a "magic highlighter" that glows only on the parts of the object that are safe to grab.
  • Even if the robot is confused by a shadow or a weird angle, the Flashlight says, "Hey, look here! This is the safe spot!"
  • The Self-Correction: The robot is also trained to "reconstruct" this flashlight image in its mind. If it loses track of the light, it tries to redraw it. This forces the robot to pay attention to the most important parts of the image, ignoring the clutter.
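The self-correction step amounts to an auxiliary self-supervised objective: a decoder head tries to reconstruct the graspness map from the policy's latent, and the reconstruction error is added to the main training loss. A toy numpy sketch, where the map size, the linear decoder, and the 0.1 loss weight are all illustrative guesses rather than the paper's values:

```python
import numpy as np

rng = np.random.default_rng(1)

H, W, LATENT_DIM = 8, 8, 16  # toy sizes, not from the paper

# Ground-truth graspness map: high values mark safe-to-grab pixels.
graspness = np.zeros((H, W))
graspness[3:5, 3:5] = 1.0

# Toy linear decoder standing in for a learned reconstruction head.
W_dec = rng.standard_normal((LATENT_DIM, H * W)) * 0.1

def reconstruction_loss(latent, target):
    """Auxiliary objective: decode the graspness map from the policy
    latent and penalize the pixel-wise squared error, forcing the
    latent to keep track of where the safe grasp regions are."""
    recon = (latent @ W_dec).reshape(H, W)
    return np.mean((recon - target) ** 2)

latent = rng.standard_normal(LATENT_DIM)
aux = reconstruction_loss(latent, graspness)

policy_loss = 0.42            # placeholder for the diffusion loss
total = policy_loss + 0.1 * aux  # weighted sum; the weight is a guess
```

The useful property is that the policy cannot minimize the total loss while "losing the flashlight": any latent that discards the graspness information pays a reconstruction penalty.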

4. The "Smart Selector" (Heuristic Pose Selector)

The Architect (Grasp Detector) might suggest 10 different ways to grab the mug. Which one should the robot use?

  • Old Way: Pick the one that looks "best" on paper, even if it's on the other side of the room (requiring a long, awkward reach).
  • GraspLDP Way (The Smart Selector): It acts like a smart traffic cop. It looks at the 10 options and asks two questions:
    1. "Is this a good grip?" (Quality)
    2. "Is it close to where my arm is right now?" (Proximity)
    It picks the option that is a good grip AND easy to reach, ensuring the movement is smooth and safe.
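A minimal way to implement that two-question scoring is to combine the detector's quality score with a distance penalty to the current end-effector position. The exact scoring form and the `alpha` weight below are illustrative assumptions, not the paper's heuristic:

```python
import numpy as np

def select_grasp(candidates, qualities, ee_pos, alpha=0.5):
    """Score each candidate grasp as quality minus a weighted distance
    penalty to the current end-effector position; return the index of
    the best-scoring grasp."""
    dists = np.linalg.norm(candidates - ee_pos, axis=1)
    scores = qualities - alpha * dists
    return int(np.argmax(scores))

# Three candidate grasp positions (xyz) with detector quality scores.
candidates = np.array([[0.90, 0.0, 0.2],   # highest quality, but far away
                       [0.20, 0.1, 0.2],   # decent quality, close by
                       [0.50, 0.5, 0.2]])
qualities = np.array([0.95, 0.85, 0.60])
ee_pos = np.array([0.15, 0.1, 0.2])

best = select_grasp(candidates, qualities, ee_pos)
print(best)  # → 1
```

In this toy setup the nearby decent grasp (index 1) beats the far-away top-quality one, which is exactly the "good grip AND easy to reach" trade-off described above.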

5. The Results: Why It Matters

In their tests, this new method was a huge success:

  • Better Accuracy: It grabbed objects much more precisely than previous robots, rarely missing the handle.
  • Better Generalization: If you put a mug in a weird spot, or change the lighting, or use a totally new object (like a weird-shaped bottle), GraspLDP figured it out immediately. It didn't need to relearn everything.
  • Dynamic Grasping: It could even grab moving objects (like a banana being tossed in the air). Because it updates its "blueprint" in real-time, it can chase and catch moving targets, whereas older robots would just freeze or miss.

Summary

GraspLDP is like giving a robot a superpower: it combines the "perfect grip knowledge" of a human expert with the "smooth movement skills" of a dancer. By fusing these two into a single, secret language (Latent Diffusion) and using a "magic flashlight" to highlight safe spots, the robot can pick up almost anything, anywhere, even in the dark or while things are moving. It's a major step toward robots that can actually help us in our messy, real-world homes.
