DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping

The paper introduces DexGrasp-Zero, a policy that combines a morphology-aligned graph representation with a Morphology-Aligned Graph Convolutional Network (MAGCN) and physical-property injection to achieve robust zero-shot cross-embodiment dexterous grasping: one policy works across diverse hand hardware without re-learning.

Yuliang Wu, Yanhan Lin, WengKit Lao, Yuhao Lin, Yi-Lin Wei, Wei-Shi Zheng, Ancong Wu

Published 2026-03-18

Imagine you have a master chef who is an expert at cooking with a specific set of kitchen tools: a French knife, a wooden spoon, and a cast-iron skillet. If you suddenly hand them a pair of chopsticks, a ladle, and a wok, they might freeze. They know how to cook, but they don't know how to translate their muscle memory to these new, weirdly shaped tools.

This is the exact problem robots face today. We have built many different "robot hands" (dexterous hands), but they all look and move differently. If we train a robot brain to pick up a cup with a "Shadow Hand," that brain usually fails completely when we swap it for a "LEAP Hand" or a "Schunk Hand." It's like handing a car driver the helm of a boat and expecting their muscle memory to carry over.

DexGrasp-Zero is a new method that teaches a robot hand to be a "universal chef." It allows a single robot brain to learn how to grasp objects and then instantly use that skill on any new hand it sees, without needing to re-learn or practice.

Here is how they did it, using some simple analogies:

1. The Problem: The "Middleman" Trap

Previous methods tried to solve this by using a "middleman."

  • The Old Way: The robot brain would say, "I want my fingers to move to this specific spot in space." Then, a separate translator program would try to figure out how to move the specific robot hand to get there.
  • The Flaw: This is like telling a translator, "I want to go to the top of the mountain," without telling them which mountain you are standing on. The translator might try to walk up a cliff that doesn't exist for that specific hand, causing the robot to crash or try to bend a finger in a way that breaks it.
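The flaw above can be made concrete with a toy reachability check. This is a hypothetical sketch (the link lengths and function names are illustrative, not from the paper): a fingertip target that one hand can reach may simply lie outside another hand's reach, so a hand-specific translator handed that target has nowhere valid to go.

```python
# Hypothetical sketch of the "middleman" failure mode: the brain outputs a
# fingertip target in space, and a per-hand translator (e.g. an inverse-
# kinematics solver) tries to reach it. All numbers are illustrative.

def reachable(target_dist: float, link_lengths: list[float]) -> bool:
    """A fingertip target is reachable only within the finger's total reach."""
    return target_dist <= sum(link_lengths)

# Two hands with different finger-link lengths (meters, made up for illustration).
hand_a_finger = [0.045, 0.025, 0.026]  # total reach: 0.096 m
hand_b_finger = [0.040, 0.030, 0.020]  # total reach: 0.090 m

target = 0.095  # fingertip target distance from the knuckle (m)

reachable(target, hand_a_finger)  # fine for hand A...
reachable(target, hand_b_finger)  # ...but impossible for hand B
```

A target expressed in raw space bakes in one hand's geometry; that is why DexGrasp-Zero moves to a representation that already accounts for the hand.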

2. The Solution: Speaking a "Universal Language"

The authors realized that even though robot hands look different, they all share the same anatomy. They all have a wrist, a palm, a thumb, and fingers with joints that bend, spread, and twist.

Instead of teaching the robot to move "fingers," they taught it to speak a Universal Motion Language based on three simple movements:

  1. Flex: Bending the finger inward (like making a fist).
  2. Abduct: Spreading the finger out (like spreading your fingers wide).
  3. Rotate: Twisting the finger.

Think of this like LEGO bricks. Whether you have a small LEGO set or a giant one, the basic bricks (Flex, Abduct, Rotate) are the same. The robot brain learns to build a "grasp" using these universal bricks, rather than trying to memorize the specific instructions for every single hand model.
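One way to picture the universal language is as a normalized, per-primitive command space that every hand maps into and out of. The sketch below is an illustration of that idea only; the joint names, limits, and mapping are hypothetical, not the paper's actual formulation.

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative sketch: a shared flex/abduct/rotate command space.
# All names and joint limits below are hypothetical.

class Primitive(Enum):
    FLEX = "flex"      # bend inward, toward the palm
    ABDUCT = "abduct"  # spread sideways, away from neighboring fingers
    ROTATE = "rotate"  # twist about the finger's long axis

@dataclass
class Joint:
    name: str
    primitive: Primitive  # which universal motion this joint realizes
    lower: float          # joint limit (radians)
    upper: float

def to_universal(joint: Joint, angle: float) -> tuple[Primitive, float]:
    """Map a hand-specific joint angle to a normalized command in [0, 1]."""
    return joint.primitive, (angle - joint.lower) / (joint.upper - joint.lower)

def from_universal(joint: Joint, command: float) -> float:
    """Map a normalized universal command back to this hand's joint angle."""
    return joint.lower + command * (joint.upper - joint.lower)

# Two hands with different limits share the same normalized command space.
hand_a_knuckle = Joint("hand_a_index_mcp", Primitive.FLEX, 0.0, 1.571)
hand_b_knuckle = Joint("hand_b_index_mcp", Primitive.FLEX, -0.314, 2.230)

prim, cmd = to_universal(hand_a_knuckle, 0.785)  # "half flex" on hand A
angle_b = from_universal(hand_b_knuckle, cmd)    # same "half flex" on hand B
```

The brain only ever reasons in the normalized space (the LEGO bricks); each hand's geometry lives in the per-joint mapping.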

3. The Secret Sauce: The "Physical Cheat Sheet"

Just knowing the universal language isn't enough. A tiny robot hand can't lift a heavy rock, and a giant hand might crush a grape.

The researchers gave the robot brain a "Physical Cheat Sheet" (called a Morphology-Aligned Graph).

  • Before the robot tries to grab something, it looks at the "cheat sheet" for the specific hand it is currently using.
  • This sheet tells the brain: "Hey, this hand has short fingers, so don't try to wrap around that big ball," or "This hand has strong motors, so you can squeeze harder."
  • It's like a GPS that knows exactly what kind of car you are driving. If you are in a tiny Mini Cooper, the GPS won't tell you to take a route with a low bridge; if you are in a truck, it won't tell you to take a narrow alley.
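A minimal way to sketch the "cheat sheet" is a graph whose nodes are hand parts carrying physical-property features, with edges along the kinematic chain. The sketch below is an assumption-laden illustration (node names, feature choices, and the mean-aggregation rule are invented here; the paper's MAGCN details may differ), showing how one message-passing step spreads physical limits through the graph.

```python
# Hypothetical morphology graph: each node holds physical properties.
# Feature vector per node: [link_length_m, joint_limit_rad, max_torque_Nm].
# All numbers are illustrative.

nodes = {
    "palm":        [0.100, 0.00, 0.0],
    "thumb_base":  [0.040, 1.57, 0.8],
    "thumb_tip":   [0.030, 1.40, 0.5],
    "index_base":  [0.050, 1.57, 0.6],
    "index_tip":   [0.035, 1.40, 0.4],
}

# Edges follow the kinematic chain (parent -> child), so message passing
# mirrors how motion and force propagate through the hand.
edges = [
    ("palm", "thumb_base"), ("thumb_base", "thumb_tip"),
    ("palm", "index_base"), ("index_base", "index_tip"),
]

def propagate(features: dict[str, list[float]]) -> dict[str, list[float]]:
    """One mean-aggregation message-passing step over the kinematic graph."""
    out = {}
    for node, feat in features.items():
        neighbors = [features[b] for a, b in edges if a == node]
        neighbors += [features[a] for a, b in edges if b == node]
        msgs = [feat] + neighbors
        out[node] = [sum(vals) / len(msgs) for vals in zip(*msgs)]
    return out

updated = propagate(nodes)
# After one step, the palm's features blend in its fingers' limits and
# torques, so downstream layers "see" the whole hand's capabilities.
```

A real GCN would replace the plain mean with learned weight matrices and nonlinearities, but the structural idea is the same: the policy reads the hand's physics off the graph before deciding how to grasp.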

4. The Result: Zero-Shot Transfer

Because the robot brain learns the concept of grasping (using the universal bricks) and checks the physical limits (using the cheat sheet) for every new hand, it works instantly.

  • Training: They trained the brain on four different types of robot hands.
  • The Test: They then gave it two completely new hands it had never seen before.
  • The Outcome: The robot didn't need to practice. It just looked at the new hand, checked the cheat sheet, and successfully grabbed objects.
    • Success Rate: It worked 85% of the time in simulation and 82% of the time in the real world.

Why This Matters

Imagine a future where a factory has 100 different robots, and a new one is added tomorrow. With old methods, you'd have to spend weeks teaching the new robot how to pick up a screw. With DexGrasp-Zero, you just plug the new robot in, and it already knows how to do the job because it understands the language of hands, not just the specific model.

It turns the robot from a "specialist" who only knows one tool into a "generalist" who can adapt to any tool in the shed.
