HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation

This paper presents HeRO, a diffusion-based policy for pose-aware robotic manipulation. It builds hierarchical semantic fields that fuse geometry-sensitive DINOv2 features with globally coherent Stable Diffusion correspondences, enabling the precise part-level semantic understanding needed for complex object handling, and achieves state-of-the-art performance.

Chongyang Xu, Shen Cheng, Haipeng Li, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu

Published 2026-02-24

Imagine you are teaching a robot to put a pair of shoes on a shelf. A simple robot might just see "two shoe-shaped blobs" and try to shove them anywhere. But a smart robot needs to know that the toe of the shoe must face left and the heel must face right. If it gets this wrong, the shoes look messy, and the task fails.

This is the problem the paper HeRO solves. It teaches robots to not just see the shape of an object, but to understand its parts and meaning, just like a human does.

Here is a simple breakdown of how they did it, using some everyday analogies:

1. The Problem: The "Blind Sculptor" vs. The "Art Critic"

Previous robot brains were like blind sculptors. They could feel the 3D shape of an object (like a point cloud) very well, but they didn't know what the parts were called. They knew "this is a bump" and "this is a curve," but they couldn't tell the difference between a shoe's toe and its heel.

Other robots were like art critics looking at a 2D photo. They knew the words "toe" and "heel," but because they were looking at a flat picture, they lost the 3D depth needed to grab the object correctly.

HeRO combines the best of both worlds. It gives the robot a 3D map that is also labeled with meaning.

2. The Secret Sauce: Mixing Two Super-Brains

To build this smart map, HeRO mixes two different types of "AI brains" (called Foundation Models):

  • Brain A (DINOv2): Think of this as a detective. It is great at spotting tiny, specific details. It can tell, "That pixel is definitely the lace," or "That pixel is the sole." It's very precise but can sometimes be a bit jittery or inconsistent.
  • Brain B (Stable Diffusion): Think of this as a painter. It understands the big picture and how things flow together. It knows that a shoe is a single, smooth object, even if the lighting changes. It's very smooth but sometimes misses tiny details.
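To make the "two brains" idea concrete, here is a minimal sketch of blending two per-pixel feature maps. This is an illustration only, not the paper's actual fusion module: the plain normalize-then-concatenate scheme, the array shapes, and the feature dimensions are all assumptions.

```python
import numpy as np

def fuse_features(dino_feats: np.ndarray, sd_feats: np.ndarray) -> np.ndarray:
    """Blend per-pixel features from two vision backbones.

    dino_feats: (H, W, D1) sharp, detail-oriented features (the "detective")
    sd_feats:   (H, W, D2) smooth, globally coherent features (the "painter")
    Returns an (H, W, D1 + D2) fused map. Each map is L2-normalized first so
    neither brain dominates just because its raw activations are larger.
    """
    def l2_normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    return np.concatenate([l2_normalize(dino_feats), l2_normalize(sd_feats)], axis=-1)

# Toy example: a 4x4 image with 8-dim "DINOv2-like" and 4-dim "SD-like" features.
fused = fuse_features(np.random.rand(4, 4, 8), np.random.rand(4, 4, 4))
print(fused.shape)  # (4, 4, 12)
```

After this step, every pixel carries one vector that combines the detective's sharp details with the painter's smooth, global view.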

The Magic Step (Dense Semantic Lifting):
HeRO takes the detective's sharp details and the painter's smooth understanding and blends them together. Then, it projects this blended "super-vision" onto the 3D shape of the object.

  • Analogy: Imagine taking a high-resolution 2D map of a city (the detective's view) and a smooth, artistic painting of that same city (the painter's view), and then wrapping them both around a 3D globe. Now, every point on the 3D globe knows exactly what it is (a park, a street, a building) and where it is in space.
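The "wrapping onto the globe" step can be sketched as projecting each 3D point into the image and reading off the feature at that pixel. Again, this is a toy version of semantic lifting, not HeRO's implementation: the pinhole camera model, nearest-pixel sampling, and the intrinsics matrix `K` are assumptions for illustration.

```python
import numpy as np

def lift_features_to_points(points, feat_map, K):
    """Attach a 2D feature vector to each 3D point (toy "semantic lifting").

    points:   (N, 3) points in the camera frame (z > 0)
    feat_map: (H, W, D) fused 2D feature map
    K:        (3, 3) pinhole camera intrinsics
    Returns (N, D): each 3D point gets the feature of the pixel it projects to.
    """
    H, W, _ = feat_map.shape
    uvw = points @ K.T                                          # project to image plane
    u = np.clip((uvw[:, 0] / uvw[:, 2]).round().astype(int), 0, W - 1)
    v = np.clip((uvw[:, 1] / uvw[:, 2]).round().astype(int), 0, H - 1)
    return feat_map[v, u]                                       # nearest-pixel lookup

# Toy example: 2 points, an 8x8 feature map, and simple intrinsics.
K = np.array([[4.0, 0.0, 4.0], [0.0, 4.0, 4.0], [0.0, 0.0, 1.0]])
feat_map = np.random.rand(8, 8, 6)
pts = np.array([[0.0, 0.0, 1.0], [0.5, -0.25, 1.0]])
point_feats = lift_features_to_points(pts, feat_map, K)
print(point_feats.shape)  # (2, 6)
```

The result is the "labeled globe": a point cloud where every 3D point knows both where it is and what it semantically belongs to.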

3. The Brain's Organization: The "Global Manager" and "Local Specialists"

Once the robot has this smart 3D map, it needs to decide what to do. HeRO uses a Hierarchical Conditioning Module, which acts like a well-organized office:

  • The Global Manager: This looks at the whole object and the whole room. It says, "Okay, we have a shoe and a shelf. The general goal is to place the shoe."
  • The Local Specialists: This is the cool part. The robot breaks the object down into small chunks (like the toe, the heel, the side). It treats these chunks as a team of specialists.
    • The Twist: In the past, robots got confused if the "toe specialist" was listed first or second in the computer code. HeRO uses a Permutation-Invariant system.
    • Analogy: Imagine a team of doctors. It doesn't matter if Dr. Smith is introduced before Dr. Jones; they all work together to solve the problem. HeRO lets the robot look at the "toe part" and the "heel part" without caring about the order they appear in the code. This prevents the robot from getting confused if it sees a left shoe first or a right shoe first.
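The permutation-invariance idea can be shown with a one-line pooling sketch: mean pooling over part tokens ignores their order. The paper's actual conditioning module is presumably more sophisticated (e.g. attention-based); mean pooling here is an assumed stand-in that demonstrates the property.

```python
import numpy as np

def aggregate_parts(part_tokens: np.ndarray) -> np.ndarray:
    """Permutation-invariant aggregation of per-part feature tokens.

    part_tokens: (P, D) one feature vector per object part (toe, heel, side, ...).
    Mean pooling ignores the order of the rows, so listing the "toe specialist"
    first or last yields the same summary vector.
    """
    return part_tokens.mean(axis=0)

rng = np.random.default_rng(0)
parts = rng.random((3, 4))           # e.g. toe, heel, side tokens
shuffled = parts[[2, 0, 1]]          # same parts, different order
print(np.allclose(aggregate_parts(parts), aggregate_parts(shuffled)))  # True
```

Because the summary is identical no matter how the parts are ordered, the robot's decision cannot be thrown off by seeing the left shoe before the right one.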

4. The Result: A Robot That "Gets It"

When the robot tries to place the shoes:

  • Old Robots (G3Flow): Might grab the shoe by the middle and spin it randomly because they can't distinguish the toe from the heel.
  • HeRO: Looks at its 3D map, sees the "toe" label, and says, "Ah, the toe needs to point left." It grabs the heel and aligns the toe perfectly.

Why This Matters

The paper tested this on hard tasks like hanging a mug by its handle or placing two shoes side-by-side.

  • The Result: HeRO improved success rates by 12.3% on the shoe task and 6.5% on average across many tasks.
  • Real World: They even tested it on a real robot arm in a real lab, and it worked just as well as in the computer simulation.

In a nutshell: HeRO gives robots a "3D brain" that understands not just what an object looks like, but what its parts are for, allowing them to manipulate objects with the same care and precision a human would use.
