HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation

This paper presents HeRO, a diffusion-based policy for pose-aware robotic manipulation. It builds hierarchical semantic fields that fuse geometry-sensitive DINOv2 features with globally coherent Stable Diffusion correspondences, enabling the precise part-level semantic understanding needed for complex object handling, and achieves state-of-the-art performance.

Chongyang Xu, Shen Cheng, Haipeng Li, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu

Published 2026-02-24

Imagine you are teaching a robot to put a pair of shoes on a shelf. A simple robot might just see "two shoe-shaped blobs" and try to shove them anywhere. But a smart robot needs to know that the toe of the shoe must face left and the heel must face right. If it gets this wrong, the shoes look messy, and the task fails.

This is the problem the paper HeRO solves. It teaches robots to not just see the shape of an object, but to understand its parts and meaning, just like a human does.

Here is a simple breakdown of how they did it, using some everyday analogies:

1. The Problem: The "Blind Sculptor" vs. The "Art Critic"

Previous robot brains were like blind sculptors. They could feel the 3D shape of an object (like a point cloud) very well, but they didn't know what the parts were called. They knew "this is a bump" and "this is a curve," but they couldn't tell the difference between a shoe's toe and its heel.

Other robots were like art critics looking at a 2D photo. They knew the words "toe" and "heel," but because they were looking at a flat picture, they lost the 3D depth needed to grab the object correctly.

HeRO combines the best of both worlds. It gives the robot a 3D map that is also labeled with meaning.

2. The Secret Sauce: Mixing Two Super-Brains

To build this smart map, HeRO mixes two different types of "AI brains" (called Foundation Models):

  • Brain A (DINOv2): Think of this as a detective. It is great at spotting tiny, specific details. It can tell, "That pixel is definitely the lace," or "That pixel is the sole." It's very precise but can sometimes be a bit jittery or inconsistent.
  • Brain B (Stable Diffusion): Think of this as a painter. It understands the big picture and how things flow together. It knows that a shoe is a single, smooth object, even if the lighting changes. It's very smooth but sometimes misses tiny details.
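To make the "two brains" idea concrete, here is a minimal sketch of blending two per-pixel feature maps. This is an illustration only, not the paper's actual fusion module: the plain normalize-then-concatenate scheme, the array shapes, and the feature dimensions are all assumptions.

```python
import numpy as np

def fuse_features(dino_feats: np.ndarray, sd_feats: np.ndarray) -> np.ndarray:
    """Blend per-pixel features from two vision backbones.

    dino_feats: (H, W, D1) sharp, detail-oriented features (the "detective")
    sd_feats:   (H, W, D2) smooth, globally coherent features (the "painter")
    Returns an (H, W, D1 + D2) fused map. Each map is L2-normalized first so
    neither brain dominates just because its raw activations are larger.
    """
    def l2_normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    return np.concatenate([l2_normalize(dino_feats), l2_normalize(sd_feats)], axis=-1)

# Toy example: a 4x4 image with 8-dim "DINOv2-like" and 4-dim "SD-like" features.
fused = fuse_features(np.random.rand(4, 4, 8), np.random.rand(4, 4, 4))
print(fused.shape)  # (4, 4, 12)
```

After this step, every pixel carries one vector that combines the detective's sharp details with the painter's smooth, global view.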

The Magic Step (Dense Semantic Lifting):
HeRO takes the detective's sharp details and the painter's smooth understanding and blends them together. Then, it projects this blended "super-vision" onto the 3D shape of the object.

  • Analogy: Imagine taking a high-resolution 2D map of a city (the detective's view) and a smooth, artistic painting of that same city (the painter's view), and then wrapping them both around a 3D globe. Now, every point on the 3D globe knows exactly what it is (a park, a street, a building) and where it is in space.
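The "wrapping onto the globe" step can be sketched as projecting each 3D point into the image and reading off the feature at that pixel. Again, this is a toy version of semantic lifting, not HeRO's implementation: the pinhole camera model, nearest-pixel sampling, and the intrinsics matrix `K` are assumptions for illustration.

```python
import numpy as np

def lift_features_to_points(points, feat_map, K):
    """Attach a 2D feature vector to each 3D point (toy "semantic lifting").

    points:   (N, 3) points in the camera frame (z > 0)
    feat_map: (H, W, D) fused 2D feature map
    K:        (3, 3) pinhole camera intrinsics
    Returns (N, D): each 3D point gets the feature of the pixel it projects to.
    """
    H, W, _ = feat_map.shape
    uvw = points @ K.T                                          # project to image plane
    u = np.clip((uvw[:, 0] / uvw[:, 2]).round().astype(int), 0, W - 1)
    v = np.clip((uvw[:, 1] / uvw[:, 2]).round().astype(int), 0, H - 1)
    return feat_map[v, u]                                       # nearest-pixel lookup

# Toy example: 2 points, an 8x8 feature map, and simple intrinsics.
K = np.array([[4.0, 0.0, 4.0], [0.0, 4.0, 4.0], [0.0, 0.0, 1.0]])
feat_map = np.random.rand(8, 8, 6)
pts = np.array([[0.0, 0.0, 1.0], [0.5, -0.25, 1.0]])
point_feats = lift_features_to_points(pts, feat_map, K)
print(point_feats.shape)  # (2, 6)
```

The result is the "labeled globe": a point cloud where every 3D point knows both where it is and what it semantically belongs to.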

3. The Brain's Organization: The "Global Manager" and "Local Specialists"

Once the robot has this smart 3D map, it needs to decide what to do. HeRO uses a Hierarchical Conditioning Module, which acts like a well-organized office:

  • The Global Manager: This looks at the whole object and the whole room. It says, "Okay, we have a shoe and a shelf. The general goal is to place the shoe."
  • The Local Specialists: This is the cool part. The robot breaks the object down into small chunks (like the toe, the heel, the side). It treats these chunks as a team of specialists.
    • The Twist: In the past, robots got confused if the "toe specialist" was listed first or second in the computer code. HeRO uses a Permutation-Invariant system.
    • Analogy: Imagine a team of doctors. It doesn't matter if Dr. Smith is introduced before Dr. Jones; they all work together to solve the problem. HeRO lets the robot look at the "toe part" and the "heel part" without caring about the order they appear in the code. This prevents the robot from getting confused if it sees a left shoe first or a right shoe first.
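The permutation-invariance idea can be shown with a one-line pooling sketch: mean pooling over part tokens ignores their order. The paper's actual conditioning module is presumably more sophisticated (e.g. attention-based); mean pooling here is an assumed stand-in that demonstrates the property.

```python
import numpy as np

def aggregate_parts(part_tokens: np.ndarray) -> np.ndarray:
    """Permutation-invariant aggregation of per-part feature tokens.

    part_tokens: (P, D) one feature vector per object part (toe, heel, side, ...).
    Mean pooling ignores the order of the rows, so listing the "toe specialist"
    first or last yields the same summary vector.
    """
    return part_tokens.mean(axis=0)

rng = np.random.default_rng(0)
parts = rng.random((3, 4))           # e.g. toe, heel, side tokens
shuffled = parts[[2, 0, 1]]          # same parts, different order
print(np.allclose(aggregate_parts(parts), aggregate_parts(shuffled)))  # True
```

Because the summary is identical no matter how the parts are ordered, the robot's decision cannot be thrown off by seeing the left shoe before the right one.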

4. The Result: A Robot That "Gets It"

When the robot tries to place the shoes:

  • Old Robots (G3Flow): Might grab the shoe by the middle and spin it randomly because they can't distinguish the toe from the heel.
  • HeRO: Looks at its 3D map, sees the "toe" label, and says, "Ah, the toe needs to point left." It grabs the heel and aligns the toe perfectly.

Why This Matters

The paper tested this on hard tasks like hanging a mug by its handle or placing two shoes side-by-side.

  • The Result: HeRO improved success rates by 12.3% on the shoe task and 6.5% on average across many tasks.
  • Real World: They even tested it on a real robot arm in a real lab, and it worked just as well as in the computer simulation.

In a nutshell: HeRO gives robots a "3D brain" that understands not just what an object looks like, but what its parts are for, allowing them to manipulate objects with the same care and precision a human would use.
