OpenHEART: Opening Heterogeneous Articulated Objects with a Legged Manipulator

This paper presents OpenHEART, a robust and sample-efficient framework that enables legged manipulators to open diverse heterogeneous articulated objects by utilizing Sampling-based Abstracted Feature Extraction (SAFE) for compact geometric encoding and an Articulation Information Estimator (ArtIEst) for adaptive state estimation.

Seonghyeon Lim, Hyeonwoo Lee, Seunghyun Lee, I Made Aswin Nahrendra, Hyun Myung

Published 2026-03-09

Imagine you have a robot dog with a robotic arm. It's a super-athlete, capable of walking over rough terrain and grabbing things. Now, imagine you ask this robot to open a bunch of different doors, drawers, and cabinets in a messy house.

Some doors swing on hinges (like a normal door), some drawers slide out (like a kitchen drawer), and some cabinets have weirdly shaped handles. The problem is that every single one of these objects is different. A robot trained to open a round doorknob might get confused by a long horizontal handle, or a sliding drawer might baffle a robot expecting a swinging door.

This paper, OpenHEART, introduces a new "brain" for this robot dog that lets it figure out how to open any of these weird, mixed-up objects without needing a specific manual for each one.

Here is how they did it, broken down into simple concepts:

1. The Problem: Too Much Noise, Not Enough Clarity

Most robots try to "see" the world using high-definition cameras or 3D scanners, creating a massive cloud of millions of dots (point clouds).

  • The Analogy: Imagine trying to learn how to drive a car by staring at a 4K video of the entire city, including every tree, cloud, and pedestrian. It's too much information! The robot gets overwhelmed and takes forever to learn the simple rule: "Turn the wheel to go left."
  • The Issue: Legged manipulators (like the robot dog with an arm) already need a huge amount of trial and error to learn whole-body control, so every training step has to be cheap. Processing millions of raw points at each step makes learning too slow and sample-inefficient.

2. The Solution: The "Sketch" Method (SAFE)

The authors created a system called SAFE (Sampling-based Abstracted Feature Extraction). Instead of showing the robot the whole detailed object, they teach it to look at a simple "sketch."

  • The Analogy: Instead of giving the robot a photo of a complex door, they give it a simple box drawn around the handle and a box around the door panel. They ask the robot to ignore the wood grain, the color, and the scratches, and just focus on: "How long is the handle? Is the panel tall or wide?"
  • The Magic Trick: To make sure the robot doesn't just memorize the specific training doors, they randomly "shuffle" the dots inside these boxes. It's like giving the robot a puzzle where the pieces are always in a different order. This forces the robot to learn the shape of the object, not just the specific pattern of pixels. This makes the robot much better at handling objects it has never seen before.
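The box-sample-shuffle idea can be illustrated with a minimal Python sketch. Everything here is an illustrative assumption rather than the paper's actual SAFE encoding: the function name, the sample size, and the 6-D box summary (center plus extents) are all made up to show the principle, which is that shuffling the sampled points forces any downstream encoder to depend only on order-invariant shape statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

def safe_features(points, n_sample=32):
    """Toy 'sketch' of one object part: sample a fixed number of points,
    shuffle their order, and summarize the part by its bounding box."""
    # Sample with replacement so any input cloud size maps to n_sample points.
    idx = rng.integers(0, len(points), size=n_sample)
    sample = points[idx]
    rng.shuffle(sample, axis=0)  # random order: the encoder can't memorize point layout
    lo, hi = sample.min(axis=0), sample.max(axis=0)
    center, extent = (lo + hi) / 2, hi - lo
    return np.concatenate([center, extent])  # 6-D summary: where it is, how big it is

# Toy parts: a long horizontal handle vs. a tall door panel.
handle = rng.uniform([-0.15, -0.01, -0.01], [0.15, 0.01, 0.01], size=(500, 3))
panel  = rng.uniform([-0.40, -0.02, -1.00], [0.40, 0.02, 1.00], size=(2000, 3))
print(safe_features(handle))  # extent reveals the handle is long along x
print(safe_features(panel))   # extent reveals the panel is tall along z
```

The same 6 numbers come out whether the handle's points arrive in any order, which is exactly the "shape, not pixel pattern" property the analogy describes.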

3. The "Gut Feeling" Sensor (ArtIEst)

Sometimes, just looking at an object isn't enough. A cabinet might look like it opens left, right, or down, and the robot can't tell just by looking.

  • The Analogy: Imagine you are trying to open a jar. You look at it, but you aren't sure if the lid is stuck or if you're turning it the wrong way. So, you grab it and give it a little wiggle. Your hand (proprioception) tells you more than your eyes (exteroception) did.
  • How it Works: The robot has a smart system called ArtIEst.
    • Phase 1 (Eyes): Before touching the object, it guesses the direction based on the "sketch" (SAFE).
    • Phase 2 (Hands): Once it grabs the handle, it uses its own body sensors to feel the movement.
    • The Switch: A "Belief Gate" acts like a traffic cop. If the robot is just looking, it trusts its eyes. As soon as it grabs the handle and starts feeling resistance, it switches to trusting its hands. This helps it correct mistakes instantly.
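A toy version of that switch might look like the following. The function name `gated_estimate`, the force threshold, and the linear blend are hypothetical simplifications, not the paper's actual Belief Gate (which is learned): the point is only that the estimated articulation direction shifts from the vision-based guess to the proprioceptive one as the gripper starts feeling resistance.

```python
import numpy as np

def gated_estimate(vision_dir, proprio_dir, contact_force, force_threshold=2.0):
    """Toy belief gate: blend the vision-based articulation direction with
    the proprioceptive one. Before contact the gate trusts vision; once the
    gripper feels resistance it shifts toward proprioception."""
    # Gate in [0, 1]: 0 = pure vision (Phase 1), 1 = pure proprioception (Phase 2).
    gate = min(1.0, max(0.0, contact_force / force_threshold))
    blended = (1 - gate) * vision_dir + gate * proprio_dir
    return blended / np.linalg.norm(blended)

vision = np.array([1.0, 0.0, 0.0])   # "looks like it slides forward"
proprio = np.array([0.0, 1.0, 0.0])  # "actually swings sideways"
print(gated_estimate(vision, proprio, contact_force=0.0))  # not touching: trust the eyes
print(gated_estimate(vision, proprio, contact_force=5.0))  # gripping hard: trust the hands
```

This is why a wrong visual guess gets corrected instantly: the moment the felt motion disagrees with the predicted one, the gate has already handed control to the more reliable signal.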

4. The Result: A Versatile Robot

The team tested this on a real robot dog with an arm.

  • The Test: They threw it a mix of revolute doors (swinging) and prismatic drawers (sliding) with all sorts of handle shapes.
  • The Outcome: The robot didn't need a new program for each object. It used one single "brain" to figure out: "Ah, this handle is long and horizontal, so I need to pull straight out. That other one is a knob on the side, so I need to push and turn."
  • Real World Success: They even tested it on a real drawer in a real room. The robot missed the handle on the first try, but instead of giving up, it realized, "Oops, I missed," grabbed it again, and successfully opened it.
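That recover-and-retry behavior can be sketched as a simple loop. To be clear, `try_grasp`, `pull`, and `max_attempts` are hypothetical placeholders, and the real robot runs a learned policy rather than a scripted loop; the sketch just shows the logic of using proprioceptive feedback ("did I actually feel the handle?") to decide whether to re-attempt.

```python
def open_with_retries(try_grasp, pull, max_attempts=3):
    """Toy retry loop: if a grasp attempt misses (no resistance felt),
    re-detect the handle and try again instead of giving up."""
    for _ in range(max_attempts):
        if try_grasp():  # proprioception confirms the handle is actually held
            pull()
            return True
        # Missed the handle: fall through and re-attempt the grasp.
    return False
```

A grasp that fails once and succeeds on the second attempt, as in the real-world drawer test, returns `True` after two calls to `try_grasp`.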

Summary

OpenHEART is like teaching a robot to be a handyman. Instead of memorizing the instructions for every single door in the world, the robot learns to look at the shape of the handle and the feeling of the movement. It ignores the clutter, focuses on what matters, and adapts on the fly, making it a true master of opening things in a chaotic, real-world environment.