Original authors: Kyungmin Lee, Sibeen Kim, Youngdo Lee, Minho Park, Hyunseung Kim, Dongyoon Hwang, Donghu Kim, Hojoon Lee, Jaegul Choo

Published 2026-06-05

CC BY 4.0

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you want to teach a robot to dance or walk like a human. The old way was to hire actors to perform in a studio while wearing special suits that record their movements (Motion Capture). The quality is excellent, but it's like having a library with only a few books on "walking" and "reaching." You can't teach a robot complex, fun moves if you don't have the data.

Recently, scientists tried a new trick: they grabbed millions of videos from the internet and used AI to turn them into robot instructions. This is like having a massive library with millions of books, but the problem is that the books are full of typos, missing pages, and impossible physics. If you teach a robot with these "bad books," the robot might try to walk through walls, float in the air like a ghost, or slide its feet across the floor like it's on ice.

Enter PHUMA.

The researchers at KAIST created a new dataset called PHUMA (Physically Reliable HUMAnoid locomotion dataset). Think of PHUMA as a "Quality Control Department" for robot training data. They took that massive pile of internet videos and ran them through a two-step filter to make them safe and realistic for robots.

Here is how they did it, using simple analogies:

Step 1: The "Bouncer" (Physics-Aware Curation)

Before the data even gets to the robot, PHUMA acts like a strict bouncer at a club.

The Problem: Internet videos are messy. Sometimes the camera moves weirdly, making a person look like they are floating or sinking into the floor. Sometimes a person is sitting on a chair, but the robot doesn't have a chair to sit on.
The Fix: PHUMA scans the videos and throws out the "bad" clips. If a person looks like they are floating, sliding, or doing something the robot physically can't do (like sitting on thin air), that clip gets deleted. They also smooth out the "jittery" movements, like fixing a shaky video camera.
The Result: From a huge pool of data, they ended up with 73 hours of motion that passes these physical checks.

Step 2: The "Tailor" (PhySINK Retargeting)

Even if the video is good, a human body is shaped differently than a robot body. The limbs have different proportions, and the robot's joints can only bend within fixed mechanical limits.

The Problem: If you just copy-paste a human's pose onto a robot, the robot might twist its joints until they break (joint violation) or try to walk with its feet halfway through the floor (penetration).
The Fix: The researchers built a special tool called PhySINK. Imagine a master tailor who doesn't just stretch a suit to fit a new person; they actually re-sew the seams to fit the new body's shape while making sure the fabric doesn't rip.
- The "No-Floating" Rule: The tailor makes sure the robot's feet actually touch the ground.
- The "No-Skating" Rule: The tailor ensures the robot doesn't slide its feet like a hockey player when it's supposed to be standing still.
- The "No-Breaking" Rule: The tailor checks that the robot's joints don't bend in ways that would snap them.

The Results: Why It Matters

The paper tested this new dataset on two humanoid robots, the Unitree G1 and H1-2, in simulation, and then on a real Unitree G1.

The Competition: They compared PHUMA against the old "small but perfect" datasets (like AMASS) and the "huge but messy" internet datasets (like Humanoid-X).
The Winner: The robots trained with PHUMA came out on top in every motion category tested.
- They succeeded more often at copying new movements they had never seen before.
- They didn't fall over or glitch out as much.
- Real-World Test: When they put the robot trained on PHUMA into the real world (not just a computer simulation), it walked much more smoothly and made fewer mistakes than robots trained on the other datasets.

The Bottom Line

The paper argues that for robots to move naturally, data quality matters more than sheer quantity. You can have a million videos, but if they contain impossible physics, the robot will learn to fail. PHUMA shows that by carefully filtering internet videos and using a "smart tailor" to fit each motion to the robot's body, you can build a large, high-quality library of movements that helps robots walk, turn, and balance in a more humanlike way.

What the paper does not claim:

It does not claim this will help robots do surgery or interact with complex objects (like opening a fridge) yet. The focus is strictly on locomotion (walking, turning, balancing).
It does not claim the robots are perfect; they still have some small errors, but they are significantly better than before.
It does not claim this works on uneven ground (like stairs or rocks) yet; it focuses on flat surfaces.

Technical Summary: PHUMA - Physically Reliable Humanoid Locomotion Dataset

1. Problem Statement

Humanoid robots require reliable and natural locomotion to function as general-purpose embodied AI. While reinforcement learning (RL) with task-oriented rewards has succeeded in quadrupeds, applying it to humanoids often results in gaits that are effective but lack humanlike coordination. Motion imitation, which trains policies to replicate human movements, has emerged as a promising solution. However, this paradigm is fundamentally constrained by the scale, diversity, and physical feasibility of available human motion data.

Existing datasets face a trade-off:

High-quality Motion Capture (e.g., AMASS, LaFAN1): These provide physically feasible motions but are limited in scale and diversity, often dominated by simple actions like walking and reaching.
Large-scale Internet Video (e.g., Humanoid-X): Recent methods convert vast amounts of internet video into motion data to increase scale. However, these pipelines suffer from severe physical artifacts, including:
- Global translation errors: Causing "floating" (feet above ground) or "penetration" (feet below ground).
- Retargeting failures: Prioritizing joint alignment over physical plausibility leads to joint violations and foot skating.
- Instability: Artifacts like root jitter or motions requiring external objects (e.g., sitting on chairs) that do not exist in the robot's environment.

These artifacts hinder stable imitation learning, as policies trained on physically invalid data struggle to transfer to real-world hardware.

2. Methodology

The authors propose PHUMA (Physically Reliable HUMAnoid locomotion dataset), constructed via a two-stage pipeline designed to scale internet video while enforcing physical constraints.

2.1 Physics-Aware Motion Curation

Before retargeting, raw motion data (derived from Humanoid-X and other sources) undergoes a filtering process to remove physically invalid sequences:

Jitter and Instability Removal: A low-pass Butterworth filter smooths high-frequency jitter. Sequences with excessive jerk or a Center of Mass (CoM) falling outside the base of support (indicating instability, such as sitting on a non-existent chair) are discarded.
Ground Contact Correction: Since video-derived motions often lack a fixed ground reference, the pipeline estimates a global ground plane using majority voting on foot mesh vertex heights.
Clip-Level Segmentation: Instead of discarding entire sequences, the pipeline segments data into 4-second clips. Clips exhibiting excessive jerk, CoM instability, or insufficient foot-ground contact (floating) are removed, preserving valid segments within otherwise flawed sequences.
Data Augmentation: Curated video data is combined with high-quality motion capture data from LaFAN1, LocoMuJoCo, and proprietary video captures.
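The curation steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names, the 1 cm height bins, and the jerk and contact thresholds are hypothetical, and the Butterworth smoothing and CoM check are omitted. Only the 4-second clip length comes from the description above.

```python
from collections import Counter


def estimate_ground_height(foot_heights, bin_size=0.01):
    """Majority vote: return the centre of the most populated height bin.

    bin_size (metres) is an assumed value, not taken from the paper.
    """
    bins = Counter(round(h / bin_size) for h in foot_heights)
    best_bin, _ = bins.most_common(1)[0]
    return best_bin * bin_size


def max_jerk(positions, fps):
    """Largest absolute third finite difference of a 1-D trajectory."""
    dt = 1.0 / fps
    jerks = [
        abs(positions[i + 3] - 3 * positions[i + 2]
            + 3 * positions[i + 1] - positions[i]) / dt ** 3
        for i in range(len(positions) - 3)
    ]
    return max(jerks, default=0.0)


def curate_clips(root_z, foot_z, fps, clip_seconds=4.0,
                 jerk_limit=50.0, contact_tol=0.03, min_contact_ratio=0.5):
    """Split a sequence into fixed-length clips and keep those that pass.

    root_z: per-frame root height; foot_z: per-frame lowest foot height.
    All thresholds are hypothetical placeholders.
    Returns a list of (start_frame, end_frame) pairs for the kept clips.
    """
    ground = estimate_ground_height(foot_z)
    clip_len = int(clip_seconds * fps)
    kept = []
    for start in range(0, len(root_z) - clip_len + 1, clip_len):
        end = start + clip_len
        in_contact = [abs(z - ground) <= contact_tol for z in foot_z[start:end]]
        contact_ratio = sum(in_contact) / clip_len
        smooth = max_jerk(root_z[start:end], fps) <= jerk_limit
        if smooth and contact_ratio >= min_contact_ratio:
            kept.append((start, end))
    return kept
```

Working at the clip level means a single floating or jittery stretch removes only the 4-second window it falls in, so the rest of the sequence is still available for training.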

2.2 Physics-Constrained Motion Retargeting (PhySINK)

To address artifacts introduced during the mapping of human motion to the robot's morphology, the authors introduce PhySINK (Physically constrained Shape-adaptive Inverse Kinematics). This method augments standard Shape-adaptive Inverse Kinematics (SINK) with physical-constraint terms; the resulting objective has four loss terms, optimized jointly:

Motion Fidelity Loss ( $\mathcal{L}_{Fidelity}$ ): Minimizes per-joint position and per-link orientation errors to preserve the source motion style.
Joint Feasibility Loss ( $\mathcal{L}_{Feasibility}$ ): Penalizes joint angles and velocities that exceed the robot's mechanical limits, preventing over-extension.
Grounding Loss ( $\mathcal{L}_{Ground}$ ): Aligns foot height with the estimated ground plane during contact frames, eliminating floating and penetration.
Skating Loss ( $\mathcal{L}_{Skate}$ ): Suppresses horizontal foot velocity during contact frames to prevent foot sliding.

The total objective is:
$\mathcal{L}_{PhySINK} = \mathcal{L}_{Fidelity} + w_{Feasibility}\mathcal{L}_{Feasibility} + w_{Ground}\mathcal{L}_{Ground} + w_{Skate}\mathcal{L}_{Skate}$
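To make the weighted sum concrete, here is a minimal Python sketch of how the three constraint terms might be computed and combined. The squared-penalty forms, the function names, and the default weights are assumptions chosen for illustration; the summary above does not give the exact formulas. The feasibility term here covers only joint angles, not velocities.

```python
def feasibility_loss(joint_angles, lower, upper):
    """Squared amount by which each joint angle leaves [lower, upper]."""
    total = 0.0
    for q, lo, hi in zip(joint_angles, lower, upper):
        violation = max(lo - q, 0.0) + max(q - hi, 0.0)
        total += violation ** 2
    return total


def grounding_loss(foot_heights, contacts, ground=0.0):
    """Squared foot-to-ground distance, counted only on contact frames."""
    return sum((z - ground) ** 2
               for z, c in zip(foot_heights, contacts) if c)


def skating_loss(foot_xy, contacts, fps):
    """Squared horizontal foot speed between consecutive contact frames."""
    total = 0.0
    for t in range(1, len(foot_xy)):
        if contacts[t] and contacts[t - 1]:
            vx = (foot_xy[t][0] - foot_xy[t - 1][0]) * fps
            vy = (foot_xy[t][1] - foot_xy[t - 1][1]) * fps
            total += vx ** 2 + vy ** 2
    return total


def physink_objective(fidelity, feasibility, ground, skate,
                      w_feasibility=1.0, w_ground=1.0, w_skate=1.0):
    """Weighted sum matching the structure of the objective above.

    The weight values are placeholders, not the paper's settings.
    """
    return (fidelity
            + w_feasibility * feasibility
            + w_ground * ground
            + w_skate * skate)
```

Because the grounding and skating terms are masked by the contact flags, they leave the swing phase of a step untouched and only constrain the foot while it is supposed to be planted.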

3. Key Contributions

PHUMA Dataset: A 73-hour corpus of physically reliable humanoid locomotion data, aggregating motion capture and curated internet videos. It is about 3.5x larger than AMASS and exceeds the physically feasible portion of Humanoid-X (69.1h usable out of 231.4h).
PhySINK Algorithm: A novel retargeting method that explicitly enforces joint limits, ground contact, and anti-skating constraints, significantly reducing physical artifacts compared to standard IK, GMR, and SINK methods.
Physics-Aware Curation Pipeline: A systematic approach to filtering and correcting video-derived motions, demonstrating that data quality (via curation) outweighs raw quantity.

4. Experimental Results

The authors evaluated PHUMA across four axes using the Unitree G1 and H1-2 humanoids in simulation (via MaskedMimic) and real-world deployment (via BeyondMimic).

Impact of Curation: Ablation studies showed that applying physics-aware filters to Humanoid-X data significantly improved downstream motion tracking success rates, even when the dataset size was reduced by ~75%.
Retargeting Performance: PhySINK outperformed IK, GMR, and SINK across all five physical reliability metrics (Motion Fidelity, Joint Feasibility, Non-Floating, Non-Penetration, Non-Skating). Notably, PhySINK was the only method to achieve high scores in both fidelity and physical constraints simultaneously.
Dataset Comparison: Policies trained on PHUMA achieved the highest success rates across all motion categories (stationary, angular, vertical, horizontal) compared to policies trained on LaFAN1, AMASS, and Humanoid-X.
- On unseen motions, PHUMA-trained policies achieved a 92.7% success rate in simulation, compared to 76.2% for AMASS and 50.6% for Humanoid-X.
Real-World Transfer: In zero-shot deployment on a real Unitree G1, the PHUMA-trained policy demonstrated a 16.3% lower tracking error (MPJPE) compared to the AMASS-trained policy. Qualitative results confirmed more faithful motion tracking and fewer physical violations.
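For readers unfamiliar with the tracking metric, MPJPE (mean per-joint position error) is the average distance between each tracked joint and its reference position. The sketch below shows one common way to compute it, along with the relative reduction used when comparing two policies. The function names and data layout are assumptions, and the paper's exact evaluation protocol may differ.

```python
import math


def mpjpe(predicted, reference):
    """Mean Euclidean distance between matching joints over all frames.

    Both arguments are lists of frames; each frame is a list of
    (x, y, z) joint positions in the same order.
    """
    distances = [
        math.dist(p, r)
        for pred_frame, ref_frame in zip(predicted, reference)
        for p, r in zip(pred_frame, ref_frame)
    ]
    return sum(distances) / len(distances)


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error * 100.0
```

Read this way, a 16.3% lower MPJPE means the PHUMA-trained policy's error is about 0.837 times that of the AMASS-trained policy.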

5. Significance and Claims

The paper claims that the primary bottleneck in humanoid motion imitation is not merely the quantity of data, but its physical reliability. While previous large-scale datasets (like Humanoid-X) suffer from artifacts that prevent stable learning, and high-quality datasets (like AMASS) lack diversity, PHUMA successfully bridges this gap.

The authors conclude that:

Data Quality is Paramount: Filtering and correcting physical violations in training data yields better policy performance than simply increasing data volume.
Retargeting Matters: Enforcing physical constraints during the retargeting stage (via PhySINK) is critical for generating training data that leads to stable, real-world locomotion.
Scalability with Reliability: It is possible to scale humanoid motion datasets using internet videos without sacrificing physical plausibility, provided a rigorous curation and constraint-based retargeting pipeline is employed.

The work establishes that advancing humanoid motion tracking depends on having physically reliable data, a standard that PHUMA aims to set for future research in embodied AI.

PHUMA: Physically Reliable Humanoid Locomotion Dataset