Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots

This paper introduces the Motion Turing Test framework and the HHMotion dataset for evaluating human-likeness in humanoid robots from kinematic data alone. The analysis reveals where current robot motion still deviates from human motion, and shows that a specialized baseline model outperforms multimodal large language models at automatically predicting human-likeness scores.

Mingzhe Li, Mengyin Liu, Zekai Wu, Xincheng Lin, Junsheng Zhang, Ming Yan, Zengye Xie, Changwang Zhang, Chenglu Wen, Lan Xu, Siqi Shen, Cheng Wang

Published 2026-03-09

Imagine you are at a party, and someone hands you a video of a dancer. You can't see their face, their clothes, or their skin—only a floating skeleton made of glowing lines. Your job is to guess: Is that a real human dancing, or is it a robot trying to look human?

If you can't tell the difference, the robot has passed the "Motion Turing Test."

This paper introduces exactly that test, along with a massive new dataset and a smart tool to help robots get better at it. Here's the breakdown in plain English:

1. The Big Idea: The "Skeleton" Test

For a long time, people have judged robots by how they look (do they have a human face? soft skin?). But this paper says, "Let's ignore the costume."

The researchers created a test where they strip away all the robot's "flesh" (metal shells, joints, wires) and turn every video into a SMPL-X skeleton. This is a 3D wireframe model that looks the same whether the motion came from a human or a robot.

  • The Goal: If a human judge looks at the wireframe and can't tell if it's a person or a machine, the robot wins.
  • The Reality Check: The researchers found that even though robots are improving, they still look "clunky" when they move fast or perform complex actions like boxing or jumping. Humans can spot the difference easily.

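The key trick is that the skeleton representation carries no embodiment information at all. Here is a minimal sketch of that idea; the joint count and data layout are illustrative assumptions, not the paper's exact SMPL-X specification:

```python
# Sketch (assumed layout): a motion clip reduced to an SMPL-X-style
# skeleton is just frames x joints x (x, y, z) coordinates.
# NUM_JOINTS is an assumed body-joint count for illustration only.
from dataclasses import dataclass
import random

NUM_JOINTS = 22

@dataclass
class SkeletonClip:
    frames: list  # each frame: a list of (x, y, z) tuples, one per joint

def random_clip(num_frames: int) -> SkeletonClip:
    """Generate a toy clip; a real pipeline would estimate poses from video."""
    return SkeletonClip(frames=[
        [(random.random(), random.random(), random.random())
         for _ in range(NUM_JOINTS)]
        for _ in range(num_frames)
    ])

# A human clip and a robot clip share the exact same representation:
# nothing in the data structure records which embodiment produced it.
human, robot = random_clip(90), random_clip(90)
assert len(human.frames) == len(robot.frames) == 90
assert len(human.frames[0]) == len(robot.frames[0]) == NUM_JOINTS
```

Because judges only ever see this wireframe, any difference they detect must come from how the skeleton moves, not how the body looks.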
2. The Dataset: The "Human-Humanoid Motion" (HHMotion) Library

To run this test, they needed a huge library of videos. They built HHMotion, which is like a massive training gym for AI.

  • The Content: They collected 1,000 video clips. Some are real humans, some are real robots (from big competitions like the World Robot Conference), and some are robots in computer simulations.
  • The Variety: They covered 15 different activities, from standing still and walking to running, dancing, and even fighting.
  • The Human Element: They hired 30 people to watch every single clip and rate it on a scale of 0 to 5.
    • 0: "This is definitely a robot. It moves like a stiff tin can."
    • 5: "I have no idea. That could be a human."
  • The Effort: This took over 500 hours of human watching and grading.
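The scoring setup above boils down to a simple aggregation: many raters score each clip from 0 to 5, and the clip's human-likeness is a summary of those ratings. A minimal sketch, assuming a plain mean as the aggregation rule (the paper may weight or filter ratings differently):

```python
# Toy sketch of the rating pipeline: each clip gets 0-5 scores from
# multiple raters; we validate the range and take the mean.
from statistics import mean

def human_likeness(ratings):
    """Aggregate per-rater 0-5 scores into one clip-level score."""
    assert all(0 <= r <= 5 for r in ratings), "ratings must lie in [0, 5]"
    return mean(ratings)

clip_ratings = [4, 5, 4, 3, 5, 4]  # illustrative ratings, not real data
score = human_likeness(clip_ratings)
assert 0 <= score <= 5
```

With 30 raters per clip across 1,000 clips, this is 30,000 individual judgments, which is why the labeling effort ran to hundreds of hours.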

3. The Discovery: Robots Are Good at Walking, Bad at Boxing

After analyzing the scores, the team found some interesting patterns:

  • The "Smooth" Wins: Robots are surprisingly good at simple, rhythmic things like walking or standing. These motions are repetitive and easy to program.
  • The "Chaos" Loses: Robots struggle with dynamic, high-energy moves like jumping, boxing, or playing ping-pong. These require quick, fluid adjustments and balance that robots just don't have yet. They tend to look jerky or stiff.
  • The Simulation Gap: Robots in computer simulations looked more "human" than real robots. This suggests that real-world physics (friction, balance, weight) is still a huge hurdle for engineers.

4. The New Tool: PTR-Net (The Robot Coach)

The researchers tried using fancy, giant AI models (like the ones that write essays or chat with you) to grade these robot movements. It didn't work well. These big models are great at reading text but bad at understanding the subtle "feel" of a moving skeleton.

So, they built a simpler, specialized tool called PTR-Net.

  • What it does: It looks at the motion data and predicts a score (0–5) just like a human would.
  • Why it's better: It's like a specialized coach who only cares about movement mechanics. It outperformed the giant "chatbot" style AI models.
  • The Future: This tool can be used to train robots. Instead of just telling a robot "don't fall," you can tell it, "Move more like a human, aim for a score of 4.5!"
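The core idea behind a specialized scorer like this can be sketched very simply: extract motion statistics from the skeleton sequence and regress them onto the 0–5 human scores. Everything below (the hand-crafted features, the linear model, the toy training data) is an illustrative stand-in for PTR-Net, not its actual architecture:

```python
# Hedged sketch of the "specialized coach" idea: predict a 0-5
# human-likeness score from simple motion statistics (mean speed and
# jerkiness) using plain gradient-descent linear regression.
import random

random.seed(0)

def features(positions):
    """positions: per-frame scalar joint positions (toy 1-D motion)."""
    vel = [b - a for a, b in zip(positions, positions[1:])]
    acc = [b - a for a, b in zip(vel, vel[1:])]
    mean_speed = sum(abs(v) for v in vel) / len(vel)
    jerkiness = sum(abs(a) for a in acc) / len(acc)
    return [1.0, mean_speed, jerkiness]  # bias term + two features

def fit(X, y, lr=0.05, steps=2000):
    """Stochastic gradient descent on squared error; stdlib only."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w

def predict(w, x):
    # Clamp to the 0-5 range used by the human judges.
    return max(0.0, min(5.0, sum(wj * xj for wj, xj in zip(w, x))))

# Toy training pair: steady motion labeled human-like, jittery motion not.
smooth = [0.1 * t for t in range(30)]
jerky = [0.1 * t + random.choice([-0.5, 0.5]) for t in range(30)]
w = fit([features(smooth), features(jerky)], [4.5, 1.0])
assert predict(w, features(smooth)) > predict(w, features(jerky))
```

The point of the sketch is the training signal, not the model: once a scorer maps motion to a number, that number can serve as a reward ("aim for 4.5") when training robot controllers.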

5. The "Uncanny Valley" Twist

One of the coolest parts of the study came when they asked humans to pretend to be robots.

  • They asked people to move stiffly and mechanically, mimicking how a robot moves.
  • The Result: The humans were so good at faking it that even the AI and human judges got confused! Sometimes, a human pretending to be a robot got the same score as a real robot.
  • The Lesson: This proves that "human-likeness" isn't just about smoothness; it's also about the intent and natural adaptability that humans have but robots lack.

Summary

This paper is a wake-up call for the robotics world. We've built robots that can walk and dance, but they still haven't mastered the "soul" of human movement. By creating a strict test (the Motion Turing Test) and a new scoring tool (PTR-Net), the authors are giving engineers a clear roadmap to build robots that don't just look human, but move like us.

In short: Robots are getting better, but they still move like they're walking on ice while humans are dancing on a trampoline. This new test helps them learn how to stop slipping and start dancing.