KPM-Bench: A Kinematic Parsing Motion Benchmark for Fine-grained Motion-centric Video Understanding

This paper introduces KPM-Bench, an open-source dataset with fine-grained motion annotations and a dedicated hallucination evaluation set. Alongside it, the authors propose the MoPE algorithm and a GRPO-based training framework that together address two limitations of current video captioning models: describing complex human motions accurately and avoiding motion-related hallucinations.

Boda Lin, Yongjie Zhu, Xiaocheng Gong, Wenyu Qin, Meng Wang

Published 2026-02-23

Imagine you are watching a dance performance. A standard video description might say, "A woman is dancing in a garden." It's true, but it's like describing a symphony by just saying, "Music is playing." It misses the magic: the way her left arm sweeps up, the rhythm of her feet, the specific bend of her knee.

Current AI models are great at spotting the "music" (the general scene), but they often struggle to describe the "dance moves" in detail. Worse, they sometimes start hallucinating—inventing moves that never happened, like saying she spun around when she actually just stepped forward.

This paper introduces KPM-Bench, a new toolkit designed to teach AI to become a choreographer, not just a spectator. Here's how they did it, broken down into simple concepts:

1. The "Robot Coach" (Kinematic Parsing)

Instead of just asking an AI to "look and guess," the researchers gave the AI a robot coach.

  • How it works: Before the AI writes a single word, they run the video through a system that maps the human skeleton (like a stick-figure drawing) and calculates the physics of the movement.
  • The Analogy: Imagine a sports analyst who doesn't just watch the game but has a stopwatch and a protractor for every single joint. The system measures: How fast did that arm move? What was the angle of that knee? Was the movement rhythmic or jerky?
  • The Result: The AI isn't guessing; it's reading a "physics report" of the dance. It knows exactly when the left leg lifted and how fast the right hand swung.
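The paper doesn't spell out the parser's internals here, but the "stopwatch and protractor" idea maps onto simple geometry over skeleton keypoints. A minimal sketch, assuming 2D joint coordinates per frame (the function names and the tiny keypoint format are illustrative, not the paper's API):

```python
import math

def joint_angle(a, b, c):
    """Protractor: the angle (degrees) at joint b, formed by points a-b-c.
    E.g. hip-knee-ankle gives the knee bend: ~180 for a straight leg."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def joint_speed(p_prev, p_curr, dt):
    """Stopwatch: how fast a keypoint moved between two frames (units/sec)."""
    return math.hypot(p_curr[0] - p_prev[0], p_curr[1] - p_prev[1]) / dt

# A right-angle knee bend: hip at (0,0), knee at (0,1), ankle at (1,1)
bend = joint_angle((0.0, 0.0), (0.0, 1.0), (1.0, 1.0))  # 90.0 degrees
# A wrist moving from (0,0) to (3,4) in one second
speed = joint_speed((0.0, 0.0), (3.0, 4.0), 1.0)  # 5.0 units/sec
```

Run per joint, per frame, these numbers become the "physics report" the captioner reads instead of guessing from pixels.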

2. The "Grammar of Motion" (PaMoR)

Once the robot coach has the numbers, the AI needs to turn them into a story. The researchers created a special language structure called PaMoR (Parsing-based Motion Representation).

  • The Analogy: Think of this as a Lego instruction manual for describing movement. Instead of a messy paragraph, the AI breaks the dance down into specific blocks:
    • Who is moving? (The dancer)
    • What are they doing? (Raising an arm)
    • How are they doing it? (Slowly, gracefully)
    • Where are they going? (Upward)
  • By forcing the AI to fill in these "Lego blocks" first, the final description becomes structured and detailed, and far harder to get wrong on the basic facts.
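The exact PaMoR schema isn't given in this summary, but the four "Lego blocks" above suggest a slot-filling structure. A minimal sketch of what one motion block might look like (field names are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class MotionUnit:
    """One PaMoR-style 'Lego block': structured slots the model must
    fill before any free-form prose is generated (illustrative)."""
    subject: str    # who is moving
    action: str     # what they are doing
    manner: str     # how they are doing it
    direction: str  # where the motion is going

    def render(self) -> str:
        """Compose the slots into one grounded sentence."""
        return f"{self.subject} is {self.action} {self.manner}, moving {self.direction}."

unit = MotionUnit(
    subject="the dancer",
    action="raising her left arm",
    manner="slowly and gracefully",
    direction="upward",
)
sentence = unit.render()
```

Because every sentence is composed from filled slots, a downstream checker can verify each slot independently against the kinematic report rather than parsing free text.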

3. The "Hallucination Police" (MoPE)

Even with a robot coach, AI can still get creative in the wrong way. To fix this, they built a Hallucination Police algorithm called MoPE.

  • The Analogy: Imagine a strict editor who reads the AI's story and checks it against the "physics report" line-by-line.
    • AI says: "She spun three times."
    • MoPE checks the report: "Wait, the report says she only turned 90 degrees."
    • MoPE says: "Delete that! That's a lie."
  • They used this police officer to train the AI using a special reward system (GRPO). Every time the AI told the truth about the motion, it got a gold star. Every time it made up a move, it lost points. Eventually, the AI learned that accuracy is the only way to win.
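The summary doesn't detail MoPE's scoring or the GRPO objective, but the line-by-line check against the physics report implies a verifiable reward signal. A toy sketch of that idea (the tolerance and reward values are assumptions, not the paper's):

```python
def motion_reward(claimed_turn_deg: float,
                  measured_turn_deg: float,
                  tolerance: float = 15.0) -> float:
    """Toy MoPE-style check: +1 'gold star' if the described rotation
    matches the measured one within tolerance, -1 for a hallucination."""
    if abs(claimed_turn_deg - measured_turn_deg) <= tolerance:
        return 1.0
    return -1.0

# "She spun three times" (3 x 360 = 1080 degrees) vs. a measured 90-degree turn
penalty = motion_reward(1080.0, 90.0)  # -1.0: delete that, that's a lie
# "She made a quarter turn" (~90 degrees) vs. the same measurement
reward = motion_reward(95.0, 90.0)     # +1.0: accurate, keep it
```

In GRPO-style training, rewards like this would be computed over groups of sampled captions, so the model is pushed toward the truthful ones relative to its own alternatives.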

4. The New Playground (KPM-Bench)

Finally, they built a massive new playground called KPM-Bench to test these skills.

  • It contains 75,000 videos with incredibly detailed descriptions (not just "dancing," but "raising the left arm while bending the right knee").
  • It includes 38,000 tricky questions like, "Did the dancer lift her left or right hand first?"
  • It has a special "Hallucination Test" to see if the AI invents fake moves.

Why Does This Matter?

Think of current video AI as a tourist who takes a blurry photo and says, "I saw a park."
This new method turns the AI into a biomechanics expert who can tell you, "The person in the red shirt walked 3 meters to the right, then bent their left knee at a 45-degree angle while swinging their right arm."

This is huge for:

  • Sports Analysis: Coaching athletes by breaking down their exact movements.
  • Robotics: Teaching robots to move like humans by understanding the physics of motion.
  • Accessibility: Giving detailed, accurate descriptions for the visually impaired that go beyond "a person is walking."

In short, KPM-Bench teaches AI to stop guessing and start measuring, turning vague descriptions into precise, trustworthy stories about human movement.
