KPM-Bench: A Kinematic Parsing Motion Benchmark for Fine-grained Motion-centric Video Understanding

This paper introduces KPM-Bench, an open-source dataset with fine-grained motion annotations and a dedicated hallucination evaluation set. Alongside it, the authors propose the MoPE algorithm and a GRPO-based training framework that together address two limitations of current video captioning models: describing complex human motions accurately and avoiding motion-related hallucinations.

Boda Lin, Yongjie Zhu, Xiaocheng Gong, Wenyu Qin, Meng Wang

Published 2026-02-23

Imagine you are watching a dance performance. A standard video description might say, "A woman is dancing in a garden." It's true, but it's like describing a symphony by just saying, "Music is playing." It misses the magic: the way her left arm sweeps up, the rhythm of her feet, the specific bend of her knee.

Current AI models are great at spotting the "music" (the general scene), but they often struggle to describe the "dance moves" in detail. Worse, they sometimes start hallucinating—inventing moves that never happened, like saying she spun around when she actually just stepped forward.

This paper introduces KPM-Bench, a new toolkit designed to teach AI to become a choreographer, not just a spectator. Here's how they did it, broken down into simple concepts:

1. The "Robot Coach" (Kinematic Parsing)

Instead of just asking an AI to "look and guess," the researchers gave the AI a robot coach.

  • How it works: Before the AI writes a single word, they run the video through a system that maps the human skeleton (like a stick-figure drawing) and calculates the physics of the movement.
  • The Analogy: Imagine a sports analyst who doesn't just watch the game but has a stopwatch and a protractor for every single joint. The system measures: How fast did that arm move? What was the angle of that knee? Was the movement rhythmic or jerky?
  • The Result: The AI isn't guessing; it's reading a "physics report" of the dance. It knows exactly when the left leg lifted and how fast the right hand swung.
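The paper doesn't spell out the parser's internals here, but the "stopwatch and protractor" idea maps onto simple geometry over skeleton keypoints. A minimal sketch, assuming 2D joint coordinates per frame (the function names and the tiny keypoint format are illustrative, not the paper's API):

```python
import math

def joint_angle(a, b, c):
    """Protractor: the angle (degrees) at joint b, formed by points a-b-c.
    E.g. hip-knee-ankle gives the knee bend: ~180 for a straight leg."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def joint_speed(p_prev, p_curr, dt):
    """Stopwatch: how fast a keypoint moved between two frames (units/sec)."""
    return math.hypot(p_curr[0] - p_prev[0], p_curr[1] - p_prev[1]) / dt

# A right-angle knee bend: hip at (0,0), knee at (0,1), ankle at (1,1)
bend = joint_angle((0.0, 0.0), (0.0, 1.0), (1.0, 1.0))  # 90.0 degrees
# A wrist moving from (0,0) to (3,4) in one second
speed = joint_speed((0.0, 0.0), (3.0, 4.0), 1.0)  # 5.0 units/sec
```

Run per joint, per frame, these numbers become the "physics report" the captioner reads instead of guessing from pixels.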

2. The "Grammar of Motion" (PaMoR)

Once the robot coach has the numbers, the AI needs to turn them into a story. The researchers created a special language structure called PaMoR (Parsing-based Motion Representation).

  • The Analogy: Think of this as a Lego instruction manual for describing movement. Instead of a messy paragraph, the AI breaks the dance down into specific blocks:
    • Who is moving? (The dancer)
    • What are they doing? (Raising an arm)
    • How are they doing it? (Slowly, gracefully)
    • Where are they going? (Upward)
  • By forcing the AI to fill in these "Lego blocks" first, the final description becomes structured and detailed, and far harder to get wrong on the basic facts.
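The exact PaMoR schema isn't given in this summary, but the four "Lego blocks" above suggest a slot-filling structure. A minimal sketch of what one motion block might look like (field names are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class MotionUnit:
    """One PaMoR-style 'Lego block': structured slots the model must
    fill before any free-form prose is generated (illustrative)."""
    subject: str    # who is moving
    action: str     # what they are doing
    manner: str     # how they are doing it
    direction: str  # where the motion is going

    def render(self) -> str:
        """Compose the slots into one grounded sentence."""
        return f"{self.subject} is {self.action} {self.manner}, moving {self.direction}."

unit = MotionUnit(
    subject="the dancer",
    action="raising her left arm",
    manner="slowly and gracefully",
    direction="upward",
)
sentence = unit.render()
```

Because every sentence is composed from filled slots, a downstream checker can verify each slot independently against the kinematic report rather than parsing free text.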

3. The "Hallucination Police" (MoPE)

Even with a robot coach, AI can still get creative in the wrong way. To fix this, they built a Hallucination Police algorithm called MoPE.

  • The Analogy: Imagine a strict editor who reads the AI's story and checks it against the "physics report" line-by-line.
    • AI says: "She spun three times."
    • MoPE checks the report: "Wait, the report says she only turned 90 degrees."
    • MoPE says: "Delete that! That's a lie."
  • They used this police officer to train the AI using a special reward system (GRPO). Every time the AI told the truth about the motion, it got a gold star. Every time it made up a move, it lost points. Eventually, the AI learned that accuracy is the only way to win.
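The summary doesn't detail MoPE's scoring or the GRPO objective, but the line-by-line check against the physics report implies a verifiable reward signal. A toy sketch of that idea (the tolerance and reward values are assumptions, not the paper's):

```python
def motion_reward(claimed_turn_deg: float,
                  measured_turn_deg: float,
                  tolerance: float = 15.0) -> float:
    """Toy MoPE-style check: +1 'gold star' if the described rotation
    matches the measured one within tolerance, -1 for a hallucination."""
    if abs(claimed_turn_deg - measured_turn_deg) <= tolerance:
        return 1.0
    return -1.0

# "She spun three times" (3 x 360 = 1080 degrees) vs. a measured 90-degree turn
penalty = motion_reward(1080.0, 90.0)  # -1.0: delete that, that's a lie
# "She made a quarter turn" (~90 degrees) vs. the same measurement
reward = motion_reward(95.0, 90.0)     # +1.0: accurate, keep it
```

In GRPO-style training, rewards like this would be computed over groups of sampled captions, so the model is pushed toward the truthful ones relative to its own alternatives.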

4. The New Playground (KPM-Bench)

Finally, they built a massive new playground called KPM-Bench to test these skills.

  • It contains 75,000 videos with incredibly detailed descriptions (not just "dancing," but "raising the left arm while bending the right knee").
  • It includes 38,000 tricky questions like, "Did the dancer lift her left or right hand first?"
  • It has a special "Hallucination Test" to see if the AI invents fake moves.

Why Does This Matter?

Think of current video AI as a tourist who takes a blurry photo and says, "I saw a park."
This new method turns the AI into a biomechanics expert who can tell you, "The person in the red shirt walked 3 meters to the right, then bent their left knee at a 45-degree angle while swinging their right arm."

This is huge for:

  • Sports Analysis: Coaching athletes by breaking down their exact movements.
  • Robotics: Teaching robots to move like humans by understanding the physics of motion.
  • Accessibility: Giving detailed, accurate descriptions for the visually impaired that go beyond "a person is walking."

In short, KPM-Bench teaches AI to stop guessing and start measuring, turning vague descriptions into precise, trustworthy stories about human movement.
