UniVBench: Towards Unified Evaluation for Video Foundation Models

The paper introduces UniVBench, a comprehensive benchmark featuring 200 high-quality, human-created multi-shot videos and a unified agentic evaluation system (UniV-Eval) to holistically assess video foundation models across understanding, generation, editing, and reconstruction tasks, addressing the limitations of existing fragmented and task-specific evaluations.

Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang, Ziyi Chen, Zhihang Tang, Wei Xu, Zuozhu Liu

Published 2026-03-09

Imagine you are a film director trying to hire a new assistant. You have three very different jobs for them:

  1. The Critic: They need to watch a movie and write a perfect review describing every detail.
  2. The Creator: They need to take a written description and magically conjure a movie out of thin air.
  3. The Editor: They need to take an existing movie and change specific things (like the weather or the actor's clothes) without breaking the scene.

Until now, we've been hiring assistants for just one of these jobs at a time. We have "Critic" tests, "Creator" tests, and "Editor" tests. But the new generation of AI models wants to be a "Super Assistant" that can do all three at once. The problem? We didn't have a single test to see if this Super Assistant was actually good at everything, or if it was just a "one-trick pony."

Enter UniVBench.

Think of UniVBench as the ultimate "All-Star Sports Day" for video AI. Instead of testing the athlete's running, swimming, and jumping separately, this benchmark throws them into a complex, multi-stage obstacle course that requires all those skills simultaneously.

Here is how it works, broken down into simple concepts:

1. The Obstacle Course (The Dataset)

Most previous tests used short, simple clips or movies ripped from the internet (which can be legally messy). UniVBench is different.

  • The Content: The team created 200 brand-new, high-quality videos from scratch. These aren't just 5-second loops; they are like short movie scenes with multiple "shots" (camera angles) and a real story.
  • The Instructions: For every video, they wrote a detailed "script" (what the video looks like), a "director's note" (how to edit it), and even "reference photos" (what the characters should look like).
  • The New Challenge: They added a new task called Video Reconstruction. Imagine showing the AI a video, asking it to describe the video in words, and then asking it to remake the video based only on those words. If the AI forgets a detail in the description, the new video will be wrong. This tests whether the AI truly "understands" what it sees (see the sketch just after this list).
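
To make the round trip concrete, here is a minimal sketch of the describe-then-regenerate idea. The `captioner`, `generator`, and `similarity` objects are hypothetical stand-ins for whatever models and metric you plug in; they are not the paper's actual interfaces.

```python
# A minimal sketch of video reconstruction as a round trip.
# `captioner`, `generator`, and `similarity` are illustrative assumptions,
# not UniVBench's real API.

def reconstruction_score(video, captioner, generator, similarity):
    """Caption a video, regenerate it from the caption alone, then
    score how much of the original survived the round trip."""
    caption = captioner.describe(video)      # understanding: video -> text
    rebuilt = generator.synthesize(caption)  # generation: text -> video
    return similarity(video, rebuilt)        # any detail the caption dropped
                                             # shows up as a lower score
```

The point of the design is that the text caption is the only bridge between the two videos, so a high score is only possible if the understanding step captured every detail the generation step needed.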

2. The Referee (UniV-Eval)

In the past, grading these videos was messy.

  • The Old Way: It was like a teacher giving a student a single grade of "75/100" for an essay. You know they passed, but you don't know where they lost points. Was the grammar bad? Was the story boring? Was the spelling wrong?
  • The UniVBench Way: They built a Smart Referee System (called UniV-Eval). Instead of one number, this referee breaks the performance down into 21 specific categories, like:
    • Did the lighting look right?
    • Did the camera move smoothly?
    • Did the characters' clothes stay consistent from shot to shot?
    • Was the background consistent?

This system acts like a film critic with a magnifying glass. It doesn't just say "Good job." It says, "The lighting was perfect, but the character's hand disappeared for a second, and the background changed color when it shouldn't have."
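
In code, per-dimension grading is a small change from single-number grading, but it is what makes the report actionable. The sketch below is an assumption-laden illustration: the four dimension names and the `judge` callable are placeholders, and UniV-Eval's real rubric spans 21 categories with an agentic judging pipeline not reproduced here.

```python
# Sketch of per-dimension grading instead of one overall number.
# Dimension names and the `judge` callable are illustrative assumptions;
# the real UniV-Eval rubric has 21 categories.

DIMENSIONS = [
    "lighting",
    "camera_motion",
    "character_consistency",
    "background_consistency",
]

def grade(video, reference, judge):
    """Return a dimension -> score report plus an overall average,
    so you can see *where* a model lost points, not just that it did."""
    report = {dim: judge(video, reference, criterion=dim) for dim in DIMENSIONS}
    report["overall"] = sum(report.values()) / len(DIMENSIONS)
    return report
```

A developer reading this report can fix the one failing dimension instead of guessing at a blended score.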

3. The Results (What We Learned)

When they ran this "All-Star" test on current AI models, the results were eye-opening:

  • Specialists vs. Generalists: Some AIs are amazing at making videos but terrible at understanding them (like a painter who can't explain art). Others are great at watching videos but can't create them.
  • The "Action" Problem: Almost every model struggled with movement. If you ask an AI to make a video of a dog running, it often makes the dog slide or freeze. It's like a puppeteer who can draw a perfect dog but can't make the puppet's legs move naturally.
  • The Gap: No single model currently exists that is perfect at understanding, creating, and editing all at once. UniVBench proves that while we are getting closer, the "Super Assistant" isn't quite ready for prime time yet.

Why Does This Matter?

Think of video AI as a new language. Before, we were teaching students to read, write, and speak separately. UniVBench is the first fluency test that checks if they can hold a real conversation, tell a story, and react to questions all at the same time.

By providing a fair, detailed, and comprehensive way to measure these models, UniVBench gives developers a clear map of where they are failing. It's the difference between saying, "Your car is fast," and saying, "Your car is fast, but the brakes squeak, the AC is broken, and the GPS is wrong."

In short: UniVBench is the first "Report Card" that actually tells us if our video AI is a true genius or just a one-trick pony, helping us build the next generation of machines that can truly see, understand, and create the world around us.