LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

To address the gap in evaluating omnimodal large language models on long-form content, this paper introduces LVOmniBench, a new benchmark comprising 275 videos (10–90 minutes) and 1,014 QA pairs. The results show that current models struggle with extended audio-visual comprehension: open-source models score below 35% accuracy, while Gemini 3 Pro reaches up to 65%.

Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang

Published 2026-03-20

Imagine you are trying to teach a robot how to understand the world. So far, we've taught it to look at a single photo or listen to a 10-second clip of a song. It's doing pretty well at those short tasks. But real life isn't a series of snapshots; it's a continuous, flowing movie that can last for hours, filled with people talking, music playing, and things happening in the background.

This paper introduces LVOmniBench, a new "final exam" designed to test if our smartest AI robots can actually handle these long, complex movies.

Here is the breakdown of what they did, using some simple analogies:

1. The Problem: The "Short-Clip" Trap

Until now, most AI tests were like showing a student a 10-second video of a cat jumping and asking, "Did the cat jump?" The AI could easily memorize that specific jump.

But in the real world, videos are like 90-minute documentaries or vlogs. They have:

  • Long-term memory: "What did that guy say 20 minutes ago?"
  • Complex timing: "When did the music stop and the rain start?"
  • Mixed signals: People talking over music while doing something with their hands.

Current AI models are like students who can memorize flashcards but fail a 3-hour final exam. They get lost when the story gets too long.

2. The Solution: LVOmniBench (The "Marathon" Test)

The researchers built a new test called LVOmniBench. Think of it as a marathon instead of a sprint.

  • The Course: They collected 275 real-world videos (like cooking shows, travel vlogs, and interviews) that are 10 to 90 minutes long.
  • The Questions: They wrote 1,014 questions that require the AI to use both its eyes (video) and ears (audio) at the same time (a rough sketch of how one such question might be stored and scored appears after this list).
    • Example: "The man in the video says he is looking for 'Toby.' How many times does he actually see Toby in the yard?"
    • Why it's hard: The AI has to listen to the name, watch the video to find the dog, count the sightings, and ignore the times the dog isn't there.

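To make the "marathon test" idea concrete, here is a minimal sketch of what a single LVOmniBench-style question and a simple accuracy check could look like. The field names (`video_id`, `duration_minutes`, `options`, `answer`) and the scoring helper are our own illustration, not the benchmark's actual schema or evaluation code.

```python
# Hypothetical sketch: field names and question content are illustrative,
# not the actual LVOmniBench schema.

sample_item = {
    "video_id": "vlog_0142",           # a 10-90 minute real-world video
    "duration_minutes": 47,
    "question": "The man says he is looking for 'Toby'. "
                "How many times does he actually see Toby in the yard?",
    "options": ["A. Once", "B. Twice", "C. Three times", "D. Never"],
    "answer": "B",                     # requires listening AND watching
}

def score(predictions, items):
    """Fraction of questions where the model picked the correct option."""
    correct = sum(
        1 for pred, item in zip(predictions, items) if pred == item["answer"]
    )
    return correct / len(items)

# Example: one model answers the single sample item correctly.
print(score(["B"], [sample_item]))     # 1.0
```
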
3. The Results: The "Rich Kid" vs. The "Struggling Student"

The researchers tested the best AI models on this marathon. The results were a bit shocking:

  • The "Proprietary" Models (The Rich Kids): These are the super-expensive, closed-source models like Gemini 3 Pro. They are like students with private tutors and unlimited study time. They scored around 65%. They did well, but they still got about a third of the questions wrong.
  • The "Open-Source" Models (The Struggling Students): These are the free, community-built models. They scored below 35%.
    • The Metaphor: If you guess randomly on a four-option multiple-choice test, you'd get about 25% right (a quick sanity check of that number follows this list). These open-source models were barely doing better than random guessing. They got completely lost in the long videos.

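A quick sanity check of the "barely better than random" claim, assuming every question is four-option multiple choice (the 25% figure implies this, but the exact question format is our assumption):

```python
import random

# Simulate random guessing on four-option multiple-choice questions.
NUM_QUESTIONS = 1_014          # number of QA pairs in LVOmniBench
OPTIONS = ["A", "B", "C", "D"]

random.seed(0)
answers = [random.choice(OPTIONS) for _ in range(NUM_QUESTIONS)]
guesses = [random.choice(OPTIONS) for _ in range(NUM_QUESTIONS)]

random_accuracy = sum(g == a for g, a in zip(guesses, answers)) / NUM_QUESTIONS
print(f"Random guessing:    ~{random_accuracy:.0%}")  # about 25%
print("Open-source models: below 35%")                # barely above chance
print("Gemini 3 Pro:       around 65%")               # far above chance, still imperfect
```
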
4. Where Did They Fail? (The "Hallucination" Zone)

The paper found that the AI fails in three specific ways, like a student who is bad at:

  • Counting: "How many times did the dog bark?" (The AI often loses count).
  • Music Perception: "What kind of instrument is playing?" (The AI hears noise but can't identify the melody).
  • Time Travel: "What happened 15 minutes ago?" (The AI forgets the beginning of the video by the time it reaches the end).

5. The Big Takeaway

The main message is that we are not there yet.

While AI is amazing at looking at a picture or listening to a short sentence, it is currently terrible at understanding a long, continuous story where sound and sight are mixed together. The "open-source" models are essentially blind and deaf to the nuances of long videos.

Why does this matter?
Because the future of AI isn't just about answering questions about a photo. It's about having an AI assistant that can watch a 2-hour security camera feed to find a thief, or listen to a 45-minute lecture and summarize the key points. This new test (LVOmniBench) is the first step in forcing AI developers to build models that can actually handle the length and complexity of real life.

In short: We built a harder test, and the AI failed it. Now, we know exactly what to fix to make the next generation of AI truly "omni-smart."
