Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy

This paper reveals that current video benchmarks often fail to evaluate audio-visual reasoning because they over-rely on visual cues. It then demonstrates that integrating speech encoders with efficient token compression significantly improves performance on tasks requiring speech comprehension and cross-modal grounding.

Geewook Kim, Minjoon Seo

Published 2026-03-10
📖 4 min read · ☕ Coffee break read

Imagine you are trying to teach a robot to understand movies. You show it thousands of films, but every time you play a clip, you mute the sound. The robot learns to guess what's happening based only on what it sees.

This is exactly what has been happening in the world of "Video-LLMs" (AI models that watch videos). Even though we have incredibly smart AI that can understand speech and sounds (like a human listening to a lecture), most video-AI systems ignore the audio completely.

This paper asks a simple, provocative question: "Do modern video-AIs actually need to listen?"

Here is the breakdown of their findings, explained with some everyday analogies.

1. The "Silent Movie" Trap

The authors discovered that the "tests" (benchmarks) we use to grade these AI models are rigged. They are like a driving test where the instructor never asks you to listen to the engine or the radio.

  • The Analogy: Imagine a test where you have to guess the plot of a movie, but you are only allowed to look at one single frozen frame from the middle of the film.
  • The Shocking Result: The researchers found that for many popular tests, an AI could get 77% of the answers right just by looking at that one silent picture. It didn't need to hear the dialogue or the music at all!
  • The Problem: Because the tests don't require listening, the AI developers never bother to teach their models how to listen. They treat the audio track as optional "extra baggage" and throw it away to save space.
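The audit behind the 77% figure boils down to a simple procedure: show the model only one middle frame, mute everything, and count how many questions it still gets right. Here is a minimal sketch of that procedure, where `model_answer` is a hypothetical stand-in for any vision-only Video-LLM call (not the paper's actual evaluation code):

```python
def single_frame_audit(questions, model_answer):
    """Fraction of benchmark questions a model answers correctly
    when shown only one middle frame and no audio track."""
    correct = 0
    for q in questions:
        # One silent picture from the middle of the film
        middle_frame = q["frames"][len(q["frames"]) // 2]
        if model_answer(middle_frame, q["question"]) == q["answer"]:
            correct += 1
    return correct / len(questions)

# Toy illustration with a mocked "model" that always guesses the same thing:
questions = [
    {"frames": ["f0", "f1", "f2"], "question": "q1", "answer": "cat"},
    {"frames": ["f0", "f1", "f2"], "question": "q2", "answer": "dog"},
]
mock = lambda frame, question: "cat"
print(single_frame_audit(questions, mock))  # → 0.5
```

If a score like this comes out high on a "video" benchmark, the benchmark is not really testing video understanding, let alone listening.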

2. The "Heavy Suitcase" Problem

So, why don't we just add the audio back in?
The authors explain that audio is a data monster.

  • The Analogy: If a video is a suitcase, the pictures are a few shirts. The audio, however, is like a suitcase filled with 90,000 tiny pebbles (tokens) for just one hour of video.
  • The Bottleneck: If you try to feed all those pebbles to the AI, it gets overwhelmed. It takes too long to process, and the AI gets "tired" (high latency). This is why current models like Qwen2.5-Omni are slow; they try to carry the whole heavy suitcase.
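The suitcase math is easy to check. Assuming a speech encoder that emits roughly 25 audio tokens per second (an illustrative rate; real encoders vary), one hour of audio becomes:

```python
TOKENS_PER_SECOND = 25        # illustrative rate; actual encoders differ
SECONDS_PER_HOUR = 3600

audio_tokens_per_hour = TOKENS_PER_SECOND * SECONDS_PER_HOUR
print(audio_tokens_per_hour)  # → 90000 "pebbles" for one hour of video
```

Ninety thousand extra tokens per hour is why naively concatenating the full audio stream makes the model slow.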

3. The "Compression" Solution

The team built a new system that acts like a smart translator.
Instead of feeding the AI every single pebble, they built a "compressor" that summarizes the audio.

  • The Analogy: Imagine you have a 1-hour lecture. Instead of transcribing every single word (90,000 words), your smart assistant listens and writes down just the 3,600 most important sentences (1 sentence per second).
  • The Magic: They tested five different types of "summarizers." The winner was a design based on Mamba (a type of AI architecture). It acts like a causal filter: it only looks at what has happened so far (like a real-time stream), which is crucial for live applications. It shrinks the audio data by 25 times without losing the meaning.
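The shape of that 25× compression can be sketched in a few lines. The paper's winning design is a learned Mamba-style (selective state-space) module; the simplest causal stand-in, shown here purely for illustration, is non-overlapping window averaging, where each summary token depends only on the audio inside its own window and nothing from the future:

```python
import numpy as np

def causal_compress(audio_tokens, ratio=25):
    """Shrink a (time, dim) stream of audio tokens by `ratio` using
    non-overlapping window averaging. Each output token summarizes
    only its own window, so the stream can be compressed on the fly,
    as a live application requires."""
    t, d = audio_tokens.shape
    t = t - (t % ratio)                       # drop a ragged tail, if any
    windows = audio_tokens[:t].reshape(-1, ratio, d)
    return windows.mean(axis=1)               # one summary token per window

# One hour of audio at ~25 tokens/sec → 90,000 tokens (dimension 8 here)
stream = np.random.randn(90_000, 8)
compressed = causal_compress(stream)
print(compressed.shape)  # → (3600, 8): 25x fewer tokens, one per second
```

Plain averaging loses detail that a learned compressor keeps; the point of the sketch is only the computation's shape: 90,000 tokens in, 3,600 out, with no peeking ahead.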

4. The Big Reveal: When Listening Actually Matters

Once they fixed the tests (by removing the "easy" questions that could be answered with just a picture) and added the compressed audio, the results were clear:

  • For "Visual" Tasks: If the question is "How many people are in the room?" or "What color is the car?", adding audio doesn't help. The AI is already good at seeing.
  • For "Listening" Tasks: If the question is "Who is speaking the quietest?" or "What did the person say in the background?", the audio is essential.
    • On these specific tasks, the AI with the "listening" module scored significantly higher.
    • Without the audio, the AI was essentially guessing. With the audio, it knew the answer.

5. The Takeaway

The paper concludes with a simple truth: Modern Video-AIs do need to listen, but only if we stop testing them on silent movies.

  • The Current State: We are building cars with great engines but no steering wheels because our driving tests only check if the car can drive in a straight line.
  • The Future: By fixing the tests to include sound and using a smart "compression" technique to handle the data, we can build AI that truly understands the world—both what it sees and what it hears.

In short: The technology to make video-AI "hear" has been there all along. We just needed to stop ignoring the soundtrack and build a better way to carry it. The authors have open-sourced their code so everyone can start building these "listening" robots today.