Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy

This paper reveals that current video benchmarks often fail to evaluate audio-visual reasoning because they over-rely on visual cues. It then demonstrates that integrating speech encoders with efficient token compression significantly improves performance on tasks requiring speech comprehension and cross-modal grounding.

Geewook Kim, Minjoon Seo

Published 2026-03-10
📖 4 min read · ☕ Coffee break read

Imagine you are trying to teach a robot to understand movies. You show it thousands of films, but every time you play a clip, you mute the sound. The robot learns to guess what's happening based only on what it sees.

This is exactly what has been happening in the world of "Video-LLMs" (AI models that watch videos). Even though we have incredibly smart AI that can understand speech and sounds (like a human listening to a lecture), most video-AI systems ignore the audio completely.

This paper asks a simple, provocative question: "Do modern video-AIs actually need to listen?"

Here is the breakdown of their findings, explained with some everyday analogies.

1. The "Silent Movie" Trap

The authors discovered that the "tests" (benchmarks) we use to grade these AI models are rigged. They are like a driving test where the instructor never asks you to listen to the engine or the radio.

  • The Analogy: Imagine a test where you have to guess the plot of a movie, but you are only allowed to look at one single frozen frame from the middle of the film.
  • The Shocking Result: The researchers found that for many popular tests, an AI could get 77% of the answers right just by looking at that one silent picture. It didn't need to hear the dialogue or the music at all!
  • The Problem: Because the tests don't require listening, the AI developers never bother to teach their models how to listen. They treat the audio track as optional "extra baggage" and throw it away to save space.
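The audit behind the 77% figure boils down to a simple procedure: show the model only one middle frame, mute everything, and count how many questions it still gets right. Here is a minimal sketch of that procedure, where `model_answer` is a hypothetical stand-in for any vision-only Video-LLM call (not the paper's actual evaluation code):

```python
def single_frame_audit(questions, model_answer):
    """Fraction of benchmark questions a model answers correctly
    when shown only one middle frame and no audio track."""
    correct = 0
    for q in questions:
        # One silent picture from the middle of the film
        middle_frame = q["frames"][len(q["frames"]) // 2]
        if model_answer(middle_frame, q["question"]) == q["answer"]:
            correct += 1
    return correct / len(questions)

# Toy illustration with a mocked "model" that always guesses the same thing:
questions = [
    {"frames": ["f0", "f1", "f2"], "question": "q1", "answer": "cat"},
    {"frames": ["f0", "f1", "f2"], "question": "q2", "answer": "dog"},
]
mock = lambda frame, question: "cat"
print(single_frame_audit(questions, mock))  # → 0.5
```

If a score like this comes out high on a "video" benchmark, the benchmark is not really testing video understanding, let alone listening.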

2. The "Heavy Suitcase" Problem

So, why don't we just add the audio back in?
The authors explain that audio is a data monster.

  • The Analogy: If a video is a suitcase, the pictures are a few shirts. The audio, however, is like a suitcase filled with 90,000 tiny pebbles (tokens) for just one hour of video.
  • The Bottleneck: If you try to feed all those pebbles to the AI, it gets overwhelmed. It takes too long to process, and the AI gets "tired" (high latency). This is why current models like Qwen2.5-Omni are slow; they try to carry the whole heavy suitcase.
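The suitcase math is easy to check. Assuming a speech encoder that emits roughly 25 audio tokens per second (an illustrative rate; real encoders vary), one hour of audio becomes:

```python
TOKENS_PER_SECOND = 25        # illustrative rate; actual encoders differ
SECONDS_PER_HOUR = 3600

audio_tokens_per_hour = TOKENS_PER_SECOND * SECONDS_PER_HOUR
print(audio_tokens_per_hour)  # → 90000 "pebbles" for one hour of video
```

Ninety thousand extra tokens per hour is why naively concatenating the full audio stream makes the model slow.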

3. The "Compression" Solution

The team built a new system that acts like a smart translator.
Instead of feeding the AI every single pebble, they built a "compressor" that summarizes the audio.

  • The Analogy: Imagine you have a 1-hour lecture. Instead of transcribing every single word (90,000 words), your smart assistant listens and writes down just the 3,600 most important sentences (1 sentence per second).
  • The Magic: They tested five different types of "summarizers." The winner was a design based on Mamba (a type of AI architecture). It acts like a causal filter: it only looks at what has happened so far (like a real-time stream), which is crucial for live applications. It shrinks the audio data by 25 times without losing the meaning.
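The shape of that 25× compression can be sketched in a few lines. The paper's winning design is a learned Mamba-style (selective state-space) module; the simplest causal stand-in, shown here purely for illustration, is non-overlapping window averaging, where each summary token depends only on the audio inside its own window and nothing from the future:

```python
import numpy as np

def causal_compress(audio_tokens, ratio=25):
    """Shrink a (time, dim) stream of audio tokens by `ratio` using
    non-overlapping window averaging. Each output token summarizes
    only its own window, so the stream can be compressed on the fly,
    as a live application requires."""
    t, d = audio_tokens.shape
    t = t - (t % ratio)                       # drop a ragged tail, if any
    windows = audio_tokens[:t].reshape(-1, ratio, d)
    return windows.mean(axis=1)               # one summary token per window

# One hour of audio at ~25 tokens/sec → 90,000 tokens (dimension 8 here)
stream = np.random.randn(90_000, 8)
compressed = causal_compress(stream)
print(compressed.shape)  # → (3600, 8): 25x fewer tokens, one per second
```

Plain averaging loses detail that a learned compressor keeps; the point of the sketch is only the computation's shape: 90,000 tokens in, 3,600 out, with no peeking ahead.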

4. The Big Reveal: When Listening Actually Matters

Once they fixed the tests (by removing the "easy" questions that could be answered with just a picture) and added the compressed audio, the results were clear:

  • For "Visual" Tasks: If the question is "How many people are in the room?" or "What color is the car?", adding audio doesn't help. The AI is already good at seeing.
  • For "Listening" Tasks: If the question is "Who is speaking the quietest?" or "What did the person say in the background?", the audio is essential.
    • On these specific tasks, the AI with the "listening" module scored significantly higher.
    • Without the audio, the AI was essentially guessing. With the audio, it knew the answer.

5. The Takeaway

The paper concludes with a simple truth: Modern Video-AIs do need to listen, but only if we stop testing them on silent movies.

  • The Current State: We are building cars with great engines but no steering wheels because our driving tests only check if the car can drive in a straight line.
  • The Future: By fixing the tests to include sound and using a smart "compression" technique to handle the data, we can build AI that truly understands the world—both what it sees and what it hears.

In short: The technology to make video-AI "hear" has been there all along. We just needed to stop ignoring the soundtrack and build a better way to carry it. The authors have open-sourced their code so everyone can start building these "listening" robots today.