SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases

This paper introduces SCENEBench, a comprehensive benchmark suite designed to evaluate Large Audio Language Models on critical non-speech and cross-lingual audio understanding tasks relevant to assistive and industrial applications, revealing significant performance gaps and latency challenges in current state-of-the-art models.

Laya Iyer, Angelina Wang, Sanmi Koyejo

Published Wed, 11 Ma

Imagine you have a very smart robot friend who can listen to the world. You ask it, "What did that person say?" and it replies perfectly. But what if you ask, "What else is happening in this room?" or "Is that siren getting closer or moving away?" or "Is that person coughing because they are sick or just clearing their throat?"

Most of today's super-smart AI listening robots are like obsessive note-takers. They are incredible at writing down exactly what words were spoken (like a court stenographer), but they often miss the rest of the story. They might tell you a person said "Hello," but they completely ignore the fact that a dog was barking, a car was screeching, or the person was whispering in fear.

This paper introduces a new "report card" called SCENEBench to test if these AI robots can actually understand a scene, not just transcribe words.

Here is a breakdown of the paper using simple analogies:

1. The Problem: The "Blind Note-Taker"

Current AI models are like a student who is so focused on copying the teacher's lecture that they don't notice the fire alarm going off, the window breaking, or the teacher sounding sad.

  • The Old Tests: Previous tests only asked, "Did you get the words right?"
  • The New Reality: In the real world (like helping a deaf person or monitoring a factory), knowing what is said is less important than knowing what is happening around it.

2. The New Test: SCENEBench

The authors created a four-part obstacle course to see if AI can handle real-life chaos. Think of it as a "Driver's Ed" test for AI ears.

Task A: The Background Noise Detective

  • The Scenario: Imagine a person talking at a busy coffee shop.
  • The Test: Can the AI hear the person and the espresso machine grinding, the clinking of cups, and the rain outside?
  • The Result: The AI is great at hearing the person. But if you don't specifically ask, "What else do you hear?", the AI usually ignores the background completely. It's like a student who only writes down the teacher's voice and misses the fire drill.

Task B: The "Siren Sense" (Noise Localization)

  • The Scenario: A siren is wailing. Is it getting closer (danger!) or moving away (safety)?
  • The Test: The AI has to listen to how the volume changes to guess the direction.
  • The Result: The AI is terrible at guessing this on its own. It's like trying to tell which way a car is heading just from the sound of its engine: possible, but only if you know to listen for that clue. If you ask directly, "Is it coming or going?", the AI does better, but it still struggles with complex movements (like a siren passing by and then swinging back and forth). A rough sketch of this volume clue follows the list.
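
To make that "volume clue" concrete, here is a minimal, hypothetical sketch of how a loudness trend alone could be turned into an "approaching vs. receding" guess. This is not the paper's method; the window size, threshold, and function name are illustrative assumptions, and it is exactly the kind of crude heuristic that breaks on the complex movements mentioned above.

```python
import numpy as np

def loudness_trend(samples: np.ndarray, sr: int, window_s: float = 0.5) -> str:
    """Guess 'approaching' or 'receding' from how RMS loudness changes over time.

    A crude illustration only: a single slope cannot handle a siren that
    passes by and swings back and forth.
    """
    window = int(sr * window_s)
    n_windows = len(samples) // window
    # RMS energy per non-overlapping window -> a loudness envelope over time
    frames = samples[: n_windows * window].reshape(n_windows, window)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    # Fit a straight line to the envelope; the sign of the slope is the guess
    slope = np.polyfit(np.arange(len(rms)), rms, 1)[0]
    if abs(slope) < 1e-4:  # arbitrary "roughly constant" threshold
        return "stationary or passing"
    return "approaching" if slope > 0 else "receding"

# Example: a synthetic siren-like tone that gets steadily louder over 4 seconds
sr = 16_000
t = np.linspace(0, 4, 4 * sr)
clip = np.sin(2 * np.pi * 700 * t) * np.linspace(0.1, 1.0, t.size)
print(loudness_trend(clip, sr))  # -> "approaching"
```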

Task C: The Polyglot Translator

  • The Scenario: A person is speaking in English but suddenly switches to Spanish, then Hindi, then back to English.
  • The Test: Can the AI write down the whole sentence, keeping the different languages intact?
  • The Result: The AI often gets confused. Instead of writing "I have a fifteen-day vacation quiero un viaje," it translates the whole thing into English, erasing the foreign words. It's like a translator who refuses to let you speak your native tongue, forcing everything into English even when you didn't ask. (A toy check for this failure mode is sketched below.)
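
One way to picture catching that "silent translation" failure: a toy scoring sketch that checks whether the code-switched words from the reference transcript survive verbatim in the model's output. The function, word list, and metric here are my own illustration, not the paper's evaluation.

```python
def foreign_word_retention(reference: str, hypothesis: str, foreign_words: set[str]) -> float:
    """Fraction of the reference's non-English words that appear verbatim in the hypothesis.

    Illustrative only: it flags the failure mode where a model translates
    code-switched words instead of transcribing them.
    """
    ref_foreign = [w for w in reference.lower().split() if w in foreign_words]
    if not ref_foreign:
        return 1.0  # nothing to preserve
    hyp_words = set(hypothesis.lower().split())
    kept = sum(w in hyp_words for w in ref_foreign)
    return kept / len(ref_foreign)

reference = "I have a fifteen-day vacation quiero un viaje"
hypothesis = "I have a fifteen-day vacation I want a trip"  # the Spanish got translated away
print(foreign_word_retention(reference, hypothesis, {"quiero", "un", "viaje"}))  # -> 0.0
```

A retention score of 0.0 means every Spanish word was erased, which is exactly the behavior described above.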

Task D: The "Vibe" Checker (Vocal Characterizers)

  • The Scenario: A person is speaking, but they also cough, laugh, sneeze, or whisper.
  • The Test: Can the AI tell the difference between a cough and a laugh, even if the words are the same?
  • The Result: The AI is actually pretty good at this! It can tell when someone is laughing or coughing. However, it sometimes gets confused between similar sounds (like a yawn vs. a sigh).

3. The Big Surprise: The "Synthetic" vs. "Real" Gap

The researchers built these tests using "fake" audio: mixing two recordings together on a computer (a sketch of this kind of mixing appears after the list below). They were worried this wouldn't reflect real life.

  • The Analogy: It's like practicing driving in a video game and then taking the test on a real highway.
  • The Finding: Surprisingly, the results were similar. The AI that did well in the "video game" (synthetic data) also did well in the "real highway" (real recordings). This means the test is a fair way to judge the AI's true abilities.
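
For readers curious what "mixing two recordings together on a computer" looks like, here is a minimal sketch of one common recipe: overlaying a background clip onto a speech clip at a chosen signal-to-noise ratio. The paper's exact mixing pipeline is not detailed in this summary, so the function and its parameters are assumptions for illustration.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, background: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay a background recording onto speech at a chosen signal-to-noise ratio (in dB).

    Illustrative only; real pipelines also handle loudness normalization,
    source selection, and the metadata that becomes the test labels.
    """
    # Loop or trim the background so it matches the speech length
    reps = int(np.ceil(len(speech) / len(background)))
    background = np.tile(background, reps)[: len(speech)]
    # Scale the background so that speech power / background power matches snr_db
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(background ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * background
    # Normalize to avoid clipping when saving as fixed-point audio
    return mixed / max(1.0, float(np.max(np.abs(mixed))))
```

The authors' worry is easy to see here: a clip built this way is acoustically plausible but not "real", so it matters that models scored similarly on both.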

4. The Verdict: Why Does This Matter?

The paper concludes that current AI is optimized for the "What," not the "How" or the "Where."

  • Why? Because most training data is "clean." It's like teaching a chef only how to cook perfect, isolated ingredients, but never how to cook a messy, noisy family dinner.
  • The Fix: We need to train these AIs on messy, real-world audio where background noise, movement, and mixed languages happen naturally.

The Takeaway

SCENEBench is a wake-up call. It tells us that while our AI listening robots are brilliant at being stenographers, they are still clumsy at being context-aware observers.

If we want AI to help deaf people hear approaching cars or help factories detect dangerous machine noises, we can't just teach them to read words. We have to teach them to listen to the whole room. This new benchmark is the first step in making sure they actually learn that skill.