MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

This paper introduces MUGEN, a comprehensive benchmark showing that Large Audio-Language Models struggle with multi-audio understanding as the number of audio inputs grows, and demonstrates that combining training-free strategies such as Audio-Permutational Self-Consistency with Chain-of-Thought prompting can significantly improve performance.

Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo, Ping-Le Tsai, Yen-Ting Piao, Hung-Wei Chen, Ting-Lin Hsiao, Yun-Man Hsu, Ke-Han Lu, Hung-yi Lee

Published Wed, 11 Ma

Imagine you are at a busy cocktail party. You aren't just listening to one person talking; you are trying to understand a conversation happening between three people, while also noticing the music playing in the background, the sound of clinking glasses, and the emotional tone of a joke being told.

This is the challenge that MUGEN (Multi-audio Grounding and Understanding Benchmark) was built to test.

Here is a simple breakdown of the paper, using everyday analogies:

1. The Problem: The "Single-Channel" Blind Spot

For a long time, AI models (Large Audio-Language Models, or LALMs) have been like excellent solo listeners. If you play them one song or one speech, they are great at understanding it. They can tell you if a singer is happy or if a speaker is angry.

But real life isn't a solo performance. It's a chaotic mix.

  • The Gap: Current AI is mostly tested on "solo" audio. The researchers realized that when you throw multiple sounds at these AIs at once (like a speaker talking over music, or two people arguing), the AI gets confused. It's like asking a person who is great at reading a book to suddenly read three books printed on top of each other simultaneously.

2. The Solution: The "MUGEN" Party Game

To fix this, the researchers created MUGEN, which is essentially a giant, difficult party game for AI.

  • The Setup: The AI is given a text instruction, like: "Find the clip where the speaker sounds the most angry."
  • The Twist: Instead of giving the AI one clip to analyze, they give it a list of several audio clips (up to five) in a single prompt.
  • The Goal: The AI has to listen to all five, compare them, and pick the one that fits the description.
  • The Difficulty: The game gets harder in two ways:
    1. More Players: They increase the number of audio clips from 2 up to 5.
    2. Tricky Clues: Some clues aren't about what is being said (the words), but how it sounds (the emotion, the accent, the background noise). This stops the AI from just "reading the transcript" and forces it to actually listen.
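To make the setup above concrete, here is a minimal Python sketch of what one MUGEN-style test item might look like. The field names and the `build_prompt` helper are illustrative assumptions, not the benchmark's actual schema:

```python
# Hypothetical shape of one MUGEN-style test item (illustrative only).
task = {
    "instruction": "Find the clip where the speaker sounds the most angry.",
    "clips": ["clip_a.wav", "clip_b.wav", "clip_c.wav", "clip_d.wav"],
    "answer_index": 2,           # ground-truth clip
    "cue_type": "non-semantic",  # emotion/accent/noise vs. transcript content
}

def build_prompt(task):
    """Flatten one item into an ordered multi-audio prompt:
    the instruction first, then each clip labeled by its position."""
    lines = [task["instruction"]]
    for i, clip in enumerate(task["clips"], start=1):
        lines.append(f"Audio {i}: {clip}")
    return "\n".join(lines)
```

The key design point is the `cue_type` split: items tagged non-semantic cannot be solved from a transcript alone, which is what forces the model to actually listen.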

3. The Results: The AI Gets Overwhelmed

When they played this game with the smartest AI models available (including some from Google and open-source projects), they found some surprising weaknesses:

  • The "Crowded Room" Effect: As soon as they added more audio clips (going from 2 to 5), the AI's performance crashed. Finding a specific face in a crowd of 5 people is easy; finding that same face in a crowd of 50 is much harder. The AI struggles to "scale up" its listening power in the same way.
  • The "Meaning vs. Feeling" Gap: The AIs were okay at understanding the words (semantics), but terrible at understanding the vibe (non-semantic stuff like emotion, pitch, or background noise). They could tell you what someone said, but they often failed to tell you how they felt when saying it, especially when other sounds were present.
  • The "Proprietary" Advantage: The most expensive, closed-source models (like Google's Gemini) did better than the free, open-source ones, but even they weren't perfect. They still got confused in the "crowded room."

4. The Fix: The "Shuffling" Trick

The researchers wanted to see if they could help the AI without retraining it (retraining is slow and expensive, like teaching an adult a new language from scratch). They tried a clever trick called Audio-Permutational Self-Consistency (APSC).

  • The Analogy: Imagine you are trying to guess the winner of a race by looking at the runners. If you always look at them from left to right, you might get biased by who is standing first.
  • The Trick: The researchers told the AI: "Don't just look at the list of 5 audio clips once. Look at them in 10 different random orders."
    • Order 1: Clip A, B, C, D, E
    • Order 2: Clip C, A, E, B, D
    • ...and so on.
  • The Result: By shuffling the order, the AI stops relying on "position" (e.g., "I always pick the first one") and starts actually comparing the sounds. When they combined this shuffling with a "Chain of Thought" (asking the AI to think step-by-step), the accuracy jumped significantly.
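The shuffle-and-vote procedure above can be sketched in a few lines of Python. This is a simplified illustration, assuming a hypothetical `model(instruction, clips)` callable that returns the index of the clip it picks from the order it was shown; the real APSC pipeline in the paper may differ in its details:

```python
import random
from collections import Counter

def apsc_answer(model, instruction, clips, n_orders=10, seed=0):
    """Audio-Permutational Self-Consistency (sketch).

    Ask the model the same question under several random orderings of
    the clips, map each pick back to the original clip, and take a
    majority vote. This washes out position bias ("always pick the
    first one") because each clip appears in many different positions.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_orders):
        order = list(range(len(clips)))
        rng.shuffle(order)                   # a new random presentation order
        shuffled = [clips[i] for i in order]
        pick = model(instruction, shuffled)  # index within the shuffled list
        votes[order[pick]] += 1              # map back to the original index
    return votes.most_common(1)[0][0]        # most-voted original clip
```

A model with no position bias votes for the same underlying clip in every ordering, so the majority is unanimous; a position-biased model spreads its votes across clips and its bias cancels out.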

The Big Takeaway

This paper is a wake-up call. It tells us that while our AI audio assistants are getting smarter, they are still bad at multitasking. They struggle when the world gets noisy and complex.

However, the researchers found a "cheat code": Shuffling the input order helps the AI focus better. This is a simple, free way to make current AI models much better at understanding the messy, multi-sound reality of our daily lives.

In short: We built a test to see if AI can handle a noisy party. It mostly failed, but we found a trick (shuffling the playlist) that helps it listen a little bit better.