Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

This paper introduces Dr. SHAP-AV, a Shapley value-based framework that reveals audio-visual speech recognition models exhibit a persistent audio bias and dynamically shift toward visual reliance under noise, providing critical insights for developing robust modality-weighting mechanisms.

Umberto Cappellazzo, Stavros Petridis, Maja Pantic

Published Fri, 13 Ma

Imagine you are trying to understand a friend who is speaking to you through a very noisy window. You can hear their voice, but it's crackly and full of static. However, you can also see their lips moving clearly through the glass.

Most modern computers (AI models) designed to understand speech are like super-smart detectives. They have two main clues to solve the mystery of what was said: Audio (the sound) and Visual (the lip movements).

The paper itself, "Dr. SHAP-AV," is essentially a forensic investigation into how these AI detectives use their two clues. The researchers wanted to answer a simple question: "When the noise gets loud, does the AI stop listening and start watching, or does it stubbornly keep trying to listen even when it can't?"

Here is the breakdown of their findings using simple analogies.

1. The Detective's Toolkit: "Shapley Values"

To figure out how much the AI relies on sound vs. sight, the researchers used a mathematical tool called Shapley Values.

  • The Analogy: Imagine a group of friends (the AI) trying to guess a secret word. Some friends have the audio clue, others have the video clue. To see who actually helped the most, you play a game where you remove one friend at a time. If you remove the "Audio Friend" and the group fails miserably, that friend was crucial. If you remove the "Video Friend" and the group still guesses fine, that friend wasn't doing much work.
  • The Result: Dr. SHAP-AV uses this game to calculate a "credit score" for the audio and the video for every single word the AI guesses.
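With only two "friends" (modalities), that game has an exact closed form: each modality's Shapley value is its marginal contribution averaged over the two possible orders in which the modalities can join. Here is a minimal, self-contained sketch of that computation. The function name, the toy scores, and the idea of scoring a coalition by the model's confidence in the correct word are all illustrative assumptions, not the paper's actual implementation.

```python
def shapley_two_modalities(value):
    """Exact Shapley values for a two-player game with players
    'audio' and 'visual'. `value` maps a frozenset of available
    modalities to the model's score (e.g. its confidence in the
    correct word) when only those modalities are present.
    Illustrative sketch; not the paper's actual code."""
    empty = value[frozenset()]
    a = value[frozenset({"audio"})]
    v = value[frozenset({"visual"})]
    both = value[frozenset({"audio", "visual"})]
    # Average each modality's marginal contribution over both join orders:
    # joining first (vs. nothing) and joining second (vs. the other modality).
    phi_audio = 0.5 * ((a - empty) + (both - v))
    phi_visual = 0.5 * ((v - empty) + (both - a))
    return phi_audio, phi_visual

# Toy scores: the model does well with audio alone, worse with video alone.
scores = {
    frozenset(): 0.1,
    frozenset({"audio"}): 0.8,
    frozenset({"visual"}): 0.4,
    frozenset({"audio", "visual"}): 0.9,
}
phi_a, phi_v = shapley_two_modalities(scores)  # audio earns most of the credit
```

A nice property of this "credit score" is that the two values always sum to exactly the gap between the full model and the empty model, so no credit is invented or lost.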

2. The Big Discovery: The "Stubborn Ear"

The researchers tested six different AI models (some based on giant language models like Llama, others on specialized speech models) under different noise levels, from a quiet room to a rock concert.

  • What they expected: They thought that when the noise got really bad (like -10 dB), the AI would switch off its "ears" and rely 100% on its "eyes" (lip reading), just like a human would.
  • What they found: The AI is surprisingly stubborn. Even when the audio is terrible, audio still receives a large share of the credit, roughly 40-50% of the total attribution.
  • The Metaphor: It's like a detective who is blindfolded in a hurricane but refuses to take off the blindfold, insisting, "I can still hear something!" even though the wind is screaming. The AI has a built-in bias toward audio, and it doesn't know how to let go of it easily, even when the audio is garbage.
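The "40-50%" figure above is a relative share: audio's attribution divided by the total attribution across both modalities. A minimal sketch of that normalization, with made-up numbers; taking absolute values to guard against negative attributions is my assumption, not necessarily the paper's exact choice:

```python
def audio_share(phi_audio, phi_visual):
    """Audio's relative share of the total modality attribution.
    Absolute values guard against negative Shapley values; this
    detail is an illustrative assumption, not the paper's exact
    normalisation."""
    total = abs(phi_audio) + abs(phi_visual)
    return abs(phi_audio) / total if total else 0.5

# Made-up attributions at two noise levels (illustrative only):
clean = audio_share(0.9, 0.1)    # quiet room: audio dominates
noisy = audio_share(0.45, 0.55)  # heavy noise: audio's share drops, but only to ~45%
```

The "stubborn ear" finding is exactly this pattern: the share moves with noise, but never collapses toward zero.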

3. The Dynamic Dance: How the AI Thinks While Speaking

The researchers also looked at when the AI uses these clues. Does it use them all at once, or does the strategy change as it types out the sentence?

  • The Finding: Some models (like Whisper-Flamingo) start by looking at the lips to get a foothold, but as they get further into the sentence, they start trusting the sound more again. It's like a hiker who starts by looking at the map (visual) to find the path, but once they know the terrain, they start listening to the birds (audio) to navigate.
  • The Contrast: Other models (like AV-HuBERT) are very consistent. They keep a perfect 50/50 balance the whole time, never wavering.

4. The "Time Travel" Check

The researchers checked if the AI was paying attention to the right moments. If the person says "Hello" at the start of the video, does the AI use the lip movement from the start of the video to guess "Hello"?

  • The Finding: Yes! Even in terrible noise, the AI maintains a perfect "time alignment." The early parts of the video match the early words, and the late parts match the late words. The AI isn't just guessing randomly; it's respecting the timeline of the conversation.
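One simple way to quantify this "respecting the timeline" behavior is to correlate each output token's position with the input frame that receives its highest attribution: a correlation near 1 means the model reads the signal left to right, in step with what it writes. This is an illustrative diagnostic of my own construction, not necessarily the paper's exact metric:

```python
def alignment_score(peak_frames):
    """Pearson correlation between each output token's index and the
    input frame receiving its highest attribution. Near 1.0 means the
    early parts of the input explain the early words, and so on.
    Illustrative diagnostic; the paper's exact check may differ."""
    n = len(peak_frames)
    tokens = list(range(n))
    mean_t = sum(tokens) / n
    mean_f = sum(peak_frames) / n
    cov = sum((t - mean_t) * (f - mean_f) for t, f in zip(tokens, peak_frames))
    std_t = sum((t - mean_t) ** 2 for t in tokens) ** 0.5
    std_f = sum((f - mean_f) ** 2 for f in peak_frames) ** 0.5
    return cov / (std_t * std_f)

# Each successive token's attribution peaks at a later frame: strong alignment.
aligned = alignment_score([3, 10, 18, 25, 31])
# A scrambled attribution pattern would score near zero instead.
scrambled = alignment_score([25, 3, 31, 10, 18])
```

The finding above corresponds to the aligned case holding up even under heavy noise.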

5. What Changes the AI's Mind?

The team tested many factors to see what makes the AI switch between listening and watching:

  • Type of Noise: Does it matter if the noise is music, traffic, or other people talking? Yes, but only a little.
  • How Hard the Sentence Is: If the sentence is very difficult to understand, does the AI switch strategies? No.
  • The Severity of the Noise (SNR): This is the only thing that really matters. SNR, the signal-to-noise ratio, measures how loud the speech is relative to the noise; the lower the SNR, the more the noise drowns out the voice, and the more the AI looks at the lips. But as we saw, it never stops listening completely.

The Takeaway: Why This Matters

The main conclusion is that current AI models are bad at knowing when to switch off the audio. They are "audio-biased." Even when the sound is useless, they keep trying to use it, which might actually hurt their performance.

The Solution? The authors suggest we need to build "smart switches" into these AIs. Just like a human would cover their ears and focus entirely on the lips in a loud bar, the AI needs a mechanism to say, "Okay, the audio is broken; I'm going to trust the video 100% now."

In short: Dr. SHAP-AV is a report card for AI detectives. It tells us that while they are getting better at using both clues, they are still too stubborn about listening to the noise, and they need to learn how to trust their eyes more when the ears are failing.