Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

This paper introduces Dr. SHAP-AV, a Shapley value-based framework that reveals audio-visual speech recognition models exhibit a persistent audio bias and dynamically shift toward visual reliance under noise, providing critical insights for developing robust modality-weighting mechanisms.

Umberto Cappellazzo, Stavros Petridis, Maja Pantic

Published Fri, 13 Ma

Imagine you are trying to understand a friend who is speaking to you through a very noisy window. You can hear their voice, but it's crackly and full of static. However, you can also see their lips moving clearly through the glass.

Most modern computers (AI models) designed to understand speech are like super-smart detectives. They have two main clues to solve the mystery of what was said: Audio (the sound) and Visual (the lip movements).

The paper itself, "Dr. SHAP-AV," is essentially a forensic investigation into how these AI detectives use their two clues. The researchers wanted to answer a simple question: "When the noise gets loud, does the AI stop listening and start watching, or does it stubbornly keep trying to listen even when it can't?"

Here is the breakdown of their findings using simple analogies.

1. The Detective's Toolkit: "Shapley Values"

To figure out how much the AI relies on sound vs. sight, the researchers used a mathematical tool called Shapley Values.

  • The Analogy: Imagine a group of friends (the AI) trying to guess a secret word. Some friends have the audio clue, others have the video clue. To see who actually helped the most, you play a game where you remove one friend at a time. If you remove the "Audio Friend" and the group fails miserably, that friend was crucial. If you remove the "Video Friend" and the group still guesses fine, that friend wasn't doing much work.
  • The Result: Dr. SHAP-AV uses this game to calculate a "credit score" for the audio and the video for every single word the AI guesses.
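With only two "friends" (modalities), that game has an exact closed form: each modality's Shapley value is its marginal contribution averaged over the two possible orders in which the modalities can join. Here is a minimal, self-contained sketch of that computation. The function name, the toy scores, and the idea of scoring a coalition by the model's confidence in the correct word are all illustrative assumptions, not the paper's actual implementation.

```python
def shapley_two_modalities(value):
    """Exact Shapley values for a two-player game with players
    'audio' and 'visual'. `value` maps a frozenset of available
    modalities to the model's score (e.g. its confidence in the
    correct word) when only those modalities are present.
    Illustrative sketch; not the paper's actual code."""
    empty = value[frozenset()]
    a = value[frozenset({"audio"})]
    v = value[frozenset({"visual"})]
    both = value[frozenset({"audio", "visual"})]
    # Average each modality's marginal contribution over both join orders:
    # joining first (vs. nothing) and joining second (vs. the other modality).
    phi_audio = 0.5 * ((a - empty) + (both - v))
    phi_visual = 0.5 * ((v - empty) + (both - a))
    return phi_audio, phi_visual

# Toy scores: the model does well with audio alone, worse with video alone.
scores = {
    frozenset(): 0.1,
    frozenset({"audio"}): 0.8,
    frozenset({"visual"}): 0.4,
    frozenset({"audio", "visual"}): 0.9,
}
phi_a, phi_v = shapley_two_modalities(scores)  # audio earns most of the credit
```

A nice property of this "credit score" is that the two values always sum to exactly the gap between the full model and the empty model, so no credit is invented or lost.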

2. The Big Discovery: The "Stubborn Ear"

The researchers tested six different AI models (some based on giant language models like Llama, others on specialized speech models) under different noise levels, from a quiet room to a rock concert.

  • What they expected: They thought that when the noise got really bad (like -10 dB), the AI would switch off its "ears" and rely 100% on its "eyes" (lip reading), just like a human would.
  • What they found: The AI is surprisingly stubborn. Even when the audio is terrible, audio still receives a large share of the credit, roughly 40-50% of the total attribution.
  • The Metaphor: It's like a detective who is blindfolded in a hurricane but refuses to take off the blindfold, insisting, "I can still hear something!" even though the wind is screaming. The AI has a built-in bias toward audio, and it doesn't know how to let go of it easily, even when the audio is garbage.
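The "40-50%" figure above is a relative share: audio's attribution divided by the total attribution across both modalities. A minimal sketch of that normalization, with made-up numbers; taking absolute values to guard against negative attributions is my assumption, not necessarily the paper's exact choice:

```python
def audio_share(phi_audio, phi_visual):
    """Audio's relative share of the total modality attribution.
    Absolute values guard against negative Shapley values; this
    detail is an illustrative assumption, not the paper's exact
    normalisation."""
    total = abs(phi_audio) + abs(phi_visual)
    return abs(phi_audio) / total if total else 0.5

# Made-up attributions at two noise levels (illustrative only):
clean = audio_share(0.9, 0.1)    # quiet room: audio dominates
noisy = audio_share(0.45, 0.55)  # heavy noise: audio's share drops, but only to ~45%
```

The "stubborn ear" finding is exactly this pattern: the share moves with noise, but never collapses toward zero.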

3. The Dynamic Dance: How the AI Thinks While Speaking

The researchers also looked at when the AI uses these clues. Does it use them all at once, or does the strategy change as it types out the sentence?

  • The Finding: Some models (like Whisper-Flamingo) start by looking at the lips to get a foothold, but as they get further into the sentence, they start trusting the sound more again. It's like a hiker who starts by looking at the map (visual) to find the path, but once they know the terrain, they start listening to the birds (audio) to navigate.
  • The Contrast: Other models (like AV-HuBERT) are very consistent. They keep a perfect 50/50 balance the whole time, never wavering.

4. The "Time Travel" Check

The researchers checked if the AI was paying attention to the right moments. If the person says "Hello" at the start of the video, does the AI use the lip movement from the start of the video to guess "Hello"?

  • The Finding: Yes! Even in terrible noise, the AI maintains a perfect "time alignment." The early parts of the video match the early words, and the late parts match the late words. The AI isn't just guessing randomly; it's respecting the timeline of the conversation.
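One simple way to quantify this "respecting the timeline" behavior is to correlate each output token's position with the input frame that receives its highest attribution: a correlation near 1 means the model reads the signal left to right, in step with what it writes. This is an illustrative diagnostic of my own construction, not necessarily the paper's exact metric:

```python
def alignment_score(peak_frames):
    """Pearson correlation between each output token's index and the
    input frame receiving its highest attribution. Near 1.0 means the
    early parts of the input explain the early words, and so on.
    Illustrative diagnostic; the paper's exact check may differ."""
    n = len(peak_frames)
    tokens = list(range(n))
    mean_t = sum(tokens) / n
    mean_f = sum(peak_frames) / n
    cov = sum((t - mean_t) * (f - mean_f) for t, f in zip(tokens, peak_frames))
    std_t = sum((t - mean_t) ** 2 for t in tokens) ** 0.5
    std_f = sum((f - mean_f) ** 2 for f in peak_frames) ** 0.5
    return cov / (std_t * std_f)

# Each successive token's attribution peaks at a later frame: strong alignment.
aligned = alignment_score([3, 10, 18, 25, 31])
# A scrambled attribution pattern would score near zero instead.
scrambled = alignment_score([25, 3, 31, 10, 18])
```

The finding above corresponds to the aligned case holding up even under heavy noise.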

5. What Changes the AI's Mind?

The team tested many factors to see what makes the AI switch between listening and watching:

  • Type of Noise: Does it matter if the noise is music, traffic, or other people talking? Yes, but only a little.
  • How Hard the Sentence Is: If the sentence is very difficult to understand, does the AI switch strategies? No.
  • The Severity of the Noise (SNR): This is the only thing that really matters. SNR, the signal-to-noise ratio, measures how loud the speech is relative to the noise; the lower the SNR, the more the noise drowns out the voice, and the more the AI looks at the lips. But as we saw, it never stops listening completely.

The Takeaway: Why This Matters

The main conclusion is that current AI models are bad at knowing when to switch off the audio. They are "audio-biased." Even when the sound is useless, they keep trying to use it, which might actually hurt their performance.

The Solution? The authors suggest we need to build "smart switches" into these AIs. Just like a human would cover their ears and focus entirely on the lips in a loud bar, the AI needs a mechanism to say, "Okay, the audio is broken; I'm going to trust the video 100% now."

In short: Dr. SHAP-AV is a report card for AI detectives. It tells us that while they are getting better at using both clues, they are still too stubborn about listening to the noise, and they need to learn how to trust their eyes more when the ears are failing.