Imagine you are trying to understand a person's mood by watching a video of them talking. To get the full picture, you need three things:
- What they say (Text)
- How they sound (Audio)
- What their face looks like (Video)
In the real world, things rarely work perfectly. Sometimes the microphone breaks (no audio), sometimes the camera is blocked (no video), or sometimes the speech-to-text software fails (no text). This is called having "missing modalities."
For a long time, computer scientists built AI models to handle these missing pieces. They tested these models by randomly deleting data, treating every sense as equally likely to break: the microphone, the camera, and the transcript all failed at the same rate.
But in reality, that's not how it works.
- In a noisy factory, the audio might be missing 90% of the time, but the video is always clear.
- In a privacy-focused chat app, the video might be blocked, but the text is always there.
This is called Imbalanced Missing Rates (IMR). Some senses are "fragile" and break often; others are "tough" and rarely break.
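The difference is easy to see in simulation. Here is a minimal sketch of an IMR setup, where each modality is dropped with its own rate instead of one shared rate (the specific rates below are made-up examples, not values from the paper):

```python
import random

# Hypothetical per-modality missing rates -- e.g. the "noisy factory" case,
# where audio is usually gone but video never is. Illustrative values only.
MISSING_RATES = {"text": 0.1, "audio": 0.9, "video": 0.0}

def sample_availability(rates, rng=random):
    """Decide which modalities survive for one example.

    Each modality is dropped independently with its OWN rate --
    unlike the classic benchmark setup, where all share one rate.
    """
    return {m: rng.random() >= r for m, r in rates.items()}

random.seed(0)
batch = [sample_availability(MISSING_RATES) for _ in range(1000)]
audio_kept = sum(b["audio"] for b in batch) / len(batch)
print(f"audio available in {audio_kept:.0%} of samples")  # roughly 10%
```

A model trained on this stream sees audio so rarely that, as the next section explains, it can quietly stop using audio at all.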
The Problem: The "Lazy Student" Effect
The authors of the MissBench paper discovered a hidden problem. When they trained AI models with these realistic, uneven missing rates, the models started acting like a lazy student who only studies one subject.
If the Text is always available but the Audio is missing half the time, the AI learns to ignore the Audio completely. It leans 100% on the Text because that's the only reliable source it has.
- The Result: The AI might still get the right answer (high accuracy), but it has become "unfair" to the other senses. It has forgotten how to use them.
- The Danger: If you suddenly put that AI in a situation where the Text is missing (but Audio is there), the AI crashes because it never learned to listen.
The Solution: MissBench (The New Report Card)
The authors created a new testing framework called MissBench. Think of it as a new, stricter report card for AI models that doesn't just ask, "Did you get the right answer?" but also asks, "Did you use all your senses fairly?"
They introduced two new ways to grade the AI:
1. The "Fairness Score" (Modality Equity Index - MEI)
Imagine a group project where three students (Text, Audio, Video) are working together.
- High Score: Everyone contributes equally. If one person leaves, the others step up.
- Low Score: One student does all the work while the others sit on the couch.
- MissBench's Finding: Many AI models that look great on standard tests actually have a Low Fairness Score. They rely too heavily on one sense (usually Text) and ignore the others, especially when data is missing unevenly.
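To make the idea concrete, here is one simple way such a fairness score *could* be computed. This is an illustrative sketch, not the paper's actual MEI formula, and the accuracy numbers are invented: measure how much accuracy drops when each sense is removed, then compare the smallest contribution to the largest.

```python
# Illustrative only -- NOT MissBench's actual Modality Equity Index.
def equity_index(contributions):
    """Ratio of the smallest to the largest per-modality contribution.

    1.0 means every modality pulls equal weight; near 0 means one
    modality does all the work while the others sit on the couch.
    """
    vals = list(contributions.values())
    return min(vals) / max(vals)

# Hypothetical accuracy drops when each modality is ablated:
contributions = {"text": 0.30, "audio": 0.02, "video": 0.03}
print(round(equity_index(contributions), 3))  # 0.067 -> text does almost everything
```

A model can post a high overall accuracy and still score terribly on a measure like this, which is exactly the gap the benchmark is designed to expose.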
2. The "Learning Balance Score" (Modality Learning Index - MLI)
This looks at how the AI learns. Imagine the AI is a chef trying to learn a recipe.
- Balanced Learning: The chef tastes the salt, the pepper, and the garlic equally to adjust the flavor.
- Imbalanced Learning: The chef only tastes the salt because it's the only spice available. The brain stops paying attention to the pepper and garlic.
- MissBench's Finding: Under uneven conditions, the AI's "brain" (its internal math) gets hijacked by the dominant sense. It stops updating its knowledge about the missing senses, making it brittle.
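The "stops updating" part has a simple mechanical explanation. In a linear layer, the gradient on a weight is proportional to its input, so a zero-filled (missing) modality sends a zero learning signal to its own weights. The tiny hand-computed sketch below illustrates that mechanism; it is a simplification, not the paper's MLI formula, and the numbers are invented:

```python
# Illustrative sketch (not the paper's exact Modality Learning Index):
# a zeroed-out modality starves its own weights of gradient.

def weight_gradient(inputs, error):
    """Gradient of a squared-error loss w.r.t. one linear weight: dL/dw = error * input."""
    return [error * x for x in inputs]

error = 0.5                    # prediction error on one training example
text_input = [0.9, 0.4, 0.7]   # text features are present
audio_input = [0.0, 0.0, 0.0]  # audio is missing, so it was zero-filled

print(weight_gradient(text_input, error))   # [0.45, 0.2, 0.35]
print(weight_gradient(audio_input, error))  # [0.0, 0.0, 0.0] -- no learning signal
```

Run this over thousands of batches where audio is usually missing, and the audio pathway barely moves while the text pathway races ahead: the dominant sense hijacks the updates.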
Why This Matters
The paper shows that if we only look at the final score (Accuracy), we are being fooled. We might think an AI is "robust" and ready for the real world, but it's actually just a "Text-only" model wearing a disguise.
MissBench forces developers to build models that are truly robust—models that can handle a broken microphone, a blocked camera, or a missing transcript without panicking, because they have learned to value and use all their senses, even when some are missing more often than others.
In short: MissBench is a stress test that ensures AI doesn't just get the right answer by cheating (relying on one sense), but actually learns to be a well-rounded, multi-sensory thinker.