Addressing Missing and Noisy Modalities in One Solution: Unified Modality-Quality Framework for Low-quality Multimodal Data

This paper proposes the Unified Modality-Quality (UMQ) framework, which jointly addresses missing and noisy modalities in multimodal affective computing through a rank-guided quality estimator, a quality enhancer leveraging cross-modal and baseline information, and a quality-aware mixture-of-experts module, thereby outperforming state-of-the-art methods across various low-quality data scenarios.

Sijie Mai, Shiqin Han, Haifeng Hu

Published 2026-03-04

Imagine you are trying to understand a movie by watching it with three friends: one who sees the visuals, one who hears the audio, and one who reads the subtitles.

In a perfect world, all three friends give you perfect information. But in the real world, things go wrong.

  • The "Missing" Friend: Sometimes, the friend who reads subtitles falls asleep or the camera breaks, so you get no information from them.
  • The "Noisy" Friend: Sometimes, the friend who hears audio is in a loud construction zone, or the friend who sees visuals is squinting through dirty glasses. They give you information, but it's garbled and confusing.

Most previous computer programs tried to fix these two problems separately. They had one team for "missing data" and a different team for "noisy data." But in real life, these problems often happen at the same time!

This paper introduces a new system called UMQ (Unified Modality-Quality). Think of UMQ as a super-smart movie director who manages a team of specialists to handle any combination of bad data, whether it's missing, noisy, or both.

Here is how UMQ works, broken down into three simple steps:

1. The Quality Inspector (The "Sniffer")

Before the director tries to fix anything, they need to know how bad the situation is.

  • The Problem: It's hard to say exactly "how bad" a piece of data is with a single number (like "this audio is 4.2 out of 10 bad").
  • The UMQ Solution: Instead of guessing an exact number, the system acts like a sports referee. It doesn't need to know the exact score; it just needs to know who is playing better.
    • Analogy: Imagine a referee comparing two runners. They don't need to know the exact speed of each runner; they just need to know, "Runner A is faster than Runner B."
    • The system compares different pieces of data and ranks them. This helps it learn to spot "bad" data without getting confused by fake or inaccurate labels.
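The "referee" idea above is essentially pairwise ranking: the estimator only needs to score a cleaner input higher than a more corrupted one. Here is a minimal sketch of that objective, assuming a simple linear scorer and hinge-style ranking loss (illustrative stand-ins; the paper's exact estimator and loss may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear quality scorer (stand-in for the paper's estimator).
W = rng.normal(size=(64,))

def quality_score(feat):
    # Maps a 64-dim modality feature to a scalar quality score.
    return feat @ W

def pairwise_rank_loss(feat_better, feat_worse, margin=0.5):
    # Hinge-style ranking loss: penalize only when the cleaner feature
    # is not scored at least `margin` higher than the corrupted one.
    # No absolute "4.2 out of 10" label is ever needed.
    diff = quality_score(feat_better) - quality_score(feat_worse)
    return np.maximum(0.0, margin - diff).mean()

x = rng.normal(size=(8, 64))                   # "clean" features
x_noisy = x + 0.8 * rng.normal(size=(8, 64))   # corrupted copies of the same samples
loss = pairwise_rank_loss(x, x_noisy)
```

Training pairs can be manufactured for free by corrupting clean samples, which is why the ranking view sidesteps the "fake or inaccurate labels" problem.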

2. The Quality Enhancer (The "Restorer")

Once the system knows which data is bad, it tries to fix it. But it doesn't just guess randomly. It uses two specific tools:

  • Tool A: The "Sample-Specific" Clue: It looks at what the other friends (modalities) are saying about this specific scene.
    • Analogy: If the audio friend is shouting nonsense, but the visual friend sees the character crying, the system knows the audio is likely wrong and uses the visual clue to guess what the audio should have been.
  • Tool B: The "Modality-Specific" Baseline: It remembers what "good" audio or "good" video usually looks like in general.
    • Analogy: Think of this as a reference library. Even if the current audio is noisy, the system knows that human voices usually have a certain rhythm. It uses this "library" of normal sounds to clean up the noisy audio.
  • The Result: The system rebuilds the bad data, making it sound and look much closer to the original, high-quality version.
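The two repair tools can be sketched as a quality-gated blend: trust the observed feature in proportion to its estimated quality, and fill the rest from a cross-modal projection (Tool A) plus a learned per-modality prototype (Tool B). All names here are illustrative assumptions, not the paper's actual components:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16

# Hypothetical ingredients (illustrative, not the paper's exact modules):
P_from_visual = rng.normal(scale=0.1, size=(D, D))  # cross-modal projection (Tool A)
audio_prototype = rng.normal(size=(D,))             # "library" of typical audio (Tool B)

def enhance_audio(audio_feat, visual_feat, quality):
    # Blend the observed (possibly noisy) audio with two repair signals,
    # trusting the observation more when its estimated quality is high.
    cross_modal_guess = visual_feat @ P_from_visual          # sample-specific clue
    repair = 0.5 * cross_modal_guess + 0.5 * audio_prototype # plus general baseline
    return quality * audio_feat + (1.0 - quality) * repair

noisy_audio = rng.normal(size=(D,))
visual = rng.normal(size=(D,))
restored = enhance_audio(noisy_audio, visual, quality=0.2)  # low quality: lean on repair
```

Note the two limiting cases of this gate: at quality 1.0 the audio passes through untouched, and at quality 0.0 (a fully missing modality) the output is built entirely from the other modalities and the baseline.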

3. The Expert Switchboard (The "Traffic Controller")

This is the most clever part. The system realizes that fixing a "missing" video is different from fixing "noisy" audio.

  • The Problem: If you have 3 types of data (Text, Audio, Video), and each can be either "Good" or "Bad" (where "Bad" covers both noisy and missing), there are 2 × 2 × 2 = 8 different scenarios (e.g., Good Text + Noisy Audio + Missing Video). A single, generic model struggles to handle all 8 scenarios well.
  • The UMQ Solution: The system uses a Mixture of Experts (MoE). Imagine a hospital with 8 different specialized doctors.
    • Doctor #1 only treats "Missing Video" cases.
    • Doctor #2 only treats "Noisy Audio" cases.
    • Doctor #3 treats "Missing Video AND Noisy Audio" cases.
  • The Routing: A smart traffic controller (the "Router") looks at the incoming data, figures out exactly which "badness combination" it is, and instantly sends it to the specific doctor who is best at fixing that exact problem.
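The router-plus-specialists pattern above is a standard mixture-of-experts layout. A minimal sketch, assuming the router gates over experts using the per-modality quality profile (expert count, soft gating, and the linear experts are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
NUM_EXPERTS = 8   # one per Good/Bad combination of 3 modalities (2**3)
D = 16

# Hypothetical experts: one small linear map each.
experts = [rng.normal(scale=0.1, size=(D, D)) for _ in range(NUM_EXPERTS)]
router_W = rng.normal(scale=0.1, size=(3, NUM_EXPERTS))

def softmax(z):
    z = z - z.max()        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def moe_forward(fused_feat, quality_scores):
    # Route based on the quality profile (text, audio, video), then mix
    # expert outputs by the router's soft assignment weights.
    gate = softmax(quality_scores @ router_W)
    out = sum(g * (fused_feat @ E) for g, E in zip(gate, experts))
    return out, gate

feat = rng.normal(size=(D,))
q = np.array([0.9, 0.1, 0.0])   # good text, noisy audio, missing video
out, gate = moe_forward(feat, q)
```

Because the gate depends on the quality profile, two samples with the same content but different "badness combinations" get sent to different specialists, which is the whole point of the traffic controller.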

Why is this a big deal?

  • Real-World Ready: Real life is messy. Cameras break, microphones pick up wind, and people talk over each other. UMQ is built specifically for this messiness.
  • Better Results: The authors tested this on many datasets (like analyzing movie clips for emotions or detecting sarcasm). UMQ consistently beat the previous best methods, even when the data was heavily damaged or incomplete.
  • One Solution for All: Instead of building a different tool for every type of error, UMQ is a single, flexible framework that adapts to whatever problem it faces.

In short: UMQ is like a highly organized rescue team. It first assesses the damage, then uses specific tools to repair the broken pieces, and finally sends the job to the exact specialist needed to finish the job perfectly. This makes computers much better at understanding the messy, imperfect real world.
