Addressing Missing and Noisy Modalities in One Solution: Unified Modality-Quality Framework for Low-quality Multimodal Data

This paper proposes the Unified Modality-Quality (UMQ) framework, which jointly addresses missing and noisy modalities in multimodal affective computing through a rank-guided quality estimator, a quality enhancer leveraging cross-modal and baseline information, and a quality-aware mixture-of-experts module, thereby outperforming state-of-the-art methods across various low-quality data scenarios.

Sijie Mai, Shiqin Han, Haifeng Hu

Published 2026-03-04

Imagine you are trying to understand a movie by watching it with three friends: one who sees the visuals, one who hears the audio, and one who reads the subtitles.

In a perfect world, all three friends give you perfect information. But in the real world, things go wrong.

  • The "Missing" Friend: Sometimes, the friend who reads subtitles falls asleep or the camera breaks, so you get no information from them.
  • The "Noisy" Friend: Sometimes, the friend who hears audio is in a loud construction zone, or the friend who sees visuals is squinting through dirty glasses. They give you information, but it's garbled and confusing.

Most previous computer programs tried to fix these two problems separately. They had one team for "missing data" and a different team for "noisy data." But in real life, these problems often happen at the same time!

This paper introduces a new system called UMQ (Unified Modality-Quality). Think of UMQ as a super-smart movie director who manages a team of specialists to handle any combination of bad data, whether it's missing, noisy, or both.

Here is how UMQ works, broken down into three simple steps:

1. The Quality Inspector (The "Sniffer")

Before the director tries to fix anything, they need to know how bad the situation is.

  • The Problem: It's hard to say exactly "how bad" a piece of data is with a single number (like "this audio is 4.2 out of 10 bad").
  • The UMQ Solution: Instead of guessing an exact number, the system acts like a sports referee. It doesn't need to know the exact score; it just needs to know who is playing better.
    • Analogy: Imagine a referee comparing two runners. They don't need to know the exact speed of each runner; they just need to know, "Runner A is faster than Runner B."
    • The system compares different pieces of data and ranks them. This helps it learn to spot "bad" data without getting confused by fake or inaccurate labels.
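The "referee" idea above is essentially pairwise ranking: the estimator only needs to score a cleaner input higher than a more corrupted one. Here is a minimal sketch of that objective, assuming a simple linear scorer and hinge-style ranking loss (illustrative stand-ins; the paper's exact estimator and loss may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear quality scorer (stand-in for the paper's estimator).
W = rng.normal(size=(64,))

def quality_score(feat):
    # Maps a 64-dim modality feature to a scalar quality score.
    return feat @ W

def pairwise_rank_loss(feat_better, feat_worse, margin=0.5):
    # Hinge-style ranking loss: penalize only when the cleaner feature
    # is not scored at least `margin` higher than the corrupted one.
    # No absolute "4.2 out of 10" label is ever needed.
    diff = quality_score(feat_better) - quality_score(feat_worse)
    return np.maximum(0.0, margin - diff).mean()

x = rng.normal(size=(8, 64))                   # "clean" features
x_noisy = x + 0.8 * rng.normal(size=(8, 64))   # corrupted copies of the same samples
loss = pairwise_rank_loss(x, x_noisy)
```

Training pairs can be manufactured for free by corrupting clean samples, which is why the ranking view sidesteps the "fake or inaccurate labels" problem.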

2. The Quality Enhancer (The "Restorer")

Once the system knows which data is bad, it tries to fix it. But it doesn't just guess randomly. It uses two specific tools:

  • Tool A: The "Sample-Specific" Clue: It looks at what the other friends (modalities) are saying about this specific scene.
    • Analogy: If the audio friend is shouting nonsense, but the visual friend sees the character crying, the system knows the audio is likely wrong and uses the visual clue to guess what the audio should have been.
  • Tool B: The "Modality-Specific" Baseline: It remembers what "good" audio or "good" video usually looks like in general.
    • Analogy: Think of this as a reference library. Even if the current audio is noisy, the system knows that human voices usually have a certain rhythm. It uses this "library" of normal sounds to clean up the noisy audio.
  • The Result: The system rebuilds the bad data, making it sound and look much closer to the original, high-quality version.
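The two repair tools can be sketched as a quality-gated blend: trust the observed feature in proportion to its estimated quality, and fill the rest from a cross-modal projection (Tool A) plus a learned per-modality prototype (Tool B). All names here are illustrative assumptions, not the paper's actual components:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16

# Hypothetical ingredients (illustrative, not the paper's exact modules):
P_from_visual = rng.normal(scale=0.1, size=(D, D))  # cross-modal projection (Tool A)
audio_prototype = rng.normal(size=(D,))             # "library" of typical audio (Tool B)

def enhance_audio(audio_feat, visual_feat, quality):
    # Blend the observed (possibly noisy) audio with two repair signals,
    # trusting the observation more when its estimated quality is high.
    cross_modal_guess = visual_feat @ P_from_visual          # sample-specific clue
    repair = 0.5 * cross_modal_guess + 0.5 * audio_prototype # plus general baseline
    return quality * audio_feat + (1.0 - quality) * repair

noisy_audio = rng.normal(size=(D,))
visual = rng.normal(size=(D,))
restored = enhance_audio(noisy_audio, visual, quality=0.2)  # low quality: lean on repair
```

Note the two limiting cases of this gate: at quality 1.0 the audio passes through untouched, and at quality 0.0 (a fully missing modality) the output is built entirely from the other modalities and the baseline.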

3. The Expert Switchboard (The "Traffic Controller")

This is the most clever part. The system realizes that fixing a "missing" video is different from fixing "noisy" audio.

  • The Problem: If you have 3 types of data (Text, Audio, Video), and each can be either "Good" or "Bad" (where "Bad" covers both noisy and missing), there are 2 × 2 × 2 = 8 different scenarios (e.g., Good Text + Noisy Audio + Missing Video). A single, generic model struggles to handle all 8 scenarios well.
  • The UMQ Solution: The system uses a Mixture of Experts (MoE). Imagine a hospital with 8 different specialized doctors.
    • Doctor #1 only treats "Missing Video" cases.
    • Doctor #2 only treats "Noisy Audio" cases.
    • Doctor #3 treats "Missing Video AND Noisy Audio" cases.
  • The Routing: A smart traffic controller (the "Router") looks at the incoming data, figures out exactly which "badness combination" it is, and instantly sends it to the specific doctor who is best at fixing that exact problem.
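The router-plus-specialists pattern above is a standard mixture-of-experts layout. A minimal sketch, assuming the router gates over experts using the per-modality quality profile (expert count, soft gating, and the linear experts are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
NUM_EXPERTS = 8   # one per Good/Bad combination of 3 modalities (2**3)
D = 16

# Hypothetical experts: one small linear map each.
experts = [rng.normal(scale=0.1, size=(D, D)) for _ in range(NUM_EXPERTS)]
router_W = rng.normal(scale=0.1, size=(3, NUM_EXPERTS))

def softmax(z):
    z = z - z.max()        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def moe_forward(fused_feat, quality_scores):
    # Route based on the quality profile (text, audio, video), then mix
    # expert outputs by the router's soft assignment weights.
    gate = softmax(quality_scores @ router_W)
    out = sum(g * (fused_feat @ E) for g, E in zip(gate, experts))
    return out, gate

feat = rng.normal(size=(D,))
q = np.array([0.9, 0.1, 0.0])   # good text, noisy audio, missing video
out, gate = moe_forward(feat, q)
```

Because the gate depends on the quality profile, two samples with the same content but different "badness combinations" get sent to different specialists, which is the whole point of the traffic controller.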

Why is this a big deal?

  • Real-World Ready: Real life is messy. Cameras break, microphones pick up wind, and people talk over each other. UMQ is built specifically for this messiness.
  • Better Results: The authors tested this on many datasets (like analyzing movie clips for emotions or detecting sarcasm). UMQ consistently beat the previous best methods, even when the data was heavily damaged or incomplete.
  • One Solution for All: Instead of building a different tool for every type of error, UMQ is a single, flexible framework that adapts to whatever problem it faces.

In short: UMQ is like a highly organized rescue team. It first assesses the damage, then uses specific tools to repair the broken pieces, and finally sends the job to the exact specialist needed to finish the job perfectly. This makes computers much better at understanding the messy, imperfect real world.
