Inference-Time Dynamic Modality Selection for Incomplete Multimodal Classification

Imagine you are a detective trying to solve a mystery. You have a team of specialists: a Photographer, a Voice Analyst, and a Document Reader. Usually, you get all three reports, and solving the case is easy.

But in the real world, things go wrong. Sometimes the Photographer's camera breaks (missing image), or the Voice Analyst is on vacation (missing audio). You are left with incomplete information.

For a long time, AI researchers had two bad ways to handle this:

The "Ignore" Strategy: "Oh, the Photographer is missing? No problem! I'll just guess based on what the Voice Analyst says."
- The Problem: You might miss a crucial clue that only the photo could have provided. You are throwing away valuable evidence.
The "Fake It" Strategy: "The Photographer is missing? No worries! I'll use a computer program to guess what the photo looks like and pretend it's real."
- The Problem: The computer's guess might be blurry, wrong, or completely unrelated to the crime. If you use this fake photo, you might get confused and solve the wrong case. You are introducing "noise" or lies into your investigation.

This is called the "Discarding-Imputation Dilemma": Do you throw away the missing info (and lose clues), or do you try to fake it (and risk getting lied to)?

Enter DyMo: The Smart Detective

The paper introduces a new AI framework called DyMo (Dynamic Modality Selection). Instead of choosing between ignoring or faking, DyMo acts like a smart, adaptive detective who knows exactly when to trust a fake clue and when to ignore it.

Here is how DyMo works, using simple analogies:

1. The "Taste Test" (The Reward Function)

Imagine you have a bowl of soup (your current data). You are about to add a new ingredient (a "recovered" or faked photo).

Old AI: Just dumps the ingredient in. If it's rotten, the soup tastes bad.
DyMo: Before adding the ingredient, it does a "taste test." It asks: "If I add this fake photo, does my soup (my prediction) get better or worse?"
- If the fake photo makes the soup taste better (the AI becomes more confident in its answer), DyMo adds it.
- If the fake photo makes the soup taste worse (the AI gets confused), DyMo throws it away immediately.

2. The "Trust Score" (Intra-Class Similarity)

Sometimes, a fake photo looks okay, but it's actually misleading.

The Analogy: Imagine you are looking for a specific type of dog. You have a picture of a Golden Retriever. The computer generates a fake picture of a dog that looks like a Golden Retriever but is actually a wolf in a costume.
DyMo's Trick: It checks the "vibe" of the fake dog against a mental list of real Golden Retrievers it learned during training. If the fake dog feels "off" compared to the real ones, DyMo lowers its trust score and ignores it. This stops the AI from being tricked by "semantic misalignment" (things that look right but are wrong).

3. The "Iterative Selection" (Building the Puzzle Piece by Piece)

DyMo doesn't just grab all the fake clues at once. It builds the solution step-by-step.

Step 1: Look at what you have.
Step 2: Ask, "Which single missing piece, if I faked it, would help me the most?"
Step 3: Add that piece.
Step 4: Ask again, "Now that I have that, which next piece helps?"
Step 5: Stop adding pieces as soon as a new one starts to hurt your confidence.

This ensures the AI only uses the "high-quality" fake clues and discards the junk.

Why is this a big deal?

In the real world, data is messy. In medical diagnosis, a patient might be missing an MRI scan.

Old AI: Might ignore the MRI and guess based only on blood work (missing a tumor), OR it might generate a blurry, fake MRI that looks like a healthy brain, leading to a missed diagnosis.
DyMo: It tries to generate the MRI. If the generated MRI looks like a healthy brain but the blood work suggests a tumor, DyMo realizes, "Wait, this fake MRI contradicts the truth. I won't use it." It sticks to the reliable blood work. But if the fake MRI clearly shows a tumor that matches the blood work, it uses that too!

The Bottom Line

DyMo is like a smart filter for AI. It acknowledges that we can't always get perfect data. Instead of blindly trusting computer-generated guesses or blindly ignoring missing data, it dynamically tests every piece of recovered information. It only keeps the clues that make the AI smarter and throws away the ones that make it dumber.

The result? An AI that is much more reliable in the messy, imperfect real world, whether it's diagnosing diseases, recognizing faces, or analyzing marketing trends.

1. Problem Definition: The Discarding-Imputation Dilemma

Multimodal Deep Learning (MDL) often faces the challenge of incomplete data, where one or more modalities (e.g., images, text, tabular data) are missing during inference due to sensor failures or transmission errors. Existing solutions generally fall into two categories, both of which suffer from intrinsic limitations:

Recovery-Free (Discarding): These methods ignore missing modalities and rely solely on available ones. While robust, they risk discarding valuable task-relevant information if the missing modality was critical.
Recovery-Based (Imputation): These methods attempt to reconstruct missing modalities (e.g., using VAEs) and fuse them with observed data. However, reconstructions can be low-fidelity (noisy/blurry) or semantically misaligned (reconstructed content does not match the true label). Integrating such unreliable data injects noise, degrading performance.

The authors term this trade-off the "discarding-imputation dilemma": discarding loses information, while imputing risks introducing harmful noise. Current dynamic fusion methods often fail to distinguish between high-quality and low-quality/misaligned recovered modalities at inference time.

2. Methodology: The DyMo Framework

The authors propose DyMo, a novel inference-time dynamic modality selection framework designed to adaptively fuse only reliable recovered modalities. The framework consists of three core components:

A. Flexible Multimodal Architecture

Design: A transformer-based network capable of handling arbitrary subsets of input modalities.
Mechanism: It uses modality-specific encoders to extract features, concatenates them with a learnable [CLS] token, and processes them through a multimodal transformer. Missing modalities are handled via dummy tokens and attention masks to prevent distortion of the representation.
Training: The network is trained using an Incomplete Modality Simulation strategy, where random subsets of modalities are dropped during training to ensure robustness. An auxiliary Missing-Agnostic Contrastive Loss is added to enforce intra-class clustering and inter-class separation in the latent space, regardless of missing patterns.
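
The masking and modality-dropping ideas above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `simulate_missing` and `build_tokens`, the uniform sampling scheme, and the list-based feature vectors are all hypothetical, and a real system would use tensors and a transformer attention mask.

```python
import random

def simulate_missing(modalities, rng, min_keep=1):
    """Randomly choose which modalities stay available for one training sample.

    Assumption: the number kept is sampled uniformly; the source does not
    specify the sampling scheme.
    """
    k = rng.randint(min_keep, len(modalities))  # number kept, bounds inclusive
    return set(rng.sample(modalities, k))

def build_tokens(features, kept, cls_token, dummy):
    """Assemble the token sequence and a mask marking positions to attend to."""
    tokens, attend = [cls_token], [True]   # [CLS] is always visible
    for name in sorted(features):          # fixed modality order
        present = name in kept
        # A missing modality gets a dummy token and is masked out of attention.
        tokens.append(features[name] if present else dummy)
        attend.append(present)
    return tokens, attend

rng = random.Random(0)
features = {"image": [0.2, 0.9], "tabular": [0.5, 0.1], "text": [0.7, 0.3]}
kept = simulate_missing(list(features), rng)
tokens, attend = build_tokens(features, kept, cls_token=[0.0, 0.0], dummy=[0.0, 0.0])
print(sorted(kept), attend)
```

Because the dummy positions are masked, their values never influence the [CLS] representation, which is what lets one network serve every missing pattern.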

B. Multimodal Task-Relevant Information Reward (MTIR)

The core innovation is a selection algorithm that estimates the incremental task-relevant information gained by adding a recovered modality.

Theoretical Basis: The authors establish a theoretical link between the mutual information $I(Y; Z)$ of labels $Y$ and representations $Z$ and the task loss (cross-entropy). They prove that reducing the empirical test-time cross-entropy loss raises a lower bound on the task-relevant information.
Reward Function: Instead of directly estimating mutual information (which is intractable), DyMo uses the decrease in Cross-Entropy loss as a proxy for information gain.
- Positive Reward: The recovered modality reduces loss (adds relevant info).
- Zero/Negative Reward: The modality adds noise or is semantically misaligned.
Intra-Class Similarity Calibration: To handle cases where the predicted class changes but the representation distance is ambiguous, the authors introduce a calibration term ($\alpha$) based on Intra-Class Similarity (ICS). ICS measures how representative a sample is within its predicted class cluster, using class prototypes and a Bregman divergence or cosine distance. If a recovered modality moves the representation to a less representative region, the reward is down-weighted.
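
A rough sketch of the calibrated reward follows. The link to mutual information rests on the standard bound $I(Y; Z) = H(Y) - H(Y \mid Z) \ge H(Y) - \mathcal{L}_{\mathrm{CE}}$, which holds because the cross-entropy of any predictor upper-bounds the conditional entropy; the paper's exact statement may differ. Everything specific in the code is an assumption made for illustration: the pseudo-label is taken to be the class predicted after adding the candidate (the source only says no ground-truth labels are needed), ICS is taken to be a rescaled cosine similarity to the class prototype, and $\alpha$ is taken to be a clipped ratio of ICS after versus before.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[label]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def loss_decrease(logits_before, logits_after):
    """Drop in cross-entropy under a pseudo-label (assumed: prediction after fusion)."""
    pseudo = argmax(logits_after)
    return cross_entropy(logits_before, pseudo) - cross_entropy(logits_after, pseudo)

def ics(z, logits, prototypes):
    """Assumed ICS: similarity in [0, 1] between z and its predicted class prototype."""
    return (1.0 + cosine(z, prototypes[argmax(logits)])) / 2.0

def calibration(z_before, logits_before, z_after, logits_after, prototypes):
    """Assumed alpha: below 1 when the sample becomes less representative of its class."""
    before = ics(z_before, logits_before, prototypes)
    after = ics(z_after, logits_after, prototypes)
    return 1.0 if before <= 0 else min(1.0, after / before)

def calibrated_reward(z_before, logits_before, z_after, logits_after, prototypes):
    alpha = calibration(z_before, logits_before, z_after, logits_after, prototypes)
    return alpha * loss_decrease(logits_before, logits_after)

prototypes = [[1.0, 0.0], [0.0, 1.0]]  # one prototype per class, learned in training
helpful = calibrated_reward([0.8, 0.2], [1.0, 0.0], [1.0, 0.0], [3.0, 0.0], prototypes)
harmful = calibrated_reward([0.8, 0.2], [1.0, 0.0], [0.6, 0.4], [0.2, 0.0], prototypes)
print(helpful > 0, harmful < 0)
```

In this sketch a candidate that flips the prediction while landing far from the new class prototype still earns a positive raw loss decrease, but $\alpha$ shrinks it, so it ranks below candidates that agree with the learned class structure.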

C. Iterative Selection Algorithm

DyMo employs an iterative greedy selection process (Algorithm 1) at inference:

Step 1: Recover all missing modalities.
Step 2: Calculate the calibrated MTIR reward for each candidate recovered modality.
Step 3: Select the modality with the highest positive reward and integrate it.
Step 4: Remove all modalities with non-positive rewards from the candidate pool.
Step 5: Repeat until no beneficial modalities remain.
This ensures that only the most informative, reliable modalities are fused, preventing noise accumulation.
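
The greedy loop described above can be sketched as follows. This is an illustration of the listed steps rather than the paper's Algorithm 1 verbatim: `reward_fn` is a hypothetical callback standing in for the calibrated MTIR reward, modalities are represented by name only, and recovery is assumed to have already happened before the function is called.

```python
def select_modalities(observed, recovered, reward_fn):
    """Greedily pick recovered modalities whose reward is positive.

    observed:  names of modalities that are genuinely available
    recovered: names of candidate recovered (imputed) modalities
    reward_fn: hypothetical callback (selected_set, candidate) -> float
    Returns the recovered modalities that were fused, in selection order.
    """
    selected = set(observed)
    candidates = set(recovered)
    chosen = []
    while candidates:
        # Score every remaining candidate against the current fused set.
        rewards = {m: reward_fn(frozenset(selected), m) for m in candidates}
        # Discard candidates that do not help.
        candidates = {m for m in candidates if rewards[m] > 0}
        if not candidates:
            break
        # Fuse the single most beneficial candidate, then re-score the rest.
        best = max(sorted(candidates), key=rewards.__getitem__)
        selected.add(best)
        chosen.append(best)
        candidates.remove(best)
    return chosen

def toy_reward(selected, candidate):
    """Made-up rewards: 'text' becomes redundant once 'audio' has been fused."""
    if candidate == "text" and "audio" in selected:
        return -0.1
    return {"audio": 0.5, "depth": -0.2, "text": 0.1}[candidate]

print(select_modalities({"image"}, {"audio", "depth", "text"}, toy_reward))
```

Re-scoring after each fusion matters because the value of a candidate depends on what has already been fused; in the toy example, "text" looks useful at first but is dropped once "audio" is in.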

3. Key Contributions

Problem Formulation: The authors present the work as the first to explicitly identify and address the "discarding-imputation dilemma" in incomplete MDL, moving beyond the binary choice of discarding or imputing.
DyMo Framework: A novel dynamic selection framework that adaptively fuses recovered modalities based on task-relevant information gain, rather than static weights or modality-specific features.
Theoretical Insight: Derives a tractable proxy (loss reduction) for maximizing mutual information at inference time, grounded in information theory and metric learning.
Robustness: Introduces a calibration mechanism to detect and reject semantically misaligned or low-fidelity reconstructions without requiring ground-truth labels at test time.
Generalizability: The framework is agnostic to the specific recovery method (e.g., VAEs, diffusion models) and can be deployed with any modality imputation technique.
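
One way to picture the recovery-agnostic design is a narrow interface between the recovery model and the selection step. The sketch below is purely illustrative: the `Recoverer` protocol, the `recover_all` helper, and the toy `MeanRecoverer` are hypothetical names, not part of the paper or of any library, and a real recoverer would be a trained generative model.

```python
from typing import Dict, List, Protocol, Sequence

Features = Dict[str, List[float]]

class Recoverer(Protocol):
    """Hypothetical interface: anything with this method could be plugged in."""
    def recover(self, observed: Features, missing: Sequence[str]) -> Features: ...

class MeanRecoverer:
    """Toy stand-in for a generative model: element-wise mean of observed features."""
    def recover(self, observed: Features, missing: Sequence[str]) -> Features:
        dim = len(next(iter(observed.values())))
        mean = [sum(v[i] for v in observed.values()) / len(observed) for i in range(dim)]
        return {name: list(mean) for name in missing}

def recover_all(recoverer: Recoverer, observed: Features,
                all_modalities: Sequence[str]) -> Features:
    """Produce a candidate for every modality that is not observed."""
    missing = [m for m in all_modalities if m not in observed]
    return recoverer.recover(observed, missing) if missing else {}

observed = {"image": [1.0, 3.0], "text": [3.0, 5.0]}
candidates = recover_all(MeanRecoverer(), observed, ["image", "text", "tabular"])
print(candidates)
```

Under this arrangement the selection step only ever sees candidate features, so swapping one recovery model for another would not require changes downstream.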

4. Experimental Results

The authors evaluated DyMo on five datasets (PolyMNIST, MST, CelebA, Data Visual Marketing, and UK Biobank) covering natural images, medical images, text, and tabular data.

Performance: DyMo significantly outperforms state-of-the-art (SOTA) incomplete MDL methods (both recovery-based and recovery-free) as well as dynamic fusion techniques.
- PolyMNIST: Achieved 1.61% to 3.88% higher accuracy than SOTA methods under severe missing scenarios (e.g., 80% missing modalities).
- Medical Data (UKBB): Showed substantial improvements in AUC for coronary artery disease and myocardial infarction classification, particularly when tabular data was heavily missing.
Ablation Studies:
- Removing the selection mechanism (integrating all recovered modalities) led to performance drops, confirming that not all imputed data is useful.
- The calibration term and iterative selection were shown to be critical for handling misaligned reconstructions.
Robustness: DyMo maintained high performance even when paired with different recovery methods (MoPoE, MMVAE+, CMVAE) or when the reconstruction quality was artificially degraded, demonstrating its ability to filter out noise.
Visualization: t-SNE and PCA visualizations confirmed that DyMo creates a more discriminative latent space by excluding unreliable recovered modalities that would otherwise cause class overlap.

5. Significance

The paper reframes how incomplete multimodal data is handled. Instead of treating all recovered data as equally valuable or ignoring missing data entirely, DyMo introduces a dynamic, inference-time decision process that acts as a "quality filter" for imputed data.

Practical Impact: It enables the deployment of robust multimodal systems in real-world scenarios (e.g., healthcare, autonomous driving) where data completeness cannot be guaranteed and imputation quality varies.
Efficiency: The dynamic selection algorithm adds no architectural overhead and relies on a single-stage training process, making it easy to integrate into existing pipelines.
Future Direction: The authors suggest the framework can be extended to other tasks like segmentation and detection, provided the task loss can be decomposed similarly.

In summary, DyMo addresses the problem of unreliable data imputation in multimodal learning by grounding modality selection in task-relevant information gain, and the reported results show improved accuracy and robustness across the five evaluated datasets.