Imagine you are a doctor trying to diagnose a patient. To get the full picture, you ideally want a complete "medical toolkit" for every single person: an X-ray, a blood test, a psychological evaluation, and detailed history notes.
In a perfect world, every patient would have all four. But in the real world, things are messy. Some patients can't afford the expensive blood test. Others are too sick for the invasive biopsy. Some just forgot to bring their history notes.
This creates a problem for Artificial Intelligence (AI). If you train an AI on this messy data, it gets really good at diagnosing patients who have all the tools (the "common" cases), but it becomes terrible at diagnosing the patients who are missing a few tools (the "rare" cases).
This paper, REMIND, is like a new, smarter way to train that AI doctor so it doesn't forget the patients with incomplete toolkits.
Here is the breakdown using simple analogies:
1. The Problem: The "Long-Tail" Crowd
Imagine a classroom where 90% of the students have a red pen, a blue pen, and a pencil. Another 9% have only a red pen and a pencil. And a tiny group (1%) has only a pencil.
If you teach a class based on what "most" students have, the teacher will focus on the red and blue pens. The students with only a pencil get left behind. In AI terms, the "Head" groups (common data) get all the attention, while the "Tail" groups (rare, missing-data combinations) are ignored. The AI learns to be great at the common stuff but fails miserably when it sees a patient with a weird, rare combination of missing tests.
2. Why Old AI Fails: The "One-Size-Fits-All" Trap
The researchers found two main reasons why old AI methods fail here:
- The "Confused Crowd" (Gradient Inconsistency): Imagine the AI is trying to walk in a straight line. The "common" students (Head groups) are all shouting "Go North!" The "rare" students (Tail groups) are shouting "Go East!" Because there are so many "North" shouters, the AI just goes North and ignores the "East" shouters. The AI never learns how to handle the "East" direction.
- The "Wrong Recipe" (Concept Shift): This is the tricky part. The AI thinks, "If I have a blood test and an X-ray, I use Recipe A." But if a patient is missing the blood test, the AI tries to use Recipe A anyway, just with a hole in it. That doesn't work! The recipe needs to change completely based on what ingredients (modalities) you actually have. A recipe for a cake with eggs is different from a recipe for a cake without eggs.
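The "Confused Crowd" problem can be shown with a tiny numerical sketch. Everything here is illustrative: the 2-D "gradient directions" and the 90/10 group sizes are made up for the analogy, not taken from the paper:

```python
import numpy as np

# 90 "head" samples pull the model north, 10 "tail" samples pull east.
head_grads = np.tile([0.0, 1.0], (90, 1))   # "Go North!"
tail_grads = np.tile([1.0, 0.0], (10, 1))   # "Go East!"

# Plain averaging over all samples: the head crowd dominates the update.
avg = np.vstack([head_grads, tail_grads]).mean(axis=0)
print(avg)  # [0.1, 0.9] -- almost entirely "north"

# Group-balanced averaging: average within each group first, then combine.
balanced = (head_grads.mean(axis=0) + tail_grads.mean(axis=0)) / 2
print(balanced)  # [0.5, 0.5] -- both directions are heard equally
```

With plain averaging, the tail group's direction is reduced to a 10% whisper; balancing by group is the simplest fix, and Group DRO (below) is a more adaptive version of the same idea.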
3. The Solution: REMIND (The Smart Chef)
The authors propose REMIND, which stands for REthinking MultImodal learNing under high-moDality missingness. It uses two main tricks:
Trick A: The "Fairness Coach" (Group Distributionally Robust Optimization)
Instead of letting the "North" shouters drown out the "East" shouters, REMIND acts like a strict coach. It says, "I don't care how many people are shouting North. If the 'East' group is struggling, I will make them shout louder during practice."
- How it works: It mathematically forces the AI to pay extra attention to the rare, missing-data groups during training, ensuring the AI doesn't ignore them.
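A minimal sketch of the "fairness coach" idea, assuming the standard exponentiated-gradient weight update used in Group DRO. The function name, the step size `eta`, and the toy per-group losses are all assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def group_dro_weights(group_losses, weights, eta=0.1):
    """Up-weight whichever groups are currently doing worst."""
    new_w = weights * np.exp(eta * group_losses)
    return new_w / new_w.sum()  # renormalize to a distribution

# Three groups: head (low loss), middle, tail (high loss).
w = np.ones(3) / 3
losses = np.array([0.2, 0.5, 2.0])
for _ in range(50):  # simulate repeated training steps
    w = group_dro_weights(losses, w, eta=0.1)
print(np.round(w, 3))  # the struggling tail group now dominates the weights
```

The training loss then becomes the weighted sum of group losses, so the tail group effectively "shouts louder" exactly when it is struggling, regardless of how few samples it has.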
Trick B: The "Modular Kitchen" (Soft Mixture-of-Experts)
Instead of one giant brain trying to cook every meal with one recipe, REMIND builds a kitchen with 32 different expert chefs (called "Experts").
- The Shared Pool: All chefs share the same basic knowledge (like how to chop onions).
- The Smart Switch: When a patient walks in, a "Router" looks at their specific toolkit.
- Patient has everything? The Router calls Chef #5.
- Patient is missing the blood test? The Router calls Chef #12, who specializes in "No-Blood-Test" recipes.
- Patient has a weird combo? The Router calls Chef #28.
- The Special Touch: For the rare patients (the Tail groups), the Router gets a tiny "residual" note (a special instruction card) just for them, so it can fine-tune the recipe specifically for that rare situation without messing up the recipes for everyone else.
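The routing described above can be sketched roughly in code. This is a simplified stand-in, not the paper's implementation: the pool size, the group name, and the residual mechanism are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, DIM = 8, 16  # small pool for the sketch

# Shared pool of experts (the "chefs" who share one kitchen).
experts = rng.normal(size=(N_EXPERTS, DIM, DIM))
router_w = rng.normal(size=(DIM, N_EXPERTS))
# Residual "instruction cards": only rare (tail) groups get one.
residuals = {"eye_scan_only": rng.normal(scale=0.1, size=(DIM, N_EXPERTS))}

def soft_moe(x, group=None):
    """Soft mixture: every expert contributes, weighted by the router."""
    logits = x @ router_w
    if group in residuals:              # tail group: add its residual note
        logits = logits + x @ residuals[group]
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                # softmax over experts
    # Weighted blend of all expert outputs (soft routing, not top-1).
    return sum(g * (x @ e) for g, e in zip(gates, experts))

x = rng.normal(size=DIM)
out_common = soft_moe(x)                    # default routing
out_rare = soft_moe(x, group="eye_scan_only")  # tail-adjusted routing
```

The key design point: the residual only nudges the router's gating for the rare group, so tail patients get a tailored expert mix without disturbing the recipes everyone else relies on.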
4. The Result
When the researchers tested this on real medical data (like breast imaging, ICU patient records, and eye scans), REMIND didn't just do well on the common patients; it excelled on the rare, difficult cases where data was missing.
- Old AI: "I've never seen a patient with just an eye scan and no blood work. I'm going to guess randomly."
- REMIND: "Ah, a patient with just an eye scan. I have a specific expert chef trained exactly for this scenario. Let me give you a precise diagnosis."
The Takeaway
In the real world, data is rarely perfect. We can't force every patient to have every test. This paper teaches us that to build truly robust medical AI, we can't just train on the "average" patient. We need a system that respects the unique, messy combinations of real life, giving special attention to the rare cases so that no patient gets left behind.