ICYM2I: The illusion of multimodal informativeness under missingness

This paper introduces ICYM2I, a framework that uses inverse probability weighting to correct the bias in estimated multimodal information gain that arises when missingness patterns shift between source and target environments.

Young Sang Choi, Vincent Jeanselme, Pierre Elias, Shalmali Joshi

Published 2026-03-03

Imagine you are a detective trying to solve a mystery. You have two main clues: a fingerprint (Modality 1) and a security camera video (Modality 2).

In a perfect world, every time a crime happens, you get both the fingerprint and the video. But in the real world, things get messy. Sometimes the camera breaks. Sometimes the suspect wears gloves, so there's no fingerprint. Sometimes, the police only look for fingerprints if the video looks suspicious.

This paper, titled "ICYM2I: The Illusion of Multimodal Informativeness Under Missingness," is about a very specific problem: How do we know which clue is actually useful if we keep losing some of them?

Here is the breakdown in simple terms:

1. The Problem: The "Perfect World" Trap

Most AI researchers train their models using only the "perfect" data: they throw away any case where a clue is missing, an approach statisticians call complete-case analysis.

  • The Analogy: Imagine a cooking show where the chef only ever cooks with fresh, perfect tomatoes. They never cook with bruised or missing tomatoes.
  • The Issue: When this chef tries to cook in a real kitchen where tomatoes might be bruised or missing, their recipe fails. They think, "Tomatoes are great!" because they only ever saw them when they were perfect. They don't realize that in the real world, the "missing tomato" situation changes everything.

The authors argue that current AI methods make a dangerous assumption: that if a clue (like a camera or a medical test) is missing, it's just random bad luck (what statisticians call "missing completely at random"). But often it's not random at all; whether a clue is recorded can depend on the other clues, or even on the value of the missing clue itself.

  • Example: In a hospital, doctors might only order an X-ray if the patient's heart rate is already weird. If you only look at the patients who got X-rays, you might think the X-ray is super important. But if you look at the whole picture, you might realize the X-ray wasn't actually adding much new info; the heart rate was the real clue all along.

2. The Solution: ICYM2I (In Case You Multimodal Missed It)

The authors created a new framework called ICYM2I. Think of it as a "Time-Traveling Accountant" for data.

When you have missing data, your current dataset is a distorted, unbalanced version of reality. ICYM2I uses a statistical trick called Inverse Probability Weighting (IPW).

  • The Metaphor: Imagine you are trying to guess the average height of everyone in a city, but you mostly measured people who could reach the top shelf of a library. Naturally, your sample is full of tall people.
  • How ICYM2I fixes it: It estimates how likely each kind of person was to be measured in the first place, then gives each person you did measure a "vote" equal to one divided by that probability. A tall person who was almost certain to be measured gets a vote of roughly 1; one of the rare short people who slipped into your sample gets a vote of 10 or 100, standing in for all the similar short people you missed.
  • By reweighting the data this way, the AI can "see" the missing people statistically, even though they aren't physically in the dataset. This lets it estimate the true value of a clue, not just its value in a biased sample.
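The library-shelf correction above can be sketched in a few lines. This is a toy illustration of inverse probability weighting, not the authors' code: we simulate a hypothetical population of heights, observe tall people more often, and reweight each observed person by 1 / P(observed).

```python
import random

random.seed(0)

# Hypothetical population of 10,000 heights (cm), uniform on [150, 200].
population = [150 + 50 * random.random() for _ in range(10_000)]
true_mean = sum(population) / len(population)

def p_observed(height):
    """Taller people are more likely to be measured (selection bias)."""
    return min(1.0, max(0.05, (height - 140) / 60))

# Biased sampling: each person appears with probability p_observed(height).
observed = [h for h in population if random.random() < p_observed(h)]

# Naive average over the biased sample skews tall.
naive_mean = sum(observed) / len(observed)

# IPW: each observed person "votes" 1 / P(observed), so a rarely measured
# short person stands in for all the similar people who were missed.
weights = [1.0 / p_observed(h) for h in observed]
ipw_mean = sum(h * w for h, w in zip(observed, weights)) / sum(weights)

print(f"true {true_mean:.1f}  naive {naive_mean:.1f}  IPW {ipw_mean:.1f}")
```

On this toy data the naive mean lands several centimetres above the truth, while the IPW mean recovers it almost exactly. The same reweighting idea, applied to information-theoretic quantities instead of an average, is what ICYM2I does.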

3. Why This Matters: The "Illusion"

The paper shows that without this correction, we get an illusion.

  • The Illusion: We might think a specific medical test (like an X-ray) is amazing because it helps the AI predict heart disease.
  • The Reality (after ICYM2I): Once we correct for the fact that doctors only order X-rays on sick patients, we realize the X-ray isn't adding much new information. The other test (the ECG) was doing all the heavy lifting!

If we don't use ICYM2I, we might waste millions of dollars collecting X-rays for everyone, thinking they are crucial, when they are actually redundant.
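The illusion can be made numerical with a small simulation (my own construction, not the paper's experiments). Here Y is the disease, X1 an ECG-like signal, and X2 an X-ray-like signal that is redundant given X1; the X-ray is mostly taken when the ECG is abnormal. We compare a plug-in estimate of the mutual information I(Y; X2) on complete cases against an IPW-corrected estimate. The direction of the bias depends on the missingness mechanism, but only the weighted estimate tracks the full-data value.

```python
import math
import random

def mutual_info(pairs, weights):
    """Weighted plug-in estimate of I(Y; X) in nats for binary pairs (y, x)."""
    total = sum(weights)
    joint = {}
    for (y, x), w in zip(pairs, weights):
        joint[(y, x)] = joint.get((y, x), 0.0) + w / total
    p_y = {v: sum(p for (y, _), p in joint.items() if y == v) for v in (0, 1)}
    p_x = {v: sum(p for (_, x), p in joint.items() if x == v) for v in (0, 1)}
    return sum(p * math.log(p / (p_y[y] * p_x[x]))
               for (y, x), p in joint.items() if p > 0)

random.seed(1)
full, complete, ipw_pairs, ipw_w = [], [], [], []
for _ in range(50_000):
    y = random.randint(0, 1)                   # disease
    x1 = y ^ (random.random() < 0.10)          # ECG: strong signal for Y
    x2 = x1 ^ (random.random() < 0.20)         # X-ray: redundant given ECG
    full.append((y, x2))
    p_obs = 0.9 if x1 else 0.1                 # X-ray mostly ordered on abnormal ECG
    if random.random() < p_obs:
        complete.append((y, x2))
        ipw_pairs.append((y, x2))
        ipw_w.append(1.0 / p_obs)

full_mi = mutual_info(full, [1.0] * len(full))        # oracle (simulation only)
cc_mi = mutual_info(complete, [1.0] * len(complete))  # naive complete-case
ipw_mi = mutual_info(ipw_pairs, ipw_w)                # IPW-corrected
print(f"full {full_mi:.3f}  complete-case {cc_mi:.3f}  IPW {ipw_mi:.3f}")
```

In the complete cases, the X-ray's apparent informativeness is badly distorted (understated in this particular setup, because selecting on the ECG washes out the ECG-mediated link between Y and X2), while the IPW estimate recovers the full-data number. ICYM2I applies the same style of correction to the information-gain quantities it studies.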

4. The Real-World Test

The authors tested this on three things:

  1. Fake Data: They created a math puzzle where they intentionally hid clues. ICYM2I correctly identified which clues were actually useful.
  2. Semi-Real Data: They took real multimodal datasets (like memes with text and images) and artificially hid parts of them. ICYM2I corrected the resulting bias.
  3. Real Medical Data: They looked at heart disease patients.
    • The Naive View: "X-rays are great! They help predict heart disease!"
    • The ICYM2I View: "Actually, X-rays are mostly just confirming what the ECG already told us. They aren't adding much unique value."

The Big Takeaway

"Just because you see a clue doesn't mean it's the most important one."

In the world of AI, if you ignore the fact that data is often missing (and why it's missing), you will build models that are biased and make bad decisions. ICYM2I is the tool that helps us see through the distortion, ensuring that when we decide to collect more data (like more X-rays or more sensors), we are actually collecting the right data, not just the data that looks good on paper.

In short: Don't trust the data you have until you account for the data you don't have. ICYM2I helps you do exactly that.
