ICYM2I: The illusion of multimodal informativeness under missingness

This paper introduces ICYM2I, a framework that uses inverse probability weighting to correct the bias in estimated multimodal information gain that arises when missingness patterns shift between source and target environments.

Young Sang Choi, Vincent Jeanselme, Pierre Elias, Shalmali Joshi

Published 2026-03-03

Imagine you are a detective trying to solve a mystery. You have two main clues: a fingerprint (Modality 1) and a security camera video (Modality 2).

In a perfect world, every time a crime happens, you get both the fingerprint and the video. But in the real world, things get messy. Sometimes the camera breaks. Sometimes the suspect wears gloves, so there's no fingerprint. Sometimes, the police only look for fingerprints if the video looks suspicious.

This paper, titled "ICYM2I: The Illusion of Multimodal Informativeness Under Missingness," is about a very specific problem: How do we know which clue is actually useful if we keep losing some of them?

Here is the breakdown in simple terms:

1. The Problem: The "Perfect World" Trap

Most AI researchers train their models using only the "perfect" data: they throw away any case where a clue is missing, an approach statisticians call complete-case analysis.

  • The Analogy: Imagine a cooking show where the chef only ever cooks with fresh, perfect tomatoes. They never cook with bruised or missing tomatoes.
  • The Issue: When this chef tries to cook in a real kitchen where tomatoes might be bruised or missing, their recipe fails. They think, "Tomatoes are great!" because they only ever saw them when they were perfect. They don't realize that in the real world, the "missing tomato" situation changes everything.

The authors argue that current AI methods make a dangerous assumption: that if a clue (like a camera or a medical test) is missing, it's just random bad luck (what statisticians call "missing completely at random"). But often it's not random at all; whether a clue is recorded can depend on the other clues, or even on the value of the missing clue itself.

  • Example: In a hospital, doctors might only order an X-ray if the patient's heart rate is already weird. If you only look at the patients who got X-rays, you might think the X-ray is super important. But if you look at the whole picture, you might realize the X-ray wasn't actually adding much new info; the heart rate was the real clue all along.

2. The Solution: ICYM2I (In Case You Multimodal Missed It)

The authors created a new framework called ICYM2I. Think of it as a "Time-Traveling Accountant" for data.

When you have missing data, your current dataset is a distorted, unbalanced version of reality. ICYM2I uses a statistical trick called Inverse Probability Weighting (IPW).

  • The Metaphor: Imagine you are trying to guess the average height of everyone in a city, but you mostly measured people who could reach the top shelf of a library. Naturally, your sample is full of tall people.
  • How ICYM2I fixes it: It estimates how likely each kind of person was to be measured in the first place, then gives each person you did measure a "vote" equal to one divided by that probability. A tall person who was almost certain to be measured gets a vote of roughly 1; one of the rare short people who slipped into your sample gets a vote of 10 or 100, standing in for all the similar short people you missed.
  • By reweighting the data this way, the AI can "see" the missing people statistically, even though they aren't physically in the dataset. This lets it estimate the true value of a clue, not just its value in a biased sample.
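The library-shelf correction above can be sketched in a few lines. This is a toy illustration of inverse probability weighting, not the authors' code: we simulate a hypothetical population of heights, observe tall people more often, and reweight each observed person by 1 / P(observed).

```python
import random

random.seed(0)

# Hypothetical population of 10,000 heights (cm), uniform on [150, 200].
population = [150 + 50 * random.random() for _ in range(10_000)]
true_mean = sum(population) / len(population)

def p_observed(height):
    """Taller people are more likely to be measured (selection bias)."""
    return min(1.0, max(0.05, (height - 140) / 60))

# Biased sampling: each person appears with probability p_observed(height).
observed = [h for h in population if random.random() < p_observed(h)]

# Naive average over the biased sample skews tall.
naive_mean = sum(observed) / len(observed)

# IPW: each observed person "votes" 1 / P(observed), so a rarely measured
# short person stands in for all the similar people who were missed.
weights = [1.0 / p_observed(h) for h in observed]
ipw_mean = sum(h * w for h, w in zip(observed, weights)) / sum(weights)

print(f"true {true_mean:.1f}  naive {naive_mean:.1f}  IPW {ipw_mean:.1f}")
```

On this toy data the naive mean lands several centimetres above the truth, while the IPW mean recovers it almost exactly. The same reweighting idea, applied to information-theoretic quantities instead of an average, is what ICYM2I does.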

3. Why This Matters: The "Illusion"

The paper shows that without this correction, we get an illusion.

  • The Illusion: We might think a specific medical test (like an X-ray) is amazing because it helps the AI predict heart disease.
  • The Reality (after ICYM2I): Once we correct for the fact that doctors only order X-rays on sick patients, we realize the X-ray isn't adding much new information. The other test (the ECG) was doing all the heavy lifting!

If we don't use ICYM2I, we might waste millions of dollars collecting X-rays for everyone, thinking they are crucial, when they are actually redundant.
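The illusion can be made numerical with a small simulation (my own construction, not the paper's experiments). Here Y is the disease, X1 an ECG-like signal, and X2 an X-ray-like signal that is redundant given X1; the X-ray is mostly taken when the ECG is abnormal. We compare a plug-in estimate of the mutual information I(Y; X2) on complete cases against an IPW-corrected estimate. The direction of the bias depends on the missingness mechanism, but only the weighted estimate tracks the full-data value.

```python
import math
import random

def mutual_info(pairs, weights):
    """Weighted plug-in estimate of I(Y; X) in nats for binary pairs (y, x)."""
    total = sum(weights)
    joint = {}
    for (y, x), w in zip(pairs, weights):
        joint[(y, x)] = joint.get((y, x), 0.0) + w / total
    p_y = {v: sum(p for (y, _), p in joint.items() if y == v) for v in (0, 1)}
    p_x = {v: sum(p for (_, x), p in joint.items() if x == v) for v in (0, 1)}
    return sum(p * math.log(p / (p_y[y] * p_x[x]))
               for (y, x), p in joint.items() if p > 0)

random.seed(1)
full, complete, ipw_pairs, ipw_w = [], [], [], []
for _ in range(50_000):
    y = random.randint(0, 1)                   # disease
    x1 = y ^ (random.random() < 0.10)          # ECG: strong signal for Y
    x2 = x1 ^ (random.random() < 0.20)         # X-ray: redundant given ECG
    full.append((y, x2))
    p_obs = 0.9 if x1 else 0.1                 # X-ray mostly ordered on abnormal ECG
    if random.random() < p_obs:
        complete.append((y, x2))
        ipw_pairs.append((y, x2))
        ipw_w.append(1.0 / p_obs)

full_mi = mutual_info(full, [1.0] * len(full))        # oracle (simulation only)
cc_mi = mutual_info(complete, [1.0] * len(complete))  # naive complete-case
ipw_mi = mutual_info(ipw_pairs, ipw_w)                # IPW-corrected
print(f"full {full_mi:.3f}  complete-case {cc_mi:.3f}  IPW {ipw_mi:.3f}")
```

In the complete cases, the X-ray's apparent informativeness is badly distorted (understated in this particular setup, because selecting on the ECG washes out the ECG-mediated link between Y and X2), while the IPW estimate recovers the full-data number. ICYM2I applies the same style of correction to the information-gain quantities it studies.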

4. The Real-World Test

The authors tested this on three things:

  1. Fake Data: They created a math puzzle where they intentionally hid clues. ICYM2I correctly identified which clues were actually useful.
  2. Semi-Real Data: They took real multimodal datasets (like memes with text and images) and artificially hid parts of them. ICYM2I corrected the resulting bias.
  3. Real Medical Data: They looked at heart disease patients.
    • The Naive View: "X-rays are great! They help predict heart disease!"
    • The ICYM2I View: "Actually, X-rays are mostly just confirming what the ECG already told us. They aren't adding much unique value."

The Big Takeaway

"Just because you see a clue doesn't mean it's the most important one."

In the world of AI, if you ignore the fact that data is often missing (and why it's missing), you will build models that are biased and make bad decisions. ICYM2I is the tool that helps us see through the distortion, ensuring that when we decide to collect more data (like more X-rays or more sensors), we are actually collecting the right data, not just the data that looks good on paper.

In short: Don't trust the data you have until you account for the data you don't have. ICYM2I helps you do exactly that.
