Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling

The Big Problem: The "Missing Piece" Puzzle

Imagine you are a detective trying to solve a crime. Usually, you have a full set of clues: a witness statement, a fingerprint, and a security video. But in the real world, things often go wrong. Maybe the witness forgot to show up, or the security camera was broken.

Most current AI models are like detectives who refuse to work unless they have every single clue. If one piece of evidence is missing, they either stop working or try to "guess" what the missing clue looks like and pretend it's real.

The problem with guessing: If you guess the missing clue, you might guess it wrong. And if you guess it wrong, your final conclusion (the prediction) might be wrong too. Plus, you don't know how much that missing clue actually mattered. Did the fingerprint change the outcome, or could you have solved the case with just the witness statement?

The Solution: PRIMO (The "What-If" Machine)

The authors of this paper created a new AI model called PRIMO. Instead of trying to guess the exact missing piece of evidence, PRIMO asks a different question: "How much would the answer change if we had different versions of the missing clue?"

Think of PRIMO as a Time-Traveling Detective or a Simulation Engine.

How It Works (The Analogy)

Imagine you are trying to predict if a patient will get sick. You have their Age (which you always know) and their Heart Rate (which might be missing).

The "What-If" Scenarios:
Instead of guessing one specific heart rate, PRIMO imagines 100 different possible heart rates that could be true based on the patient's age.
- Scenario A: The heart rate is 60.
- Scenario B: The heart rate is 90.
- Scenario C: The heart rate is 120.
Running the Simulation:
PRIMO runs the prediction for all 100 scenarios.
- If the patient is predicted to be "Sick" in all 100 scenarios, PRIMO says: "The missing heart rate doesn't matter. We are confident."
- If the patient is "Sick" when the heart rate is 120, but "Healthy" when it's 60, PRIMO says: "Whoa! The missing heart rate changes everything. We are very uncertain."
The Result:
PRIMO gives you two things:
- A Prediction: It averages out the 100 scenarios to give you the best guess.
- A Confidence Meter: It tells you how much the missing information actually matters for this specific person.
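
The what-if loop above can be sketched in a few lines of Python. This is only a toy illustration, not PRIMO itself: the `what_if` helper, the `toy_rule` threshold of 100 bpm, and the Gaussian used to imagine heart rates are all assumptions made up for this example, and it counts hard "sick"/"healthy" votes rather than the probability averages PRIMO works with.

```python
import random

def what_if(predict, scenarios):
    """Run a classifier over many imagined values of the missing clue.

    Returns (fraction predicted sick, disagreement), where disagreement is
    0.0 when every scenario agrees and 1.0 when they split 50/50.
    """
    votes = [bool(predict(s)) for s in scenarios]
    sick_fraction = sum(votes) / len(votes)
    disagreement = 1.0 - abs(2.0 * sick_fraction - 1.0)
    return sick_fraction, disagreement

# Hypothetical rule standing in for a trained model: "sick" above 100 bpm.
def toy_rule(heart_rate):
    return heart_rate > 100

# The three scenarios from the text: 60, 90, and 120 bpm.
frac_a, chaos_a = what_if(toy_rule, [60, 90, 120])

# 100 imagined heart rates; the Gaussian here is an illustrative assumption.
rng = random.Random(0)
imagined = [rng.gauss(85, 20) for _ in range(100)]
frac_b, chaos_b = what_if(toy_rule, imagined)
print(frac_a, chaos_a, frac_b, chaos_b)
```

With the three scenarios from the text, one of three votes is "sick", so the disagreement score is high: the missing heart rate matters for this patient. If every scenario produced the same vote, the score would drop to zero.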

Why Is This Special?

Most AI models try to fill in the blank (Imputation). PRIMO tries to measure the impact (Characterization).

Old Way: "I think the missing heart rate was 85. So, the patient is sick." (If 85 is wrong, the whole thing is wrong).
PRIMO Way: "I don't know the heart rate. But if it's low, the patient is fine. If it's high, they are sick. Since I don't know, I'm going to tell you that the missing data is critical to this decision."

Real-World Examples from the Paper

The researchers tested PRIMO on three different "detective cases":

The XOR Game (Math Puzzle):
They used a simple math game where the answer depends on two numbers. Sometimes, knowing just one number was enough to solve it. Sometimes, you needed both. PRIMO correctly figured out: "For this specific math problem, the missing number doesn't matter," and "For that one, the missing number is the key."
Audio-Vision MNIST (Reading Digits):
Imagine a computer looking at a picture of a number (like "5") and listening to someone say "five."
- Sometimes the picture is blurry (missing vision).
- Sometimes the audio is static (missing sound).
  PRIMO found that for some numbers, the sound didn't matter much (you could see it clearly). But for others, the sound was the only thing that made sense of the blurry picture. It told the AI exactly when to trust the picture and when to worry about the missing sound.
MIMIC-III (Hospital Patients):
This was the big test. They looked at patient data: Static info (Age, gender, history) and Time-series info (Heart rate, blood pressure over 24 hours).
- Task 1: Predicting Cancer (Neoplasms). PRIMO found that the patient's age and history were enough. The missing heart rate data didn't change the prediction much.
- Task 2: Predicting Respiratory Failure. PRIMO found that the missing heart rate data mattered a great deal. Without it, the prediction was all over the place.
- Task 3: Predicting Death. PRIMO found that for young, healthy patients, the missing data didn't matter. But for very old patients, the missing heart rate data was the difference between predicting "Safe" and "Critical."

The "Variance" Metric (The Uncertainty Gauge)

The paper introduces a variance-based metric (written $V$), but you can think of it as a "Chaos Meter."

Low Chaos (Low Variance): The AI is calm. It says, "Even if I guess the missing data wrong, my answer stays the same."
High Chaos (High Variance): The AI is panicking. It says, "Depending on what the missing data actually is, my answer could be totally different!"

Why Should We Care?

Better Decisions: Doctors or judges can see when the AI is unsure because data is missing. They can then decide to order an extra test (like an MRI) only when the AI says, "Hey, I really need this missing piece to be sure!"
No Wasted Data: PRIMO can learn from patients who only have age data and patients who have both age and heart rate data. It doesn't throw away the incomplete records.
Understanding the AI: It stops the AI from being a "black box." It explains why a prediction might be shaky.

Summary

PRIMO is a smart AI that admits when it's missing information. Instead of blindly guessing the missing piece, it simulates many possibilities to see how much that missing piece actually changes the outcome. It tells us not just what the answer is, but how much we should trust it given the missing data. It's like a detective who knows exactly which clues are essential and which ones are just nice-to-haves.

1. Problem Statement

Multimodal Large Language Models (MLLMs) and traditional multimodal learning systems typically assume that all modalities (e.g., text, image, audio, clinical time-series) are available during both training and inference. However, in real-world scenarios—particularly in healthcare—data is often incomplete. Modalities may be missing due to asynchronous collection, high acquisition costs, or patient risk (e.g., avoiding unnecessary MRI scans).

Existing approaches to handle missing modalities generally fall into two categories, both of which have limitations:

Generative Imputation: Models (like VAEs) attempt to reconstruct the missing modality conditioned on observed data. However, optimizing for reconstruction does not guarantee better discriminative performance, as many plausible reconstructions may exist that are irrelevant to the prediction task.
Discarding Data: Some methods discard partially observed examples, training only on complete data, which leads to data inefficiency.

The Core Question: Instead of simply "filling in" the missing data, how can we quantify what the missing modality would actually change for a specific prediction? The goal is to characterize the predictive impact of a missing modality at the instance level.

2. Methodology: PRIMO

The authors propose PRIMO (Predictive Impact via Supervised Latent-Variable Modeling), a supervised latent-variable model designed to handle heterogeneous modality availability during both training and inference.

2.1 Model Architecture and Formulation

Latent Variable ( $z$ ): PRIMO models the information in the missing modality ( $x_m$ ) relevant to the target label ( $y$ ) using a continuous latent variable $z$ . It does not attempt to reconstruct $x_m$ directly but rather captures the uncertainty in $x_m$ that affects $y$ .
Conditional Independence: The model assumes $y$ is conditionally independent of $x_m$ given the observed modality $x_o$ and the latent variable $z$ .
Training Objective: PRIMO is trained end-to-end to maximize the predictive distribution $p(y | x_o)$ when $x_m$ is missing, and $p(y | x_o, x_m)$ when both are present. It optimizes Evidence Lower Bounds (ELBOs) for two cases:
1. Complete Modalities: Maximizes a lower bound on $\log p(y | x_o, x_m)$ using a variational posterior $q(z | x_o, x_m, y)$ and prior $p(z | x_o, x_m)$ .
2. Missing Modalities: Maximizes a lower bound on $\log p(y | x_o)$ using a variational posterior $q(z | x_o, y)$ and prior $p(z | x_o)$ .
Symmetry Breaking: To prevent the latent distributions from shifting arbitrarily (shift symmetry), the model anchors the prior for missing modalities to a standard normal distribution $\mathcal{N}(0, I)$ and regularizes the complete-modality prior to be close to the missing-modality prior.
Preventing Posterior Collapse: Batch normalization is applied to the posterior mean to ensure the KL divergence term remains non-zero, preventing the model from ignoring the latent variable.
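
For reference, the two bounds can be written out in the standard conditional variational form implied by the posteriors and priors listed above. This is a sketch under that assumption: the paper's actual objective may add weights or the regularizers described in the symmetry-breaking step. The likelihood term uses the conditional independence assumption, so it depends on $x_o$ and $z$ only.

$\log p(y | x_o, x_m) \ge \mathbb{E}_{q(z | x_o, x_m, y)} [\log p(y | x_o, z)] - \text{KL}(q(z | x_o, x_m, y) \,\|\, p(z | x_o, x_m))$

$\log p(y | x_o) \ge \mathbb{E}_{q(z | x_o, y)} [\log p(y | x_o, z)] - \text{KL}(q(z | x_o, y) \,\|\, p(z | x_o))$

In both cases the first term rewards accurate label prediction from the observed modality and the latent sample, while the KL term keeps the label-aware posterior close to the label-free prior that is used at inference time.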

2.2 Inference and Predictive Impact Quantification

During inference, the label $y$ is unknown. PRIMO performs the following steps:

Sampling: It draws $K$ samples of the latent variable $z$ from the learned conditional prior (either $p(z | x_o)$ if $x_m$ is missing, or $p(z | x_o, x_m)$ if available).
Marginal Prediction: The final prediction is the average of predictions across these $K$ samples: $\bar{p}(y | x_o) \approx \frac{1}{K} \sum_{k=1}^{K} p(y | x_o, z^{(k)})$ .
Variance-Based Metric ( $V$ ): To quantify the impact of the missing modality, PRIMO computes the expected Total Variation Distance (TVD) between the predictive distribution for a specific sample $z$ and the mean predictive distribution:
$V = \mathbb{E}_{z} [\text{TVD}(p(y | x_o, z), \bar{p}(y | x_o))]$
- Low $V$ : The prediction is stable regardless of how the missing modality is "completed." The observed modality is sufficient.
- High $V$ : The prediction varies significantly based on the latent completion. The missing modality is critical for this specific instance.
Plausible Labels: By clustering the output logits from the $K$ latent samples, PRIMO visualizes a set of "plausible labels." If clusters are spread across multiple classes, the missing modality creates ambiguity; if they converge on one class, the prediction is robust.
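
The inference steps above (sampling, averaging, and the TVD-based metric) can be sketched as a Monte Carlo estimate in plain Python. This is a minimal sketch, not the authors' implementation: `predictive_impact`, `toy_classifier`, and `toy_prior` are hypothetical names, the toy classifier is hand-built so that $z$ only matters when $x_o < 0$, and the standard normal sampler stands in for a learned conditional prior. The argmax tally is a simplified stand-in for the logit clustering the paper uses to find plausible labels.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def predictive_impact(classifier, x_o, sample_z, k=100):
    """Estimate the marginal prediction and V from k latent samples."""
    # One predictive distribution p(y | x_o, z) per sampled z.
    per_sample = [classifier(x_o, sample_z(x_o)) for _ in range(k)]
    n_classes = len(per_sample[0])
    # Marginal prediction: average over the k samples.
    mean = [sum(p[c] for p in per_sample) / k for c in range(n_classes)]
    # V: average TVD between each sample's prediction and the mean.
    v = sum(tvd(p, mean) for p in per_sample) / k
    # Simplified "plausible labels": count the argmax label of each sample.
    tally = {}
    for p in per_sample:
        label = max(range(n_classes), key=p.__getitem__)
        tally[label] = tally.get(label, 0) + 1
    return mean, v, tally

def toy_classifier(x_o, z):
    """Hypothetical two-class head where z only matters when x_o < 0."""
    weight = 4.0 if x_o < 0 else 0.0
    return softmax([x_o + weight * z, 0.0])

rng = random.Random(0)

def toy_prior(x_o):
    """Stand-in for a learned prior p(z | x_o); here simply N(0, 1)."""
    return rng.gauss(0.0, 1.0)

mean_pos, v_pos, tally_pos = predictive_impact(toy_classifier, 1.0, toy_prior, k=200)
mean_neg, v_neg, tally_neg = predictive_impact(toy_classifier, -1.0, toy_prior, k=200)
print("x_o > 0:", v_pos, tally_pos)
print("x_o < 0:", v_neg, tally_neg)
```

In this toy setup the positive input gives identical predictions for every latent sample, so $V$ is essentially zero and all samples agree on one label. The negative input spreads its predictions across both labels and yields a clearly larger $V$.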

3. Key Contributions

Unified Framework: PRIMO is the first method to jointly optimize a discriminative objective while supporting both complete and partially observed modalities during training and inference without discarding data.
Instance-Level Impact Analysis: Unlike previous methods that provide a single prediction, PRIMO provides a distribution of predictions and a metric ( $V$ ) to quantify how much a missing modality influences the decision for a specific instance.
No Reconstruction Required: By focusing on the latent variable relevant to the label rather than reconstructing the raw missing input, the model avoids the pitfalls of generative imputation where irrelevant variations are learned.
Diagnostic Tool: The method can be used even when all modalities are present to detect whether a multimodal model is relying on "shortcuts" (ignoring one modality) by checking whether $V_{missing}$ (the metric computed as if $x_m$ were missing) is low even when $x_m$ is available.

4. Experimental Results

The authors evaluated PRIMO on three datasets: a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III (healthcare).

Synthetic XOR Dataset:
- PRIMO matched the unimodal baseline when the second modality was missing and the multimodal baseline when both were present.
- The metric $V$ correctly identified that for $x_o < 0$ , the missing modality was critical (high $V$ ), while for $x_o > 0$ , it was irrelevant (low $V$ ).
Audio-Vision MNIST:
- PRIMO achieved accuracy comparable to unimodal baselines when a modality was missing and multimodal baselines when complete.
- Insight: Missing vision resulted in high variance ( $V$ ), indicating vision is crucial. Missing audio resulted in low variance for many instances, suggesting audio is often redundant for digit recognition in this specific setup.
- Clustering analysis showed that high- $V$ examples yielded multiple plausible label clusters, while low- $V$ examples converged on a single label.
MIMIC-III (Healthcare):
- Mortality Prediction: Static features (demographics) were often sufficient (low $V$ ), but for older patients or high-risk cases, the time-series modality became critical (high $V$ ).
- ICD-9 Neoplasms (Cancer): Static features were sufficient; time-series data added little value (low $V$ ).
- ICD-9 Respiratory Diseases: Time-series data was essential. Missing it resulted in high $V$ and ambiguous predictions, as respiratory status is dynamic.
- Bias Analysis: PRIMO's predictions remained close to the "oracle" (Bayes-optimal) predictors for both unimodal and multimodal settings, validating the model's ability to recover optimal performance boundaries.

5. Significance and Conclusion

PRIMO addresses a critical gap in multimodal learning: the inability to handle incomplete data effectively while understanding why a modality matters.

Practical Utility: In high-stakes fields like healthcare, it helps clinicians understand when a specific test (modality) is necessary for a specific patient (instance) versus when it can be safely omitted to save cost and risk.
Heterogeneity: The results highlight that modality importance is not static; it varies significantly across tasks and even across individual examples within the same dataset.
Future Direction: The paper advocates for benchmarks that include heterogeneous missingness patterns and more than two modalities to further validate these instance-level insights.

In summary, PRIMO shifts the paradigm from "imputing missing data to fill a gap" to "quantifying the uncertainty and impact of missing data," providing a principled, interpretable, and performant solution for real-world multimodal applications.