CoVAE: correlated multimodal generative modeling

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to understand a person's personality. You have two different sources of information: their voice (audio) and their face (video).

In the world of Artificial Intelligence, there are tools called VAEs (Variational Autoencoders) that try to learn a "summary" of this person so the computer can understand them. The problem arises when we have multiple sources (multimodal data) and some of them are missing.

The Problem: The "Overconfident Guess"

Most current AI models work like a group of experts trying to agree on a single answer.

If you show the AI a picture of a face, it tries to guess the voice.
If you show it the voice, it tries to guess the face.

The old way these models worked was to force the face and the voice into a single, tiny "summary box" (a latent space). They would mash the two together until they became one perfect, deterministic point.

Here is the flaw: Because the model forced them to be one perfect point, it became overconfident.

If you show the AI a blurry face, it should say, "I'm not sure what the voice sounds like; it could be anything!"
But the old models say, "I know exactly what the voice sounds like!" and generate a very specific, sharp voice.
The Reality: If the data is blurry or missing, the AI should be uncertain. The old models destroy this uncertainty, making them terrible at predicting missing information or knowing when they are guessing.

The Solution: CoVAE (The "Correlated" Model)

The authors of this paper introduce CoVAE (Correlated Variational Autoencoder).

Think of CoVAE not as a group of experts forcing an agreement, but as a smart detective who understands relationships.

It Keeps Them Separate but Connected: Instead of smashing the face and voice into one tiny box, CoVAE keeps them in two separate boxes but draws a rubber band between them.
The Rubber Band (Correlation): This rubber band represents how much the face and voice usually move together.
- If the rubber band is tight (high correlation), seeing the face gives you a very good idea of the voice.
- If the rubber band is loose (low correlation), seeing the face doesn't tell you much about the voice.
Smart Uncertainty: When the AI sees a blurry face, it looks at the rubber band.
- If the band is loose, it says, "I have no idea what the voice is," and generates a fuzzy, uncertain guess.
- If the band is tight, it says, "I'm pretty sure," and generates a clearer guess.

The "Magic" of the Experiment

The researchers tested this with two types of data:

1. The "Fake" Test (Synthetic Data):
They created a computer world where they knew exactly how much the "face" and "voice" were related (e.g., 50% related).

Old Models: Even when the two sides were only 50% related, the old models acted as if they were 100% related. They made up fake, overly specific details.
CoVAE: It correctly learned the 50% relationship. When asked to guess the missing part, it gave a guess that was "fuzzy" in exactly the right way, matching the real uncertainty.

2. The "Real" Test (Medical Data):
They used real cancer data: mRNA (one type of genetic code) and miRNA (another type). These are like two different languages describing the same disease.

The goal was: "If we only have the mRNA, can we guess the miRNA?"
CoVAE was the best at this. It didn't just guess a random number; it understood the statistical link between the two genetic codes. It provided a guess that was accurate and knew how confident it should be.

The Big Picture

In simple terms, CoVAE teaches AI to be humble.

Old AI models are like arrogant students who always raise their hand and give a specific answer, even when they don't know the material. CoVAE is like a smart student who knows when to say, "I'm not 100% sure, but here is my best guess based on how these two things usually relate," and admits when the answer could be anything.

This is crucial for science and medicine, where knowing how uncertain a prediction is can be just as important as the prediction itself.

1. Problem Statement

Multimodal Variational Autoencoders (VAEs) are widely used to extract representations from data with multiple complementary modalities (e.g., images and text, or mRNA and miRNA). However, existing architectures suffer from a fundamental flaw known as the collapse of joint statistical structure.

The Core Issue: Most multimodal VAEs (e.g., Product-of-Experts, Mixture-of-Experts, or Joint Encoders) fuse different modalities into a single latent point or a diagonal distribution in the latent space.
The Consequence: When generating missing modalities based on observed ones, these models treat the relationship between modalities as deterministic. They assume maximal mutual information between modalities, regardless of the actual data correlation.
Impact on Uncertainty: This leads to severe miscalibration of uncertainty. For example, if a model observes one modality and imputes another, standard models assign the same low uncertainty to the imputed data as they do to the observed data. In reality, if the correlation is weak, the uncertainty for the missing modality should be high. This failure is critical in scientific domains (like biomedicine) where accurate uncertainty quantification is essential.
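To make the fusion step concrete, here is a minimal sketch of Product-of-Experts fusion for diagonal Gaussian experts, where densities are multiplied and the result is precision-weighted. The function name `poe_fuse` and the toy numbers are illustrative assumptions, not taken from the paper, and many PoE models also multiply in a prior expert, which is omitted here.

```python
import numpy as np

def poe_fuse(means, variances):
    """Fuse diagonal Gaussian experts by multiplying their densities.

    means, variances: array-likes of shape (n_experts, d).
    Returns the mean and variance of the (renormalized) product.
    """
    means = np.asarray(means, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)
    fused_var = 1.0 / precisions.sum(axis=0)                    # precisions add up
    fused_mean = fused_var * (precisions * means).sum(axis=0)   # precision-weighted mean
    return fused_mean, fused_var

# Two hypothetical modality encoders that disagree about a 1-D latent.
mean, var = poe_fuse(means=[[0.0], [2.0]], variances=[[1.0], [1.0]])
print(mean, var)  # fused mean 1.0, fused variance 0.5
```

The fused variance is smaller than either expert's, and every modality is then decoded from the same shared latent, so nothing in this representation records how strongly the modalities actually co-vary.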

2. Methodology: CoVAE Architecture

The authors propose Correlated Variational Autoencoders (CoVAE), a generative architecture designed to explicitly model and preserve the correlations between modalities.

Key Architectural Components

Separate Encoders: Each modality $k$ is encoded independently into a $d$-dimensional latent space $z_k$ using a standard VAE encoder with a diagonal covariance.
Concatenated Latent Space: The individual latent variables are concatenated into a single vector $z \in \mathbb{R}^{dK}$.
Non-Diagonal Prior: Unlike standard VAEs that use a standard isotropic Gaussian prior $N(0, I)$, CoVAE employs a multivariate Gaussian prior $p(z) = N(0, \Sigma_{prior})$ with a non-diagonal covariance matrix. This matrix explicitly stores the cross-modality correlations.
Joint Encoder: A joint encoder $q_\phi(z|x)$ produces a full-covariance multivariate Gaussian $N(\mu, \Sigma_{joint})$ when all modalities are present.
Conditional Inference: When a subset of modalities is missing, the model infers the latent variables for the missing modalities by sampling from the conditional distribution derived from the prior:
$z_M | z_O \sim N(\Sigma_{MO}\Sigma_{OO}^{-1}z_O, \Sigma_{MM} - \Sigma_{MO}\Sigma_{OO}^{-1}\Sigma_{OM})$
This allows the model to correctly propagate uncertainty: if the correlation is low, the conditional variance remains high.
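The conditional formula above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names `conditional_gaussian` and `toy_prior` are hypothetical, and the toy prior assumes two modalities whose latent coordinates are paired one-to-one with a single shared correlation `rho`.

```python
import numpy as np

def conditional_gaussian(sigma, obs_idx, mis_idx, z_obs):
    """Mean and covariance of z_M | z_O under a zero-mean prior N(0, sigma)."""
    s_oo = sigma[np.ix_(obs_idx, obs_idx)]
    s_mo = sigma[np.ix_(mis_idx, obs_idx)]
    s_mm = sigma[np.ix_(mis_idx, mis_idx)]
    # gain = S_MO @ inv(S_OO), computed with a linear solve (S_OO is symmetric).
    gain = np.linalg.solve(s_oo, s_mo.T).T
    mean = gain @ z_obs
    cov = s_mm - gain @ s_mo.T
    return mean, cov

def toy_prior(d, rho):
    """Assumed toy prior: unit variances, correlation rho between paired coordinates."""
    eye = np.eye(d)
    return np.block([[eye, rho * eye], [rho * eye, eye]])

d = 2
z_obs = np.array([1.0, -0.5])                       # latent of the observed modality
obs_idx, mis_idx = np.arange(d), np.arange(d, 2 * d)
rng = np.random.default_rng(0)

for rho in (0.1, 0.9):
    mean, cov = conditional_gaussian(toy_prior(d, rho), obs_idx, mis_idx, z_obs)
    z_mis = rng.multivariate_normal(mean, cov)      # one sample of the missing latent
    print(rho, mean, np.diag(cov))
```

For this toy prior the conditional mean is `rho * z_obs` and each conditional variance is `1 - rho**2`, so a weak correlation leaves the missing latent almost as uncertain as the prior, while a strong correlation narrows it sharply.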

Training Strategy

The training process involves two main objectives:

Joint Loss ($L_{joint}$): Maximizes the Evidence Lower Bound (ELBO), or equivalently minimizes the negative ELBO, for the joint encoder when all modalities are present.
Conditional Loss ($L_{cond}$): For each modality $k$, the model samples the missing latents $z_{-k}$ from the conditional prior given $z_k$, reconstructs all modalities, and minimizes the reconstruction loss.
Prior Learning: Instead of learning the prior covariance $\Sigma_{prior}$ end-to-end (which can be unstable), the authors propose pre-training the correlations using Deep Canonical Correlation Analysis (Deep CCA) to initialize $\Sigma_{prior}$, which is then frozen during the main training phase.
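An ELBO contains a KL term between the approximate posterior and the prior. With a full-covariance posterior and a non-diagonal zero-mean prior, that term has the standard closed form for two multivariate Gaussians, sketched below. The function name `kl_to_correlated_prior` is hypothetical, and this summary does not state how the authors actually compute the term, so treat it as one plausible way to evaluate it.

```python
import numpy as np

def kl_to_correlated_prior(mu, sigma_q, sigma_p):
    """KL( N(mu, sigma_q) || N(0, sigma_p) ) for full covariance matrices."""
    n = mu.shape[0]
    trace_term = np.trace(np.linalg.solve(sigma_p, sigma_q))   # tr(inv(S_p) @ S_q)
    maha_term = mu @ np.linalg.solve(sigma_p, mu)              # mu^T inv(S_p) mu
    _, logdet_p = np.linalg.slogdet(sigma_p)
    _, logdet_q = np.linalg.slogdet(sigma_q)
    return 0.5 * (trace_term + maha_term - n + logdet_p - logdet_q)

# Assumed toy prior over two scalar latents with correlation 0.5.
sigma_prior = np.array([[1.0, 0.5], [0.5, 1.0]])

# A posterior identical to the prior costs nothing.
print(kl_to_correlated_prior(np.zeros(2), sigma_prior, sigma_prior))

# A diagonal posterior that ignores the correlation pays a positive penalty.
print(kl_to_correlated_prior(np.zeros(2), np.eye(2), sigma_prior))
```

The second call shows why the prior matters during training: a posterior that drops the cross-modality correlation is penalized even when its mean and marginal variances match the prior.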

3. Key Contributions

Identification of the "Deterministic Collapse": The paper formally identifies that standard fusion strategies in latent space destroy the joint statistical structure, leading to overconfident generation and incorrect uncertainty estimates.
Novel Architecture: Introduction of CoVAE, which utilizes a non-diagonal Gaussian prior to encode inter-modality correlations, enabling realistic conditional generation and uncertainty quantification.
Theoretical and Empirical Validation: The authors demonstrate that CoVAE is the only architecture capable of recovering the true correlation structure of synthetic data, whereas competitors (PoE, MoE, JMVAE) fail to do so.
Real-World Application: Successful application to a complex biomedical dataset (Pan-Cancer mRNA/miRNA), showing competitive performance in classification and superior performance in conditional reconstruction tasks.

4. Experimental Results

Synthetic Datasets (MNIST Pairs)

Correlation Recovery: CoVAE was the only model capable of reconstructing data with the correct linear correlation coefficient ($\rho$). Competitors either produced maximal correlation ($\rho=1$) or constant, hyperparameter-dependent correlations regardless of the ground truth.
Uncertainty Quantification:
- In conditional generation (observing one modality, imputing the other), CoVAE correctly assigned higher uncertainty (wider posterior) to the missing modality when correlations were low.
- Competitors assigned the same low uncertainty to both observed and missing modalities, failing to reflect the lack of information.
Visual Quality: At intermediate correlations, CoVAE generated "fainter" but more recognizable digits, reflecting the true uncertainty, whereas other models generated sharp but often incorrect digits.

Biomedical Dataset (TCGA Pan-Cancer)

Data: 8,314 samples with paired mRNA and miRNA features.
Correlation: CoVAE learned a strong prior correlation ($\rho = 0.78$) between the latent representations of the two modalities.
Classification: CoVAE achieved competitive results in cancer type classification (Precision, NMI, ARI), comparable to Product-of-Experts models.
Conditional Reconstruction: CoVAE outperformed most models in reconstructing missing modalities (e.g., predicting miRNA from mRNA) and achieved the lowest Negative Log-Likelihood (NLL) for conditional tasks, indicating superior statistical fidelity.
Feature-Level Fidelity: CoVAE maintained high Spearman correlations between reconstructed and true values across all settings, a metric where it tied with MoPoE and JMVAE.
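
As a rough illustration of what a prior correlation of 0.78 implies, consider a simplified case that is an assumption here rather than something reported in the paper: one observed and one missing latent coordinate, each with unit prior variance and correlation $\rho = 0.78$. The conditional variance of the missing coordinate is then

$\mathrm{Var}(z_M \mid z_O) = 1 - \rho^2 = 1 - 0.78^2 = 1 - 0.6084 = 0.3916$

so observing one modality would remove about 61% of the prior variance of the other, leaving a conditional standard deviation of roughly 0.63. The imputed latent is therefore well constrained but not treated as known exactly.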

5. Significance and Limitations

Significance:

Scientific Rigor: CoVAE addresses a critical gap in scientific generative modeling where uncertainty quantification is as important as point estimates. It prevents the "hallucination" of high-confidence data when information is missing.
Generative Fidelity: It enables the generation of synthetic multimodal data that respects the true statistical dependencies of the source data, which is crucial for downstream tasks like data augmentation or simulation.

Limitations:

Gaussian Assumption: The model assumes correlations can be modeled as a global linear correlation in a Gaussian space, which may not hold for highly non-linear real-world data.
Performance Gap: CoVAE sometimes exhibits a slightly higher NLL (worse fit) compared to simpler models on joint tasks due to the "entropic price" of maintaining a non-diagonal covariance structure.
Scalability: Theoretically, the model requires training encoders for all $2^K$ subsets of modalities, though the authors note this is manageable for small $K$ (common in practice).
Manifold Issues: In low-correlation scenarios, the conditional generation may produce samples slightly outside the data manifold (e.g., blurry digits), which is statistically correct but visually less sharp.
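
The scalability point can be made concrete by enumerating the modality subsets. The sketch below assumes that only non-empty subsets need an encoder, which gives $2^K - 1$ of them; the helper name `modality_subsets` is hypothetical.

```python
from itertools import combinations

def modality_subsets(modalities):
    """All non-empty subsets of the given modalities, smallest subsets first."""
    modalities = list(modalities)
    return [
        subset
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]

subsets = modality_subsets(["mRNA", "miRNA"])
print(subsets)  # [('mRNA',), ('miRNA',), ('mRNA', 'miRNA')]

# The count grows exponentially with the number of modalities K.
for k in (2, 3, 5, 10):
    print(k, len(modality_subsets(range(k))))  # 3, 7, 31, 1023
```

With two modalities, as in the mRNA/miRNA experiment, only three subsets exist, which is why the exponential growth is not a practical concern for small $K$.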

In conclusion, CoVAE represents a significant step forward in multimodal generative modeling by prioritizing the preservation of statistical structure and uncertainty, making it particularly suitable for high-stakes scientific applications.