Imagine you are training a team of detectives to solve crimes. These detectives have two special senses: Eyes (Video) and Ears (Audio).
In the real world, these detectives face two big problems:
- The "New City" Problem (Domain Shift): You train them in a sunny, quiet studio. But when you send them to a dark, noisy street, they get confused because the lighting and background noise are totally different.
- The "Budget" Problem (Few Labels): You can only afford to pay a human expert to write down the answers for a few cases. For the thousands of other cases, you have no answers (unlabeled data).
Most current AI methods fail here. Some are great at handling new cities but need a million labeled cases to learn. Others can learn from unlabeled data but get confused when the environment changes. And some only work if the detective has both eyes and ears, failing if one sense is missing.
This paper introduces a new, smarter way to train these detectives called SSMDG (Semi-Supervised Multimodal Domain Generalization). It's like a master training manual that teaches the team to learn from a few experts, adapt to new cities, and even function if one sense is temporarily blocked.
Here is how their new training camp works, using three creative analogies:
1. The "Trustworthy Team Huddle" (Consensus-Driven Consistency)
When the detectives look at a mystery case without an expert answer, they have to guess.
- The Old Way: If the "Eye" detective says "It's a cat" and the "Ear" detective says "It's a dog," the team gets confused and ignores the case.
- The New Way: The team only trusts a guess if both the fused team (Eyes + Ears) and at least one individual detective are confident and agree.
- Example: If the team says "It's a dog" and the Ear detective also says "It's a dog," they mark it as a "True Fact" and use it to learn. This ensures they don't learn from wild guesses.
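The "trustworthy huddle" above can be sketched in a few lines. This is a minimal illustration, not the paper's exact rule: the function names, the single confidence threshold `tau`, and the use of softmax probability vectors are assumptions for the sake of the example.

```python
import numpy as np

def consensus_pseudo_label(p_fused, p_video, p_audio, tau=0.9):
    """Accept a pseudo-label for an unlabeled sample only when the fused
    (Eyes + Ears) prediction is confident AND at least one individual
    modality is confident and agrees with it.

    p_fused, p_video, p_audio: softmax probability vectors over classes.
    Returns (predicted_class, accepted_flag)."""
    y_fused = int(np.argmax(p_fused))
    for p_mod in (p_video, p_audio):
        y_mod = int(np.argmax(p_mod))
        if p_fused[y_fused] >= tau and p_mod[y_mod] >= tau and y_mod == y_fused:
            return y_fused, True  # treated as a "True Fact" for training
    return y_fused, False  # too risky: no confident agreement

# The team says "dog" (class 1) and the Ear detective confidently agrees:
label, ok = consensus_pseudo_label(
    np.array([0.05, 0.95]),  # fused
    np.array([0.50, 0.50]),  # video: unsure
    np.array([0.03, 0.97]),  # audio: confident "dog"
)
```

Here only the fused model and one modality need to agree, so an unsure second modality does not block learning, while a wild guess with no confident backer is filtered out.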
2. The "Safe Learning Zone" for Confused Cases (Disagreement-Aware Regularization)
What about the cases where the Eye says "Cat" and the Ear says "Dog"? The old method would just throw these cases away.
- The New Way: The team realizes these "confused" cases are actually valuable! Instead of forcing them to agree immediately, they use a special "noise-tolerant" learning technique.
- Analogy: Imagine a teacher who knows a student is confused. Instead of punishing them for the wrong answer, the teacher says, "Okay, let's look at this again, but be gentle with the mistakes." This allows the team to learn from the messy, ambiguous data without getting corrupted by bad guesses.
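One standard way to "be gentle with the mistakes" is a noise-tolerant loss. The paper's exact formulation is not specified here, so as an illustrative stand-in this sketch uses the well-known generalized cross-entropy (GCE) loss, which interpolates between ordinary cross-entropy and the more noise-robust mean absolute error:

```python
import numpy as np

def generalized_cross_entropy(p, y, q=0.7):
    """Noise-tolerant GCE loss: (1 - p_y**q) / q.

    As q -> 0 this approaches standard cross-entropy (-log p_y);
    at q = 1 it becomes mean absolute error, which is bounded and
    therefore refuses to hand out huge penalties for labels that
    may simply be wrong (the "gentle teacher")."""
    return (1.0 - p[y] ** q) / q

# A confidently correct prediction costs almost nothing...
low = generalized_cross_entropy(np.array([0.01, 0.99]), y=1)
# ...and even a badly wrong one is capped at 1/q, unlike cross-entropy,
# which would blow up as p_y -> 0.
high = generalized_cross_entropy(np.array([0.99, 0.01]), y=1)
```

Because the penalty is bounded, the disagreement cases can stay in the training set: a noisy pseudo-label nudges the model a little instead of corrupting it.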
3. The "Universal Translator" (Cross-Modal Prototype Alignment)
To make sure the detectives work in any city (Domain Generalization) and even if one sense is broken (Missing Modality), they need a shared mental map.
- The Shared Map: The team creates "Prototypes" (mental anchors) for every type of object (e.g., a "Dog" anchor). They force the visual features of a dog and the audio features of a dog to point to the exact same spot on the map, regardless of whether they are in a studio or a street.
- The Translator: If the detective arrives in a new city and their "Ears" are covered (missing audio), the "Eyes" can use a Translator to imagine what the sound should have been.
- Analogy: It's like a blind person who can "hear" a picture because their brain has learned to translate visual shapes into sound patterns. This keeps the detective working even if data is missing.
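The shared map and the translator can be sketched as follows. This is a simplified illustration under stated assumptions: features and prototypes live in one shared embedding space, alignment is measured by cosine distance, and the "translator" is reduced to a nearest-prototype lookup (the paper describes a learned cross-modal mechanism; the function names here are hypothetical).

```python
import numpy as np

def prototype_alignment_loss(feat_video, feat_audio, label, prototypes):
    """Pull both modalities' features toward the SAME class prototype
    (the shared 'Dog' anchor), regardless of domain or modality."""
    proto = prototypes[label] / np.linalg.norm(prototypes[label])
    loss = 0.0
    for f in (feat_video, feat_audio):
        f = f / np.linalg.norm(f)
        loss += 1.0 - float(f @ proto)  # cosine distance to the anchor
    return loss

def impute_missing_audio(feat_video, prototypes):
    """If the 'Ears' are covered at test time, let the 'Eyes' pick the
    nearest shared prototype and use it as a stand-in for the missing
    audio feature -- possible only because both modalities were forced
    onto the same map during training."""
    f = feat_video / np.linalg.norm(feat_video)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return P[int(np.argmax(P @ f))]

# Two classes, two prototypes; perfectly aligned features cost nothing:
prototypes = np.eye(2)
zero_loss = prototype_alignment_loss(
    np.array([1.0, 0.0]), np.array([1.0, 0.0]), label=0, prototypes=prototypes
)
```

Minimizing this loss is what makes the imputation trick work: once visual and audio features of a class share one anchor, either modality alone can locate it.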
The Results
The authors tested this new training method on two real-world datasets (one with kitchen videos, one with human/animal/cartoon actions).
- The Outcome: Their method clearly outperformed the competition. While other methods struggled when labels were scarce or the environment changed, this new approach achieved significantly higher accuracy on the unseen target domains.
- Bonus: Even when they simulated "missing" video or audio during the test, the system recovered gracefully using its translator, whereas other methods degraded sharply.
Why This Matters
In the real world, data is messy, expensive to label, and often incomplete. This paper provides a blueprint for building AI that is robust (handles new environments), efficient (learns from few examples), and resilient (works even when parts of the data are missing). It's a giant leap toward making AI that can actually survive in the unpredictable real world, not just in a controlled lab.