Imagine you are a multilingual tour guide who has spent years leading groups through a beautiful, sunny city (the "Source Domain"). You know every street, every landmark, and how to explain things perfectly in that specific environment.
Now, suddenly, you are dropped into a foggy, rainy, and noisy version of that same city (the "Target Domain"). To make things worse, your tour group is split into two teams: one team is wearing noise-canceling headphones (Audio modality), and the other is wearing foggy goggles (Visual modality).
The Problem: The "Confused Guide"
In the real world, AI models face this exact problem. When an AI trained on clean data meets messy, real-world data (like a video with bad audio or a blurry image), it gets confused.
Existing methods try to fix this by:
- Ignoring the bad parts: "Hey, the audio is terrible, let's just trust the video!" (But what if the video is also blurry?)
- Trying to fix everything at once: "Let's adjust the whole guide's brain to handle the rain and noise simultaneously." This often leads to a "muddled" brain where the guide forgets how to speak clearly in either language.
The paper argues that in a multimodal world (where AI sees and hears), the problem is a double whammy:
- The Shallow Shift: The raw data is noisy (foggy goggles).
- The Deep Misalignment: Because the audio and video are each corrupted in different ways, they stop "talking" to each other properly. The guide hears "dog" but sees a "cat," and gets stuck in a loop of confusion.
The Solution: BriMPR (The "Progressive Re-alignment" Strategy)
The authors propose a new method called BriMPR (Bridging Modalities via Progressive Re-alignment). Think of this as a two-step rescue plan for our confused tour guide.
Step 1: The "Personalized Earplugs and Goggles" (Prompt Tuning)
Instead of rebuilding the whole guide's brain, BriMPR gives the guide customizable, tiny accessories (called "prompts") for each sense.
- The Analogy: Imagine the guide puts on a special pair of smart goggles for the visual team and smart earplugs for the audio team. These aren't just filters; they are learnable tools that gently nudge the guide's brain to say, "Hey, even though it's foggy, remember what a 'tree' looks like in the sunny city."
- The Magic: The guide uses these accessories to calibrate each sense individually. It forces the "foggy vision" to look more like the "sunny vision" it learned in training, and the "noisy audio" to sound like the "clean audio."
- The Result: Suddenly, the guide isn't confused by the fog. The visual team and audio team are now both speaking the same "language" again, even though the environment is still weird.
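Stripped of the analogy, Step 1 comes down to prepending a small set of learnable vectors ("prompts") to each modality's input while the big pretrained encoder stays frozen. Here is a minimal sketch of that mechanic; the dimensions, the toy stand-in encoder, and all variable names are illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16        # embedding dimension (made-up size)
N_PROMPT = 4  # learnable prompt tokens per modality (made-up size)

# Frozen pretrained encoder weights: never touched during adaptation.
W_frozen = rng.standard_normal((D, D))

def frozen_encoder(tokens):
    """Toy stand-in for the pretrained encoder: frozen linear layer + mean pool."""
    return np.tanh(tokens @ W_frozen).mean(axis=0)

# One small set of prompt vectors per modality -- the only parameters
# that adaptation would update (the "smart goggles" and "smart earplugs").
prompts = {
    "audio":  rng.standard_normal((N_PROMPT, D)) * 0.02,
    "visual": rng.standard_normal((N_PROMPT, D)) * 0.02,
}

def encode_with_prompts(modality, feature_tokens):
    """Prepend the modality's prompt tokens, then run the frozen encoder."""
    augmented = np.concatenate([prompts[modality], feature_tokens], axis=0)
    return frozen_encoder(augmented)

# A corrupted 10-token visual clip, already embedded in the encoder's space.
noisy_visual = rng.standard_normal((10, D))
z = encode_with_prompts("visual", noisy_visual)
```

Note the parameter budget: only the `prompts` tensors would receive gradients, and together they hold far fewer values than even this tiny frozen encoder, which is what makes the approach cheap to tune per modality.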
Step 2: The "Blindfold Game" (Cross-Modal Interaction)
Now that the senses are calibrated, the two teams need to learn to trust each other again in this messy environment. BriMPR plays a clever game:
- The Analogy: The guide is told to blindfold the visual team (mask the video) and ask the audio team to guess the scene. Then, it blindfolds the audio team and asks the visual team to guess.
- The Twist: Because the guide has already calibrated the senses in Step 1, the "seeing" team can actually help the "hearing" team, and vice versa. They create pseudo-labels (best guesses) for each other.
- The Lesson: By forcing the teams to fill in the gaps for each other, they learn to rely on the group rather than just their own shaky senses. This strengthens the bond between the audio and video, ensuring they stay aligned even when the world gets chaotic.
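The blindfold game is essentially mutual pseudo-labeling: mask one modality, let the other modality predict, and train the masked branch toward that prediction. Here is a toy sketch of the two-way exchange under assumed details (linear classification heads, argmax pseudo-labels, cross-entropy) that stand in for, but are not, the paper's exact losses:

```python
import numpy as np

rng = np.random.default_rng(1)

D, N_CLASSES = 8, 5  # hypothetical feature size and class count

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy(probs, label):
    return -np.log(probs[label] + 1e-12)

# Stand-ins for the per-modality heads calibrated in Step 1.
W_audio = rng.standard_normal((D, N_CLASSES))
W_visual = rng.standard_normal((D, N_CLASSES))

audio_feat = rng.standard_normal(D)
visual_feat = rng.standard_normal(D)

# Direction 1: "blindfold" (mask) the visual branch. Audio alone predicts,
# and its most confident class becomes the pseudo-label.
audio_probs = softmax(audio_feat @ W_audio)
pseudo_label_for_visual = int(audio_probs.argmax())

# The visual branch is then trained to agree with that audio-derived guess.
visual_probs = softmax(visual_feat @ W_visual)
loss_visual = cross_entropy(visual_probs, pseudo_label_for_visual)

# Direction 2: mask the audio branch and let visual return the favor.
pseudo_label_for_audio = int(visual_probs.argmax())
loss_audio = cross_entropy(audio_probs, pseudo_label_for_audio)

# Minimizing both terms pulls the two modalities back into agreement.
total_alignment_loss = loss_visual + loss_audio
```

The key design point is symmetry: each branch is simultaneously teacher (when the other is masked) and student (when it is masked itself), so neither shaky sense is trusted unconditionally.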
Why This is a Big Deal
Most previous methods tried to fix the whole system at once, which often made things worse (like trying to fix a car engine while driving it at 100mph).
BriMPR is different because:
- It breaks the problem down: It fixes the audio first, then the video, then makes them work together. It's a "divide and conquer" strategy.
- It's efficient: It doesn't retrain the whole AI. It just adds those tiny, smart "accessories" (prompts) to the existing model.
- It works in the real world: The paper tested this on datasets where videos were corrupted by snow, noise, or blur, and on real-world sentiment analysis tasks. In almost every case, BriMPR outperformed the state-of-the-art methods, keeping the AI accurate even when the data was a mess.
The Bottom Line
BriMPR is like giving a confused tour guide a set of smart, adjustable tools. Instead of panicking when the weather changes, the guide uses these tools to recalibrate their senses one by one, then teaches the senses to help each other. The result? The guide can still lead the tour perfectly, even in the middle of a storm.