Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation

Imagine you have a super-smart robot assistant that was trained in a perfect, sunny classroom (the Source Domain) to recognize objects using both its eyes (video) and ears (audio). It's great at its job.

But then you send this robot out into the real world (test time). Suddenly the weather changes, the camera gets foggy, or the microphone picks up loud construction noise. The robot's world has shifted, and its old training no longer fits. This is the problem of Multi-Modal Test-Time Adaptation (MM-TTA): how do we help a robot learn on the fly without forgetting what it already knows?

The paper introduces a new method called DASP (Decoupling Adaptation for Stability and Plasticity). Here is how it works, explained through simple analogies.

The Problem: The "All-or-Nothing" Mistake

Previous methods tried to fix the robot by updating everything at once. Imagine the robot is wearing a pair of glasses (video) and headphones (audio).

Scenario: The glasses get smudged with mud (video is corrupted), but the headphones are crystal clear.
The Old Way: The robot tries to relearn both the glasses and the headphones simultaneously.
- Result 1: Because the headphones were already working perfectly, trying to "relearn" them just confuses the robot. It starts making mistakes on things it used to know well. This is called Negative Transfer.
- Result 2: Because the robot keeps rewiring its brain to cope with the mud, it eventually forgets how to recognize things in clean conditions. This is called Catastrophic Forgetting.

The robot is stuck in a dilemma: It needs to be Plastic (flexible enough to learn new things) but also Stable (rigid enough to keep old knowledge).

The Solution: DASP (The "Diagnose-then-Mitigate" Framework)

DASP solves this by acting like a smart mechanic who doesn't just hammer everything; they first inspect the car, then fix only the broken parts.

Step 1: Diagnosis (The "Redundancy Score")

How does the robot know which sensor is broken?

The Old Way: It looks at how "confident" the robot feels. But sometimes, a broken sensor can still feel very confident (like a person shouting confidently while giving the wrong answer).
The DASP Way: It looks at the internal structure of the data.
- Analogy: Imagine a choir. In a healthy choir, every singer has a unique voice (low redundancy). If the choir starts singing the exact same note over and over because they are confused by noise, that's high redundancy.
- DASP checks the "choir" of the video data and the "choir" of the audio data. If one of them starts sounding repetitive (high redundancy), DASP knows, "Aha! That sensor is corrupted!" The other sensor, which is still singing beautifully, is treated as healthy.

Step 2: Mitigation (The "Asymmetric Adaptation")

Once DASP knows which sensor is broken, it uses a special two-part toolkit for each sensor:

The Stable Adapter (The Anchor): This part holds the robot's core knowledge. It's like the foundation of a house.
The Plastic Adapter (The Sponge): This part is flexible and ready to soak up new information.

Here is the magic trick:

For the Broken Sensor (The Mud on the Glasses): DASP activates the Sponge. It lets this part change and learn to see through the mud. The Anchor stays frozen so the robot doesn't forget how to see clearly in the first place.
For the Good Sensor (The Crystal Clear Headphones): DASP turns off the Sponge. It tells the Anchor, "You are doing great, just keep doing exactly what you're doing." It even adds a "safety belt" (mathematical regularization) to make sure the Anchor doesn't accidentally drift away from its original, perfect training.

Why This is a Big Deal

Think of it like a student taking a test in a noisy room.

Old Methods: The student tries to ignore the noise by shouting louder and changing their entire study strategy. They end up forgetting the math formulas they knew perfectly.
DASP: The student realizes, "The noise is only affecting my hearing, not my vision." So, they put on noise-canceling headphones (Plastic Adapter) to handle the noise, but they keep their math textbook (Stable Adapter) exactly as it is, refusing to change a single formula.

The Results

The authors tested this on video and audio datasets where they intentionally "corrupted" the data (added noise, blur, etc.).

DASP outperformed the methods it was compared against (Tent, EATA, SAR, READ, and TSA).
It prevented the robot from forgetting its original skills (Stability).
It allowed the robot to learn the new, messy environment quickly (Plasticity).
It did all this while keeping speed and memory use comparable to, or better than, existing methods.

Summary

DASP is a smart system that:

Detects which part of the robot's senses is broken by looking for "repetitive confusion" in the data.
Fixes only the broken part using a flexible "learning sponge."
Protects the healthy parts by switching off their sponge and keeping their "knowledge anchors" tied closely to the original training.

It's the difference between trying to rebuild your whole house because one window is cracked, versus just replacing that one window while keeping the rest of the house strong and safe.

1. Problem Statement

The paper addresses the challenge of Multi-Modal Test-Time Adaptation (MM-TTA), where pre-trained multi-modal models (e.g., audio-video) must adapt to evolving, unseen test-time distributions without access to source data or ground truth labels.

Existing methods face a fundamental stability-plasticity dilemma in this context:

Negative Transfer: Current "modality-agnostic" strategies adapt all modalities indiscriminately. This often harms "unbiased" (clean) modalities by overfitting to noise in the unsupervised signal, degrading overall performance.
Catastrophic Forgetting: Continuous parameter updates to adapt to a "biased" (corrupted) modality often overwrite knowledge learned from the source domain, causing the model to forget how to handle the original, clean data.

The core challenge is distinguishing which modality is corrupted and adapting to it without destabilizing the clean modalities or erasing prior knowledge.

2. Methodology: DASP Framework

The authors propose DASP (Decoupling Adaptation for Stability and Plasticity), a "diagnose-then-mitigate" framework.

A. Diagnosis via Redundancy Score

Traditional metrics like entropy or confidence are unreliable for diagnosing distribution shifts in multi-modal settings because a dominant modality (e.g., video in an audio-visual task) may naturally have lower entropy than an auxiliary one, even when corrupted.

DASP introduces a non-parametric Interdimensional Redundancy Score ($R$):

Insight: Distribution shifts cause feature dimensions to become spuriously correlated (redundant) rather than independent.
Mechanism: The method calculates the normalized covariance matrix of feature embeddings within a batch. The redundancy score $R(Z)$ quantifies the sum of squared correlations between feature dimensions.
Diagnosis Rule: A modality is identified as "biased" (corrupted) if its redundancy score is significantly higher than the minimum redundancy score among all modalities in the batch. This allows the system to dynamically identify which modality requires adaptation.
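
The diagnosis step above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the score sums squared correlations over distinct pairs of dimensions only, and it stands in for "significantly higher" with a hypothetical multiplicative `margin`. The function names `redundancy_score` and `diagnose` are made up for this example, and plain Python lists are used in place of a tensor library to keep the sketch self-contained.

```python
import math
import random


def redundancy_score(features, eps=1e-8):
    """Sum of squared correlations between distinct feature dimensions.

    features: one batch, given as a list of N embeddings with D floats each.
    """
    n, d = len(features), len(features[0])
    # Standardize every dimension over the batch (zero mean, unit variance).
    columns = []
    for j in range(d):
        col = [row[j] for row in features]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        columns.append([(v - mean) / (std + eps) for v in col])
    # The correlation of dimensions i and j is the mean product of z-scores.
    score = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            corr = sum(a * b for a, b in zip(columns[i], columns[j])) / n
            score += 2.0 * corr ** 2  # the matrix is symmetric: (i, j) and (j, i)
    return score


def diagnose(scores, margin=1.5):
    """Flag a modality as biased when its score exceeds margin * the lowest score.

    The multiplicative margin is a hypothetical stand-in for "significantly higher".
    """
    lowest = min(scores.values())
    return {name: score > margin * lowest for name, score in scores.items()}


rng = random.Random(0)
# Audio: independent dimensions, so correlations stay near zero.
audio = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(256)]
# Video: one shared factor drives every dimension, so they become redundant.
video = []
for _ in range(256):
    shared = rng.gauss(0, 1)
    video.append([shared + 0.1 * rng.gauss(0, 1) for _ in range(4)])

scores = {"audio": redundancy_score(audio), "video": redundancy_score(video)}
print(diagnose(scores))
```

In this toy batch the video features share a single underlying factor, so their score is far above the audio score and only the video modality is flagged. Because the rule compares each modality against the lowest score in the same batch, it needs no source data and no labels.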

B. Mitigation via Asymmetric Adaptation

Once the biased modality is identified, DASP employs an asymmetric adaptation strategy using a decoupled adapter architecture for each modality. Each adapter is split into two components:

Plastic Adapter ($\phi_p$): High-rank structure designed to capture complex, domain-specific information.
Stable Adapter ($\phi_s$): Low-rank structure designed to learn domain-invariant, generalizable features.
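
To make the rank distinction concrete, here is one possible way to realize the two components. The summary does not specify the adapter layout, so everything here is an assumption: a residual down-projection and up-projection pair (the class name `LowRankAdapter`, the zero initialization, and the rank values 2 and 8 are all illustrative). The only property taken from the text is that $\phi_s$ has a lower rank than $\phi_p$.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class LowRankAdapter:
    """Residual adapter h + up(down(h)); the update has rank at most `rank`."""

    def __init__(self, dim, rank, rng):
        # Down-projection: dim -> rank, small random initialization.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
        # Up-projection: rank -> dim, zero-initialized so the adapter starts as identity.
        self.up = [[0.0] * rank for _ in range(dim)]

    def __call__(self, h):
        delta = matvec(self.up, matvec(self.down, h))
        return [a + b for a, b in zip(h, delta)]

    def num_params(self):
        return sum(len(row) for row in self.down) + sum(len(row) for row in self.up)


rng = random.Random(0)
dim = 16
stable = LowRankAdapter(dim, rank=2, rng=rng)   # phi_s: low rank (illustrative value)
plastic = LowRankAdapter(dim, rank=8, rng=rng)  # phi_p: higher rank (illustrative value)

h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
print(stable.num_params(), plastic.num_params())  # prints "64 256"
```

The smaller bottleneck gives the stable component fewer degrees of freedom, which limits how far it can move away from the source behavior, while the larger one gives the plastic component more capacity to absorb a shift.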

The Asymmetric Strategy:

For Biased Modalities (Requires Plasticity):
- The Plastic Adapter is activated and updated to capture the new domain shift.
- The Stable Adapter remains frozen to preserve source knowledge.
For Unbiased Modalities (Requires Stability):
- The Plastic Adapter is bypassed (inactive).
- The Stable Adapter is updated using KL Regularization ($L_{kl}$) to ensure predictions remain close to the source distribution, preventing negative transfer.
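
The routing described above amounts to a small lookup from the diagnosis to a per-modality update plan. The sketch below encodes only the four rules listed; the `AdapterPlan` type and `plan_adaptation` function are hypothetical names, and how the flags would be wired into an actual training loop is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterPlan:
    plastic_active: bool     # is phi_p part of the forward pass?
    plastic_trainable: bool  # does phi_p receive updates?
    stable_trainable: bool   # does phi_s receive updates?
    use_kl: bool             # is L_kl applied to this modality?


def plan_adaptation(is_biased):
    """Map {modality: biased?} to the asymmetric update plan for each modality."""
    plans = {}
    for modality, biased in is_biased.items():
        if biased:
            # Needs plasticity: update phi_p, keep phi_s frozen.
            plans[modality] = AdapterPlan(
                plastic_active=True, plastic_trainable=True,
                stable_trainable=False, use_kl=False,
            )
        else:
            # Needs stability: bypass phi_p, update phi_s under KL regularization.
            plans[modality] = AdapterPlan(
                plastic_active=False, plastic_trainable=False,
                stable_trainable=True, use_kl=True,
            )
    return plans


plans = plan_adaptation({"video": True, "audio": False})
for modality, plan in plans.items():
    print(modality, plan)
```

Keeping the decision in one place makes the asymmetry easy to audit: at no point are both components of the same modality trainable together.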

Optimization Objective:
The total loss combines:

Entropy Minimization ($L_{ent}$): To encourage confident predictions on the target domain.
Diversity Regularization ($L_{div}$): To prevent model collapse (predicting a single class).
KL Regularization ($L_{kl}$): Applied only to unbiased modalities to anchor them to the source domain.
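
A rough numerical sketch of how the three terms could be combined is given below. The exact formulas are not stated in this summary, so several choices are assumptions: $L_{div}$ is taken to be the negative entropy of the batch-averaged prediction (a common form of diversity regularizer), $L_{kl}$ is taken as the KL divergence from the source model's prediction to the current one, and the weights `w_div` and `w_kl` are hypothetical.

```python
import math


def entropy(p, eps=1e-12):
    """Shannon entropy of one probability vector."""
    return -sum(q * math.log(q + eps) for q in p)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two probability vectors."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))


def total_loss(probs, unbiased_pairs, w_div=1.0, w_kl=1.0):
    """Combine L_ent, L_div and L_kl for one batch.

    probs: predicted class probabilities, one list per sample.
    unbiased_pairs: (source_probs, current_probs) pairs taken from modalities
        diagnosed as unbiased; pass an empty list when there are none.
    """
    n, num_classes = len(probs), len(probs[0])
    # L_ent: mean per-sample entropy, low when each prediction is confident.
    l_ent = sum(entropy(p) for p in probs) / n
    # L_div: negative entropy of the mean prediction, low when classes are spread out.
    mean_pred = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    l_div = -entropy(mean_pred)
    # L_kl: penalize drift from the source predictions, unbiased modalities only.
    l_kl = 0.0
    if unbiased_pairs:
        l_kl = sum(kl_divergence(src, cur) for src, cur in unbiased_pairs) / len(unbiased_pairs)
    return l_ent + w_div * l_div + w_kl * l_kl


collapsed = [[1.0, 0.0], [1.0, 0.0]]  # every sample assigned to class 0
diverse = [[1.0, 0.0], [0.0, 1.0]]    # confident, but spread over both classes
print(total_loss(collapsed, []), total_loss(diverse, []))
```

Both toy batches are equally confident, so the entropy term alone cannot separate them. The diversity term gives the spread-out batch the lower loss, which is how it discourages collapse onto a single class.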

3. Key Contributions

Novel Framework: Proposes DASP, the first MM-TTA method to explicitly decouple stability and plasticity based on a diagnosis of modality bias.
Redundancy Metric: Introduces the Interdimensional Redundancy Score as a robust, source-free indicator for detecting distribution shifts, overcoming the limitations of entropy-based metrics in multi-modal settings.
Asymmetric Architecture: Designs a dual-component adapter (Stable/Plastic) with an asymmetric update mechanism that selectively updates parameters based on modality health, mitigating both negative transfer and catastrophic forgetting.
Empirical Validation: Demonstrates superior performance across episodic and continual adaptation scenarios on diverse benchmarks.

4. Experimental Results

The authors evaluated DASP on Kinetics50-C and VGGSound-C benchmarks with various audio and video corruptions (e.g., noise, blur, weather).

Episodic Adaptation: DASP outperformed State-of-the-Art (SOTA) methods (Tent, EATA, SAR, READ, TSA).
- Achieved +1.6% average accuracy improvement on Kinetics50-C (video corruption) and +5.0% on VGGSound-C (audio corruption) compared to the previous best.
Continual Adaptation: In scenarios where corruptions alternate or persist over time, DASP showed significantly better robustness against catastrophic forgetting.
- In interleaved corruption scenarios (switching between audio and video corruption), DASP achieved gains of 4.4% and 1.5% over SOTA on Kinetics50-C and VGGSound-C, respectively.
Ablation Studies:
- Removing the Stable Adapter led to negative transfer on clean modalities.
- Removing the Plastic Adapter hindered adaptation to corrupted modalities.
- Reversing the asymmetric strategy (updating stable on biased, plastic on unbiased) caused a ~6.5% performance drop, validating the specific design logic.
Efficiency: DASP maintains high inference speed and low memory overhead, comparable to or better than existing TTA methods.

5. Significance

This work offers a practical approach to the stability-plasticity dilemma in multi-modal learning. By recognizing that different modalities have different needs during distribution shifts (some need to change, others need to stay fixed), DASP moves beyond "one-size-fits-all" adaptation.

The introduction of interdimensional redundancy as a diagnostic tool offers a new perspective for analyzing feature representations in the absence of labels. The proposed asymmetric strategy sets a new standard for robust multi-modal systems operating in open-world, non-stationary environments, ensuring that models can adapt to new conditions without losing their foundational capabilities.