Dual-Teacher Distillation with Subnetwork Rectification for Black-Box Domain Adaptation

Imagine you are trying to teach a new student (the Target Model) how to recognize objects in a specific room (the Target Domain). However, you have two major problems:

The Original Teacher is a Black Box: You have a brilliant expert (the Source Model) who knows the subject perfectly, but they are locked in a glass room. You can't see their notes, their brain structure, or their private data. You can only ask them questions and get their answers.
The Room is Different: The new room looks different from the one the expert was trained in. The lighting is weird, the furniture is arranged differently, and the objects might look slightly distorted. If you just ask the expert for answers, they might get confused and give you wrong advice because the environment changed.

This is the problem of Black-Box Domain Adaptation. The paper proposes a clever solution called DDSR (Dual-Teacher Distillation with Subnetwork Rectification) to solve this. Here is how it works, broken down into simple steps:

1. The "Dual-Teacher" Strategy

Instead of relying on just the locked-up expert, the authors bring in a second teacher: CLIP (a powerful AI that has learned from hundreds of millions of pictures paired with text descriptions).

The Locked Expert (Black-Box Source): Knows the specific details of the subject but gets confused by the new room's weird lighting.
The Generalist (CLIP): Has seen everything in the world. It understands the concept of a "chair" or "dog" regardless of the lighting, but it might not know the specific style of the objects in this new room.

The Magic Trick: The system doesn't just pick one teacher. It creates a Smart Mixer.

If the new room is small (few samples), the system trusts the Locked Expert more because the Generalist might be too vague.
If the new room is large (many samples), the system trusts the Generalist more because the Locked Expert is likely making mistakes due to the weird lighting.
The Result: They combine their answers to create a "Super-Pseudo-Label" (a highly reliable guess) to teach the new student.

2. The "Subnetwork" Safety Net

Here's the tricky part: Even with two teachers, the advice might still be a little noisy or wrong. If the new student tries too hard to memorize these slightly wrong answers, they will fail the real test (this is called overfitting).

To fix this, the authors introduce a Subnetwork.

Analogy: Imagine the new student is the main athlete. The Subnetwork is a training partner who is slightly different (maybe they have a slightly different running style).
The system forces the main student and the training partner to run together. If the main student starts running off a cliff (learning from bad noise), the training partner pulls them back.
This "tug-of-war" ensures the student learns the true patterns of the room rather than just memorizing the teachers' mistakes.

3. The Two-Stage Training Process

Stage One: The Boot Camp

The student learns from the "Smart Mixer" (the combined advice of the two teachers).
The "Training Partner" (Subnetwork) keeps the student honest.
As the student gets better, their own answers start to look more reliable. The system uses these improved answers to fine-tune the Generalist (CLIP), teaching it to understand the specific quirks of this new room.

Stage Two: The Final Polish

By now, the student is pretty good. The system sorts the objects the student has seen into "families" (one per class) and builds a typical example for each family (called a Prototype).
If the student is unsure about an object, the system checks: "Which family does this look most like?"
It corrects any final mistakes and gives the student a final round of practice to become an expert.

Why is this a big deal?

Most previous methods tried to guess the answers using only the Locked Expert (who was confused) or only the Generalist (who was too vague).

This paper's approach is like having a team of experts who constantly check each other's work.

It works even when you can't see the original teacher's brain (privacy-friendly).
It works even when the new environment is totally different.
The Result: The new student beats earlier black-box methods and even some methods that do have access to the original teacher's private data, which suggests that this "teamwork" approach is very effective.

In a nutshell: They built a system that combines a confused specialist and a knowledgeable generalist, uses a "training partner" to prevent bad habits, and refines the lessons over time, all without ever needing to see the original teacher's private notes.

1. Problem Statement

The paper addresses Black-Box Domain Adaptation (BBDA), a challenging scenario where:

Constraints: Neither the labeled source data nor the internal parameters/architecture of the pre-trained source model are accessible.
Input: The target model can only query the source model (black-box) to obtain predictions on unlabeled target samples.
Challenge: Direct transfer is difficult because distribution shifts cause the black-box source model to produce noisy, inaccurate predictions on target data. Existing methods often suffer from overfitting to this noisy supervision or fail to fully utilize high-level semantic priors.
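
To make the query-only constraint concrete, here is a minimal Python sketch. The names `collect_source_predictions` and `toy_black_box` are hypothetical and do not come from the paper; the toy function merely stands in for a remote API that returns class probabilities and nothing else.

```python
import math
from typing import Callable, List, Sequence

Probabilities = List[float]
BlackBox = Callable[[Sequence[float]], Probabilities]


def collect_source_predictions(
    query: BlackBox, target_samples: Sequence[Sequence[float]]
) -> List[Probabilities]:
    """Query the black-box source model once per unlabeled target sample.

    Only the returned probabilities are kept: in the BBDA setting no weights,
    gradients, intermediate features, or source data are available.
    """
    return [query(x) for x in target_samples]


def toy_black_box(x: Sequence[float]) -> Probabilities:
    """Hypothetical stand-in for a remote prediction API (softmax of the input)."""
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


target_samples = [[2.0, 0.5, 0.1], [0.2, 0.3, 1.5]]
soft_labels = collect_source_predictions(toy_black_box, target_samples)
```

Everything the target model learns from the source has to pass through `soft_labels`, which is why noise in these predictions matters so much.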

2. Methodology: DDSR Framework

The authors propose Dual-Teacher Distillation with Subnetwork Rectification (DDSR), a two-stage framework that combines the specific knowledge of the black-box source model with the general semantic knowledge of a Vision-Language Model (ViL), specifically CLIP.

Stage One: Dual-Teacher Distillation & Subnetwork Rectification

This stage focuses on generating reliable pseudo-labels and training the target model while preventing overfitting.

Dual-Teacher Knowledge Distillation:
- Teachers: The black-box source model and a pre-trained CLIP model.
- Adaptive Prediction Fusion: Instead of fixed averaging, the method adaptively fuses predictions ($\hat{y}_b$ from the source model, $\hat{y}_c$ from CLIP) based on prediction entropy (uncertainty) and target domain size ($n_t$).
  - If the target domain is large ($n_t > \tilde{n}_t$, where $\tilde{n}_t$ is a size threshold), CLIP's general semantic knowledge is weighted higher.
  - If the target domain is small ($n_t \le \tilde{n}_t$), the source model's task-specific knowledge is weighted higher (counter-intuitive but empirically validated).
- Loss Functions:
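  - KL Divergence ($L_{kd}$): Aligns the target model's predictions with the fused pseudo-labels.
  - Mixup Consistency ($L_{mix}$): Improves robustness by interpolating pairs of samples and requiring the prediction on the mixed input to agree with the same interpolation of the individual predictions.
  - Information Maximization ($L_{im}$): Encourages predictions that are individually confident and collectively diverse across classes, which prevents the model from collapsing onto a few classes.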
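
The fusion step and two of the losses above can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's implementation: the summary does not give the exact fusion formula, so the entropy-based weighting and the `size_bias` shift in `fuse_predictions` are hypothetical, and `information_maximization` uses the common "mean entropy minus marginal entropy" form, which may differ from the authors' definition.

```python
import math


def entropy(p, eps=1e-12):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(pi * math.log(pi + eps) for pi in p)


def fuse_predictions(y_b, y_c, n_t, n_threshold, size_bias=0.2):
    """Fuse source (y_b) and CLIP (y_c) predictions into one pseudo-label.

    ASSUMPTION: this weighting rule is illustrative only. Each teacher is
    weighted by its confidence (1 - normalized entropy), then the CLIP weight
    is shifted up for large target domains and down for small ones.
    """
    max_h = math.log(len(y_b))
    conf_b = 1.0 - entropy(y_b) / max_h  # low entropy -> high confidence
    conf_c = 1.0 - entropy(y_c) / max_h
    total = conf_b + conf_c
    w_c = conf_c / total if total > 1e-9 else 0.5
    w_c += size_bias if n_t > n_threshold else -size_bias
    w_c = min(1.0, max(0.0, w_c))  # keep the weight in [0, 1]
    fused = [(1.0 - w_c) * b + w_c * c for b, c in zip(y_b, y_c)]
    return fused, w_c


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred): the distillation loss against a fused pseudo-label."""
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, pred))


def information_maximization(batch):
    """Mean per-sample entropy minus entropy of the batch-mean prediction.

    Lower is better: confident individual predictions, diverse class usage.
    """
    n, k = len(batch), len(batch[0])
    mean_entropy = sum(entropy(p) for p in batch) / n
    marginal = [sum(p[j] for p in batch) / n for j in range(k)]
    return mean_entropy - entropy(marginal)


y_b = [0.7, 0.2, 0.1]  # black-box source prediction
y_c = [0.4, 0.4, 0.2]  # CLIP zero-shot prediction
fused_large, w_large = fuse_predictions(y_b, y_c, n_t=5000, n_threshold=1000)
fused_small, w_small = fuse_predictions(y_b, y_c, n_t=200, n_threshold=1000)
```

Because the fused label is a convex combination of two probability vectors, it remains a valid distribution regardless of how the weight is chosen, so a different weighting rule can be swapped in without touching the losses.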

Subnetwork Rectification:
- To mitigate overfitting to noisy pseudo-labels, a lightweight subnetwork is initialized with a subset of the target model's parameters.
- Output Alignment ($L_{od}$): Minimizes Jensen-Shannon divergence between the subnetwork and the full target network.
- Gradient Discrepancy ($L_{wg}$): Enforces gradient divergence to ensure the subnetwork and full network learn complementary representations, acting as a regularizer.
- Self-Distillation & Prompt Tuning: The target model's predictions are used to iteratively refine pseudo-labels (via Exponential Moving Average) and fine-tune learnable CLIP prompts to better match the target domain.
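
The rectification terms and the EMA refinement can be sketched as follows. The Jensen-Shannon divergence and the exponential moving average are standard definitions; `gradient_similarity` is a hypothetical stand-in for the gradient discrepancy term, assuming that "gradient divergence" can be encouraged by penalizing the cosine similarity between the two networks' gradients. The summary does not state the paper's exact formula or momentum value.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def gradient_similarity(g_full, g_sub):
    """Cosine similarity between two flattened gradient vectors.

    ASSUMPTION: minimizing this value is one plausible way to push the full
    network and the subnetwork toward different update directions.
    """
    dot = sum(a * b for a, b in zip(g_full, g_sub))
    norm = math.sqrt(sum(a * a for a in g_full)) * math.sqrt(sum(b * b for b in g_sub))
    return dot / norm if norm > 0 else 0.0


def ema_update(old_label, new_prediction, momentum=0.9):
    """Refine a pseudo-label by blending in the target model's new prediction.

    The momentum value 0.9 is an illustrative default, not taken from the paper.
    """
    return [momentum * o + (1.0 - momentum) * n
            for o, n in zip(old_label, new_prediction)]


full_out = [0.6, 0.3, 0.1]  # full target network prediction
sub_out = [0.5, 0.4, 0.1]   # subnetwork prediction
l_od = js_divergence(full_out, sub_out)
refined = ema_update([0.7, 0.2, 0.1], full_out)
```

Since the EMA blends two probability vectors with weights that sum to one, the refined pseudo-label stays normalized without an extra renormalization step.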

Stage Two: Prototype-Based Self-Training

Prototype Extraction: Class-wise prototypes are computed based on the features and predicted labels from the Stage One model.
Label Correction: Target samples are reassigned to the class of their nearest prototype (using cosine distance).
Fine-tuning: The target model is further optimized using cross-entropy loss with these corrected, more accurate pseudo-labels.
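
The prototype extraction and label correction steps can be sketched as below. This is a minimal version under one assumption: prototypes are plain means of the features assigned to each class by hard predicted labels, whereas the paper may weight features differently (for example by prediction confidence). The function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def class_prototypes(features, labels, num_classes):
    """Mean feature vector per predicted class; None for classes with no samples."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for f, y in zip(features, labels):
        counts[y] += 1
        for j, x in enumerate(f):
            sums[y][j] += x
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(num_classes)
    ]


def correct_labels(features, prototypes):
    """Reassign each sample to the class whose prototype is closest in cosine terms."""
    unit_protos = [normalize(p) if p is not None else None for p in prototypes]
    corrected = []
    for f in features:
        f_unit = normalize(f)
        best_class, best_sim = None, float("-inf")
        for c, p in enumerate(unit_protos):
            if p is None:
                continue  # skip classes that received no samples
            sim = sum(a * b for a, b in zip(f_unit, p))
            if sim > best_sim:
                best_class, best_sim = c, sim
        corrected.append(best_class)
    return corrected


# Toy 2-D features; the last sample is deliberately mislabeled as class 0.
features = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8], [0.2, 0.9]]
predicted = [0, 0, 1, 1, 0]
prototypes = class_prototypes(features, predicted, num_classes=2)
corrected = correct_labels(features, prototypes)
```

In this toy case the mislabeled sample sits much closer to the class 1 prototype, so the correction step moves it there while leaving the other four labels unchanged. The corrected labels would then serve as targets for the cross-entropy fine-tuning.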

3. Key Contributions

Dual-Teacher Adaptive Fusion: A novel mechanism that dynamically balances the specific knowledge of a black-box source model and the general semantics of CLIP based on target domain size and uncertainty, generating high-quality pseudo-labels.
Subnetwork Rectification Strategy: A regularization technique using a subnetwork to enforce output consistency and gradient divergence, effectively reducing overfitting to noisy supervision without requiring source data.
Iterative Refinement: A two-stage process where target predictions iteratively improve pseudo-labels and CLIP prompts, followed by prototype-based self-training for final semantic alignment.
State-of-the-Art Performance: The method outperforms existing BBDA, Source-Free Domain Adaptation (SFDA), and even some Unsupervised Domain Adaptation (UDA) methods that have access to source data.

4. Experimental Results

The method was evaluated on three standard benchmarks: Office-31, Office-Home, and VisDA-17.

Office-31: Achieved an average accuracy of 93.1%, surpassing the second-best BBDA method (AEM) by 1.2 percentage points and outperforming most SFDA methods.
Office-Home: Achieved 83.2% average accuracy, consistently outperforming all compared methods (UDA, SFDA, and BBDA) across all tasks.
VisDA-17: Achieved 90.6% average accuracy, ranking first or second on more than half of the individual tasks.
Comparison: Notably, DDSR outperforms methods that have access to the source data or source model parameters, demonstrating the efficacy of leveraging ViLs and the proposed distillation strategy.
Ablation Studies: Confirmed that removing the adaptive fusion, subnetwork rectification, or prototype correction leads to significant performance drops, validating the necessity of each component.

5. Significance

Privacy-Preserving AI: DDSR provides a robust solution for scenarios where data privacy laws or proprietary restrictions prevent sharing source data or model weights, which is increasingly common in commercial AI services (APIs).
Bridging the Gap: It successfully bridges the gap between data-driven adaptation (relying on source models) and semantic-driven adaptation (relying on ViLs), showing that combining both yields superior results.
Practical Applicability: The framework is flexible regarding target model architecture (it does not need to match the source), so the target model can be sized for resource-constrained devices even when the source model is reachable only through API queries.

In conclusion, this paper presents a significant advancement in black-box domain adaptation by effectively leveraging the complementary strengths of task-specific black-box predictions and general-purpose vision-language models, while introducing novel regularization techniques to handle the inherent noise in such settings.