The Big Idea: Don't Hire Just One Specialist
Imagine you are trying to solve a very complex problem, like diagnosing a rare disease or identifying a specific type of bird.
In the past, AI researchers would hire one "super-expert" (a single pre-trained model) and try to teach it everything it needed to know for the specific job at hand.
- The Problem: If you hire a generalist (like a model trained on millions of cat and dog photos), they might be great at spotting animals but terrible at reading X-rays. If you hire a medical specialist, they might be amazing at X-rays but clueless about fine-grained details like the difference between two similar flower species.
- The Old Way: You had to pick one expert and hope they were "good enough" at everything, or you had to retrain them from scratch, which is expensive and slow.
pMoE changes the game. Instead of hiring one expert, it builds a team of diverse experts and creates a smart manager to coordinate them.
The Cast of Characters
1. The Experts (The "Prompt Tokens")
Think of the AI model as a giant library. Usually, you just ask the librarian for a book.
In pMoE, instead of just one librarian, you have a panel of specialists:
- Expert A: A generalist who knows everything about nature and everyday objects.
- Expert B: A medical genius who knows how to read X-rays and MRI scans.
- Expert C: A surgeon who understands precise shapes and boundaries.
In the paper, these "experts" are actually specialized notes (prompt tokens) attached to the model's input. They carry specific knowledge from different pre-trained models (like DINO for general vision, or LVM-Med for medical images).
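To make this concrete, here is a minimal PyTorch-style sketch of the "experts as prompt tokens" idea: each expert is a small bank of learnable tokens that gets prepended to the image's patch embeddings. The names here (PromptExpert, the token counts, the dimensions) are illustrative assumptions for the example, not the paper's actual code.

```python
import torch
import torch.nn as nn

# Sketch only: each "expert" is a small bank of learnable prompt tokens
# meant to carry knowledge derived from a different pre-trained backbone
# (e.g. DINO for general vision, LVM-Med for medical images).

class PromptExpert(nn.Module):
    def __init__(self, num_tokens: int = 10, dim: int = 768):
        super().__init__()
        # Only these tokens are trained; the frozen backbone never changes.
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, batch_size: int) -> torch.Tensor:
        # Expand to (batch, num_tokens, dim) so the tokens can be
        # concatenated with the image patch embeddings.
        return self.tokens.unsqueeze(0).expand(batch_size, -1, -1)

# Three experts, analogous to the generalist / medical / shape specialists.
experts = nn.ModuleList([PromptExpert() for _ in range(3)])
patch_embeddings = torch.randn(4, 196, 768)  # (batch, patches, dim)
prompts = experts[0](batch_size=4)           # one expert's tokens
tokens = torch.cat([prompts, patch_embeddings], dim=1)
print(tokens.shape)  # torch.Size([4, 206, 768])
```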
2. The Dispatcher (The "Smart Manager")
This is the magic ingredient. You can't just have all the experts shouting advice at once; that would be chaos. You need a manager who decides who speaks up and when.
The paper introduces a Learnable Dispatcher.
- How it works: Imagine you are looking at a picture of a lung tumor.
- The Dispatcher looks at the image and says, "Okay, for the first layer of analysis, let's listen to the Generalist to see what kind of tissue this is."
- Then, for the next layer, it says, "Now, let's bring in the Medical Expert to check for specific patterns."
- Finally, it says, "For the final decision, let's combine the Surgeon's advice on the shape."
- The Magic: The Dispatcher doesn't just pick one; it mixes their advice dynamically. It figures out exactly how much weight to give to each expert's opinion based on the specific task at hand.
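A hedged sketch of how such a dispatcher could work: a small learnable gate looks at the current image features, scores each expert, and blends all the experts' prompt tokens with those weights. This illustrates the gating idea, not the paper's exact architecture; every name below is made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a learnable dispatcher that mixes expert prompt tokens
# with image-dependent weights at a given layer.

class Dispatcher(nn.Module):
    def __init__(self, dim: int = 768, num_experts: int = 3):
        super().__init__()
        # A small gate that scores each expert for the current input.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor, expert_prompts: torch.Tensor) -> torch.Tensor:
        # x: (batch, patches, dim) features at this layer
        # expert_prompts: (num_experts, num_tokens, dim)
        summary = x.mean(dim=1)                           # (batch, dim) pooled summary
        weights = F.softmax(self.gate(summary), dim=-1)   # (batch, num_experts)
        # Weighted blend of all experts' prompt tokens, per image.
        return torch.einsum("be,etd->btd", weights, expert_prompts)

dispatcher = Dispatcher()
features = torch.randn(4, 196, 768)
expert_prompts = torch.randn(3, 10, 768)   # 3 experts, 10 tokens each
mixed_prompts = dispatcher(features, expert_prompts)
print(mixed_prompts.shape)  # torch.Size([4, 10, 768])
```

Because the weights come from a softmax over the current image's features, the mixture can shift from layer to layer and from image to image, which is exactly the "who speaks up and when" behavior described above.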
The Analogy: The "All-Star" Cooking Team
Imagine you are trying to cook a perfect meal for a very picky guest who wants a dish that is both a gourmet French dessert and a spicy Indian curry.
- The Old Way (Single Prompt Tuning): You hire one chef who is good at both cuisines but maybe only "okay" at each. You try to tweak their recipe slightly. The result is mediocre.
- The pMoE Way: You hire a French Pastry Chef and an Indian Spice Master.
- You don't ask them to cook the whole meal alone.
- You have a Head Chef (The Dispatcher).
- When it's time to make the dough, the Head Chef asks the French Chef for advice.
- When it's time to mix the spices, the Head Chef asks the Indian Chef.
- The Head Chef blends their instructions perfectly in real-time.
The Result: You get a dish that is far better than what either chef could make alone, and you didn't have to hire a new, expensive "super-chef" to do the whole job.
Why This Matters (The "So What?")
The paper tested this idea on 47 different tasks, spanning both everyday and medical domains:
- General Tasks: Identifying birds, cars, and flowers.
- Medical Tasks: Detecting polyps in the colon, identifying skin cancer, and reading X-rays.
The Results:
- Better Accuracy: The "All-Star Team" (pMoE) beat the single-expert methods by a significant margin. It was better at spotting the tiny details in medical images and the subtle differences in bird feathers.
- Efficiency: Even though it uses multiple experts, the system is very efficient. It doesn't require retraining the whole AI from scratch; it only tweaks the "notes" (prompts) and the "manager" (dispatcher), which saves massive amounts of computing power (see the sketch after this list).
- Versatility: It works equally well for a general photo of a dog and a complex MRI scan of a brain.
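The efficiency claim boils down to freezing the big backbone and training only the tiny prompt and dispatcher parameters. A minimal sketch of that setup, with illustrative stand-in names rather than the paper's actual code:

```python
import torch
import torch.nn as nn

# Sketch: the pre-trained backbone stays frozen; only the prompt tokens
# and a small dispatcher gate receive gradient updates.

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)
prompt_experts = nn.Parameter(torch.randn(3, 10, 768) * 0.02)  # 3 expert banks
dispatcher = nn.Linear(768, 3)                                  # gate over experts

# Freeze the backbone: its millions of weights never receive gradients.
for p in backbone.parameters():
    p.requires_grad = False

trainable = [prompt_experts] + list(dispatcher.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

total = sum(p.numel() for p in backbone.parameters())
tuned = sum(p.numel() for p in trainable)
print(f"frozen backbone params: {total:,} | trained params: {tuned:,}")
```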
In a Nutshell
pMoE is a new way to teach AI. Instead of forcing one AI to be good at everything, it creates a dynamic team where different experts contribute their specific knowledge to solve a problem, managed by a smart system that knows exactly who to listen to at every step. It's like upgrading from a solo musician to a perfectly conducted orchestra.