Semi-Supervised Few-Shot Adaptation of Vision-Language Models

This paper proposes an efficient semi-supervised method for few-shot adaptation of vision-language models in medical imaging that leverages unlabeled data to propagate text-informed pseudo-labels, thereby reducing annotation requirements by over 50% while addressing class imbalance challenges.

Julio Silva-Rodríguez, Ender Konukoglu

Published 2026-03-04

Imagine you are a brilliant art critic who has spent years studying millions of paintings from every era and culture. You know the difference between "Impressionism" and "Cubism" just by looking at them. This is like a Vision-Language Model (VLM): an AI that has learned to understand both images and words by reading a massive library of data.

Now, imagine a doctor asks you to identify a very specific, rare type of skin condition. But there's a catch: the doctor only has three photos of this condition to show you, and they are all from the same patient.

This is the "Few-Shot" problem. The AI is smart, but it's never seen this specific condition before, and it doesn't have enough examples to learn from. Usually, to teach the AI, you'd need hundreds of photos labeled by experts, which is expensive and slow.

The Problem: The "Class Imbalance" Trap

In medical imaging, some diseases are common, but others are very rare. If you only have three photos to teach the AI, and two are of the common disease and one is of the rare one, the AI gets confused. It starts thinking, "Oh, this must be the common thing!" because that's what it saw most often. It ignores the rare one, and its performance drops.

The Solution: The "Ghost" Students

The authors of this paper, Julio and Ender, asked a simple question: "What if we have thousands of unlabeled photos of this condition sitting in a drawer, but no one has written down what they are?"

In the real world, hospitals have tons of images; they just don't have the time or money to label them all.

Their new method, called SS-Text-U, is like a clever teacher who uses those unlabeled photos to help the AI learn, even without knowing the exact answers yet. Here is how it works, using a simple analogy:

1. The "Text" Compass

The AI already knows what the diseases sound like because it was trained on text. It knows the definition of "Melanoma" or "Fracture."

  • The Old Way: The teacher points at the three labeled photos and says, "This is A, this is B."
  • The New Way: The teacher says, "Based on the words describing these diseases, I'm going to guess what these other 1,000 unlabeled photos probably are. Let's call them 'Ghost Labels'."
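The "ghost label" step can be sketched with a CLIP-style similarity check: embed the text describing each disease, embed each unlabeled image, and treat the softmax over cosine similarities as a soft guess. Below is a minimal toy sketch with random stand-in embeddings (the noise level, dimensions, and temperature are made-up illustration values, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # L2-normalize so dot products are cosine similarities,
    # as CLIP-style models do with their embeddings.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

num_classes, dim = 3, 32
# One text embedding per disease description (random stand-ins here).
text_emb = normalize(rng.normal(size=(num_classes, dim)))

# 1,000 unlabeled images: each simulated as a noisy copy of its class's text vector.
true_labels = rng.integers(0, num_classes, size=1000)
image_emb = normalize(text_emb[true_labels] + 0.3 * rng.normal(size=(1000, dim)))

# "Ghost labels": softmax over image-text similarity, sharpened by a temperature.
logits = image_emb @ text_emb.T / 0.1
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
pseudo = probs.argmax(axis=1)

print("ghost-label accuracy:", (pseudo == true_labels).mean())
```

Because the text encoder already "knows" what each disease name means, these guesses are far better than chance, even before any training happens.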

2. The "Optimal Transport" Dance

Here is the tricky part. If the teacher guesses each photo on its own, the guesses drift toward the common disease, and we fall back into the imbalance trap. So the authors use a mathematical tool called Optimal Transport (think of it as a very smart seating plan).

Imagine you have a group of students (the unlabeled photos) and a set of desks (the disease categories).

  • The teacher knows the ratio of students in the class (e.g., "We know there are usually 10 students with the common cold and only 1 with the rare flu").
  • The teacher assigns the "Ghost Labels" to the students in a way that matches this ratio perfectly.
  • If the AI wants to dump a borderline photo in the "common cold" pile, but that pile is already full (the ratio only allows so many common-cold seats), the math forces it to look closer and seat the photo at the "flu desk" where it actually belongs, rather than defaulting to the common class.
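Under the hood, "filling the desks to match the ratio" is what the Sinkhorn-Knopp iteration does: it alternately rescales the assignment matrix so each class's total matches the known proportions and each image counts as exactly one "student". Here is a generic balanced-assignment sketch (toy scores and ratio; this illustrates the standard Sinkhorn recipe, not necessarily the authors' exact formulation):

```python
import numpy as np

def sinkhorn_pseudo_labels(logits, class_ratio, n_iters=100, temperature=1.0):
    """Soft pseudo-labels whose class totals match class_ratio.

    logits:      (n_images, n_classes) image-text similarity scores
    class_ratio: (n_classes,) known class proportions, summing to 1
    """
    q = np.exp(logits / temperature)
    n = logits.shape[0]
    for _ in range(n_iters):
        # Column step: force each class's share to match the known ratio.
        q *= (n * class_ratio) / q.sum(axis=0)
        # Row step: force each image to be exactly one full "student".
        q /= q.sum(axis=1, keepdims=True)
    return q

# Toy example: 2 diseases, the common one is 90% of cases.
rng = np.random.default_rng(1)
logits = rng.normal(size=(200, 2))
ratio = np.array([0.9, 0.1])

q = sinkhorn_pseudo_labels(logits, ratio)
print("class shares:", q.sum(axis=0) / 200)  # close to [0.9, 0.1]
```

Without the column step, a model biased toward the common class would assign nearly everything to it; the ratio constraint is what rescues the rare class.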

3. The Result: Learning with Half the Effort

By using this "Ghost Label" system, the AI can learn from the 1,000 unlabeled photos to understand the shape and texture of the rare disease, even though no human ever wrote down the answer.

The Magic Stat:
The paper shows that this method allows the AI to perform just as well as if you had given it 4 to 8 labeled photos, even when you only gave it 1 or 2.

  • Translation: You can cut the cost of labeling medical images by 50% to 75%. You get the same smart AI, but you spend half the money and time.

Why This Matters

Think of it like training a new employee.

  • Old Way: You have to sit with them for a week, showing them 100 examples of every task.
  • New Way: You show them 2 examples, then let them practice on 1,000 "shadow" tasks where you give them hints based on the job description. They learn faster, make fewer mistakes on rare tasks, and you save a massive amount of time.

In a Nutshell

The authors built a tool that lets AI learn from unlabeled data by using text descriptions as a guide. It fixes the problem where AI gets confused by rare diseases, making medical AI cheaper, faster, and more accurate, especially when there are very few examples to start with.