Active Prompt Learning with Vision-Language Model Priors

Imagine you have a remarkably well-read librarian named CLIP. This librarian has studied hundreds of millions of pictures together with their captions. Because of this, if you ask, "Show me a picture of a cat," CLIP can find one right away, even though nobody ever trained it specifically on a labeled set of cat photos. This is called "zero-shot" recognition.

However, there's a catch. To get the best results, you have to ask the librarian very specific questions (called prompts). If you ask, "A photo of a cat," it might work okay. But if you ask, "A fluffy tabby cat sitting on a windowsill," it works much better.

The problem is that finding the perfect question for every new task is hard, time-consuming, and expensive. You can't just ask the librarian to read every single book in the library to learn; you need to pick the right few books to teach it quickly.

This paper introduces a new, budget-friendly way to teach this librarian using Active Prompt Learning. Here is how it works, broken down into simple analogies:

1. The Problem: The "Cold Start" and the "Wasted Budget"

Imagine you are a teacher trying to teach a new student (the AI) about 100 different types of birds. You have a limited budget of 100 stickers (labeled data) to give the student as rewards for correct answers.

The Old Way: Most teachers just pick 100 random birds from a huge pile. They might pick 50 pictures of sparrows and 0 pictures of eagles. The student gets confused because they haven't seen enough variety. Or, they might pick 50 pictures of birds the student already knows perfectly, wasting the stickers.
The Goal: We want to pick the most useful birds to teach the student, using as few stickers as possible, while making sure we cover all types of birds equally.

2. The Solution: A Two-Step Magic Trick

The authors propose a framework with two main tricks to solve this:

Trick A: The "Class-Guided Map" (Better Sorting)

Usually, when we try to sort a pile of mixed-up photos, we just look at the colors and shapes (Image Features). But the librarian (CLIP) also knows the names of things (Text Features).

The Analogy: Imagine you have a pile of mixed fruit.
- Old Method: You sort them by color. You might put a red apple and a red tomato in the same pile.
- New Method (Class-Guided Clustering): You ask the librarian, "Is this an apple or a tomato?" and use that knowledge to help sort the pile.
- How it works: The AI combines the picture of the bird with the text description of what it might be. This creates a "Class-Guided Feature." Now, when the AI sorts the birds into groups (clusters), it doesn't just group them by "looks like a sparrow"; it groups them by "is actually a sparrow."
- The Result: From the very first day (Round 1), the AI can pick a perfect, balanced mix of birds from every group, avoiding the "cold start" problem where it has no idea what to pick.

Trick B: The "Confidence Check" (Saving Money)

Once the AI picks a group of birds to show the teacher, it doesn't always need the teacher to label them.

The Analogy: Imagine the librarian is 99% sure a picture is a "Blue Jay." Why pay a human to confirm it?
The Old Way: The teacher pays for a label for every single picture the AI picks, even if the AI is already 100% sure. This burns the budget fast.
The New Method (Selective Querying): The AI checks its own confidence.
- If the AI is unsure (e.g., "Is this a Blue Jay or a Steller's Jay?"), it asks the human teacher for the answer.
- If the AI is very confident (e.g., "That's definitely a Blue Jay!"), it gives itself a "pseudo-label" (a fake label it trusts) and saves the sticker.
- The Twist: Different birds are harder to tell apart. The AI learns that "Blue Jays" are easy, but "Warblers" are hard. So, it sets different confidence rules for each bird type. It saves more stickers on easy birds and spends more on hard ones.

3. The Result: Smarter Learning, Less Cost

By using these two tricks, the AI learns faster and more accurately than previous methods.

Efficiency: It achieves the same high accuracy as other methods but uses 17.6% fewer labeled examples (stickers).
Fairness: It ensures every type of bird gets equal attention, preventing the AI from only learning about the "popular" birds.
Scalability: It works even on massive datasets (like ImageNet, with about 1.28 million training images) because the sorting method is lightweight and fast.

Summary

Think of this paper as a smart shopping assistant for training AI.

Instead of blindly grabbing random items from the shelf, it uses a smart map (combining pictures and text) to grab exactly the right mix of items you need.
Instead of asking the cashier to check the price of every single item, it checks the price tag itself for items it's sure about, only asking the cashier for the tricky ones.

This saves time, money, and effort, allowing the AI to become an expert much faster.

Here is a detailed technical summary of the paper "Active Prompt Learning with Vision-Language Model Priors".

1. Problem Statement

Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot capabilities across various tasks. However, adapting them to specific downstream tasks typically relies on prompt learning, where learnable text prompts are optimized while the encoders remain frozen.

The Bottleneck: Traditional prompt learning methods are model-centric, focusing on optimizing prompt architectures or loss functions given a fixed, small set of labeled data (few-shot). They often overlook the potential of data selection.
The Challenge: In Active Learning (AL) scenarios, selecting the most informative samples to label is crucial for budget efficiency. Existing AL methods for VLMs often suffer from:
1. Cold-start problems: Lack of reliable data evaluation in the initial round without labeled data.
2. Inefficient budget usage: Failing to account for the fact that VLMs may already be highly confident in certain classes, leading to unnecessary manual labeling.
3. Imbalanced priors: VLMs possess inherent knowledge imbalances across different classes, which standard acquisition functions fail to address effectively.

2. Methodology

The authors propose a budget-efficient Active Prompt Learning framework that leverages VLM priors through two main components: Class-Guided Clustering and Selective Querying.

A. Class-Guided Clustering (Warm-Start Mechanism)

To solve the cold-start problem and ensure diverse data selection from the very first round, the authors introduce a feature representation that combines visual and textual information.

Feature Construction: For an image $x$, they construct a Class-Guided Feature ($F_C$) by concatenating:
1. Image Features ($I$): Extracted from the pre-trained image encoder.
2. Weighted Text Features ($\tilde{T}_C$): A weighted sum of text embeddings for all classes, where weights are determined by the VLM's zero-shot confidence scores (similarity) for that image.
  $F_C(x) = [I(x), \tilde{T}_C(x)]$, where $\tilde{T}_C(x) = \sum_{c \in \mathcal{C}} p(y=c|x) \cdot \theta_{txt}(t_c)$
Clustering: $K$-means clustering is performed on these $F_C$ features.
Acquisition: In each round, the algorithm selects the image closest to the centroid of each cluster.
Dynamic $K$: To ensure diversity as the labeled set grows, the number of clusters $K$ increases linearly with the round number $r$ ($K = B \times r$), ensuring that new, unlabeled regions of the feature space are explored.
Advantage: Unlike traditional image-only clustering, $F_C$ aligns the feature space with the target classification task, allowing for effective "warm-start" selection without initial labels.
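
The feature construction and centroid-based acquisition described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the L2 normalization, the softmax temperature of 0.01 (a CLIP-style logit scale of 100), and the plain Lloyd's K-means loop are all assumptions made for the example.

```python
import numpy as np


def class_guided_features(image_feats, text_feats, temperature=0.01):
    """Concatenate image features with confidence-weighted text features.

    image_feats: (N, D) array of image embeddings I(x).
    text_feats:  (C, D) array of class text embeddings theta_txt(t_c).
    temperature: softmax temperature (assumed CLIP-style value).
    """
    # Assumption: L2-normalize so dot products are cosine similarities.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    # Zero-shot confidence p(y=c|x): softmax over scaled similarities.
    logits = img @ txt.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # Weighted text feature: sum_c p(y=c|x) * theta_txt(t_c).
    weighted_txt = probs @ txt
    return np.concatenate([img, weighted_txt], axis=1)  # shape (N, 2D)


def select_nearest_to_centroids(features, k, n_iter=20, seed=0):
    """Run Lloyd's K-means, then pick the sample nearest each centroid."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct samples (fancy indexing copies).
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every sample to every centroid: (N, k).
        d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)

    # Final assignment, then one representative per non-empty cluster.
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    picked = []
    for j in range(k):
        idx = np.flatnonzero(assign == j)
        if len(idx) > 0:
            picked.append(int(idx[d2[idx, j].argmin()]))
    return picked
```

Because the second half of each feature vector is a class-weighted text embedding, images that the VLM assigns to the same class are pulled together before clustering, which is the intuition behind the "warm-start" claim. For a large pool, a library K-means would likely replace the dense distance matrix used here.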

B. Selective Querying (Budget-Saving Mechanism)

To reduce the labeling budget, the authors propose a strategy to skip manual annotation for samples where the VLM is already confident.

Adaptive Class-Wise Thresholds: Instead of a global threshold, the method computes a confidence threshold ($\epsilon_{r,c}$) for each class $c$ based on the average confidence of previously labeled samples in that class.
Query Logic:
- If a candidate sample's confidence for its predicted class exceeds the class-specific threshold $\epsilon_{r,c}$, a pseudo-label is assigned (no human annotation needed).
- If the confidence is below the threshold, the sample is sent to human annotators for ground-truth labeling.
Unified Prompts: To prevent overfitting and ensure reliable confidence scores for thresholding, the authors utilize a unified prompt (shared across classes) rather than purely class-specific prompts during the confidence estimation phase.
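
The query logic above can be illustrated with a short, dependency-free sketch. It assumes the threshold for a class is simply the mean confidence of that class's previously labeled samples, as the summary states; the function names, the tuple layouts, the default threshold for classes with no labeled samples, and the handling of ties are hypothetical choices made for this example.

```python
from collections import defaultdict


def class_thresholds(labeled, default=1.0):
    """Per-class threshold: mean confidence of labeled samples in that class.

    labeled: iterable of (class_id, confidence) pairs from earlier rounds.
    default: threshold for classes with no labeled samples yet. Assumption:
             1.0 means such samples always go to the human annotator.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for cls, conf in labeled:
        sums[cls] += conf
        counts[cls] += 1
    thresholds = defaultdict(lambda: default)
    for cls in counts:
        thresholds[cls] = sums[cls] / counts[cls]
    return thresholds


def selective_query(candidates, thresholds):
    """Split candidates into pseudo-labeled samples and samples to annotate.

    candidates: iterable of (sample_id, predicted_class, confidence).
    """
    pseudo, to_annotate = [], []
    for sample_id, pred, conf in candidates:
        # Confidence must strictly exceed the class threshold to skip the
        # annotator; ties go to the human (an assumption of this sketch).
        if conf > thresholds[pred]:
            pseudo.append((sample_id, pred))
        else:
            to_annotate.append(sample_id)
    return pseudo, to_annotate
```

With labeled confidences of 0.9 and 0.7 for class 0, the threshold for that class is 0.8, so a new class-0 candidate at 0.95 would receive a pseudo-label while one at 0.6 would be sent for annotation.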

C. Training Loop

The framework operates in rounds:

1. Extract class-guided features for the unlabeled pool.
2. Perform clustering and select representative candidates.
3. Apply selective querying to assign pseudo-labels or request annotations.
4. Train the prompt vectors on the accumulated dataset (re-initializing prompts at each round to avoid bias).
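
The four steps above can be arranged into a round-based skeleton like the one below. It is only a structural sketch: `select`, `query`, and `train` are hypothetical callables standing in for clustering, selective querying, and prompt training, and `per_round` plays the role of $B$ in $K = B \times r$ (treating $B$ as a per-round quantity is an assumption). How the candidates drawn from $K$ clusters are narrowed to the round's budget is not specified in this summary, so that decision is left to `select`.

```python
def run_active_rounds(pool, num_rounds, per_round, select, query, train):
    """Skeleton of the round-based loop; all callables are hypothetical.

    select(pool, k)   -> candidate sample ids chosen using k clusters
    query(candidates) -> (pseudo, human): two lists of (sample_id, label)
    train(dataset)    -> a freshly initialized, newly trained prompt model
    """
    pool = set(pool)
    dataset, model, human_labels = [], None, 0
    for r in range(1, num_rounds + 1):
        k = per_round * r  # K = B x r: cluster count grows with the round
        candidates = select(sorted(pool), k)
        pseudo, human = query(candidates)
        dataset.extend(pseudo)
        dataset.extend(human)
        human_labels += len(human)  # only human labels consume budget
        # Remove everything that received a label from the unlabeled pool.
        pool -= {sample_id for sample_id, _ in pseudo + human}
        # Prompts are re-initialized each round, so train from scratch.
        model = train(dataset)
    return model, dataset, human_labels
```

Tracking `human_labels` separately from `len(dataset)` makes the budget saving visible: the gap between the two is the number of samples that were pseudo-labeled instead of annotated.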

3. Key Contributions

Budget-Efficient Framework: A novel active learning framework specifically designed for VLMs that fully exploits pre-trained priors for both data selection (clustering) and label acquisition (pseudo-labeling).
Class-Guided Features: The introduction of a hybrid feature space (Image + Weighted Text) that enables effective clustering from the first round, solving the cold-start problem and improving cluster separation (verified via t-SNE and Adjusted Rand Index).
Adaptive Selective Querying: A mechanism to dynamically save labeling budgets by leveraging class-wise confidence thresholds, reducing the need for human annotation without sacrificing accuracy.
Synergy with Model-Centric Methods: The proposed data selection strategy acts as a plug-in that enhances existing model-centric prompt learning methods (e.g., MaPLe, PromptSRC) when applied to curated datasets.

4. Experimental Results

The method was evaluated on seven diverse datasets (OxfordPets, FGVCAircraft, Caltech101, Flowers102, DTD, StanfordCars, EuroSAT) and the large-scale ImageNet dataset.

Performance: The proposed method (CB+SQ) consistently outperformed state-of-the-art baselines, including:
- Random sampling.
- Uncertainty-based methods (Entropy).
- Diversity-based methods (CoreSet, BADGE).
- Previous VLM-specific active learning (PCB).
Efficiency:
- Achieved higher accuracy with fewer labeled samples. For instance, in the first round, it showed a 19.5% performance gain over baselines.
- Reduced the total labeling budget by approximately 17.6% while maintaining comparable or superior accuracy.
Scalability: Successfully scaled to ImageNet (1.28M images), overcoming computational bottlenecks that limited previous methods like CoreSet and BADGE.
Generalization: Demonstrated strong performance in "Base-to-Novel" generalization scenarios, outperforming random selection on unseen novel classes.
Ablation Studies: Confirmed that:
- Class-guided features significantly outperform image-only features.
- Selective querying is most effective when paired with diversity-based acquisition (like their clustering) rather than uncertainty-based methods.
- Unified prompts yield more reliable confidence distributions for thresholding than class-wise prompts.

5. Significance

This paper shifts the paradigm in VLM adaptation from a purely model-centric view (optimizing the prompt architecture) to a data-centric view (optimizing which data is used to train the prompt).

Practical Impact: It offers a highly efficient solution for deploying VLMs in resource-constrained environments where human annotation is expensive or scarce.
Theoretical Insight: It demonstrates that the "prior knowledge" embedded in foundation models can be explicitly leveraged not just for inference, but to guide the data acquisition process itself, effectively turning the model into a smart curator of its own training data.
Future Direction: The work opens avenues for extending data-centric active learning strategies to other vision tasks (detection, segmentation) and other foundation models beyond CLIP.