Imagine you've just bought a super-smart, all-knowing robot chef. This robot has read every cookbook in the world and seen millions of photos of food. You call it a "Foundation Model." It's amazing at recognizing a pizza or a hamburger because it's seen them a billion times.
But now, you want to use this robot in a small village in Africa to identify local dishes like Ekwang (a dish made of grated cocoyam wrapped in leaves) or Ndole. The robot has never seen these specific dishes before.
The Problem:
Before you hire a team of local experts to spend months labeling thousands of photos to teach the robot, you need to know: "Will this robot even be able to learn this dish, or is it completely clueless?"
Usually, the only way to find out is to do the expensive, time-consuming work of labeling the data first. If the robot fails, you've wasted all that time and money.
The Solution (The "One-Shot Probe"):
This paper introduces a clever, low-cost trick to peek inside the robot's brain before you do the heavy lifting. It's like asking the robot a single, tricky riddle to see if it has the right "muscles" to solve the puzzle.
Here is how the trick works, broken down into simple steps:
1. The "One-Shot" Setup
Instead of showing the robot 1,000 photos of Ekwang, you show it just one.
- Step A: You take that one photo and ask a super-smart text AI (a Large Language Model) to write a perfect description of it.
- Example: "A plate of Ekwang, featuring grated cocoyam wrapped in green leafy vegetables..."
- Step B: Then, you ask that same text AI to write five fake descriptions that sound very similar but describe different dishes. These are the "Counterfactuals" (or "Hard Negatives").
- Fake 1: "A bowl of Ndole, showcasing stewed bitterleaf..."
- Fake 2: "A serving of Eru, with finely chopped wild spinach..."
- Fake 3: "A plate of Jollof rice..."
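Steps A and B boil down to building two prompts for the text AI: one asking for a faithful caption, and one asking for stylistically similar captions of different dishes. A minimal sketch of that prompt construction (the exact wording and the prompt structure here are illustrative assumptions, not the paper's actual prompts):

```python
def build_probe_prompts(dish_name: str, n_negatives: int = 5):
    """Build the two LLM prompts for the one-shot setup.

    Returns (description_prompt, counterfactual_prompt). The wording
    is illustrative only, not the paper's exact prompt templates.
    """
    # Step A: ask for one accurate caption of the real dish.
    description_prompt = (
        f"Write one accurate, detailed caption for a photo of {dish_name}."
    )
    # Step B: ask for hard negatives -- similar style, different dishes.
    counterfactual_prompt = (
        f"Write {n_negatives} captions in the same style that describe "
        f"similar-looking but different dishes (not {dish_name}). "
        "These will serve as hard negatives."
    )
    return description_prompt, counterfactual_prompt


desc_p, cf_p = build_probe_prompts("Ekwang")
print(desc_p)
print(cf_p)
```

The returned strings would then be sent to whatever LLM is available; only the prompt-building step is shown here because the LLM call itself is interchangeable.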
2. The "Tug-of-War" Test
Now, you bring in the robot chef (the Vision-Language Model) and show it the original photo of the Ekwang. You ask it: "Which description matches this photo?"
- Does the robot say, "Oh yes, that's the Ekwang description!"?
- Or does it get confused and say, "Hmm, maybe that's the Ndole one?"
If the robot is good at understanding Ekwang, it will easily pick the real description and ignore the fake ones. If it's confused, it will struggle to tell them apart.
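Under the hood, the "tug-of-war" is standard image-text matching: embed the photo and all six captions in the model's shared space, then check whether the true caption gets the highest similarity. A minimal sketch using toy NumPy vectors in place of real model embeddings (in practice these would come from a CLIP-style vision-language model; the numbers below are made up):

```python
import numpy as np

def probe_score(image_emb, text_embs, true_idx=0):
    """Return (softmax probability of the true caption, whether it won).

    A high probability means the model cleanly separates the real
    description from the counterfactuals; a low one means confusion.
    """
    # Cosine similarity between the image and each candidate caption.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img
    # Softmax over the six candidates -> a probability per caption.
    probs = np.exp(sims) / np.exp(sims).sum()
    return probs[true_idx], int(np.argmax(sims)) == true_idx

# Toy example: row 0 plays the true caption and is built to align
# closely with the image; rows 1-5 are random "hard negatives".
rng = np.random.default_rng(0)
image = rng.normal(size=64)
captions = rng.normal(size=(6, 64))
captions[0] = image + 0.1 * rng.normal(size=64)
score, correct = probe_score(image, captions)
print(score, correct)
```

Real models like CLIP also apply a learned temperature before the softmax, which sharpens the probabilities; it is omitted here to keep the sketch minimal.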
3. The Crystal Ball (Prediction)
The researchers found that the robot's "score" on this single, tricky riddle is a strong predictor of how the robot would perform on the entire dataset of 1,000 photos.
They fit a simple statistical model (a linear regressor) that turns these scores into a predicted accuracy. It's like a weather forecaster who looks at a single drop of rain and a change in wind pressure to predict whether a whole storm is coming.
- High Score on the riddle? The robot will likely do great on the full dataset.
- Low Score? The robot is probably going to fail, so don't waste money labeling the data.
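The "crystal ball" itself is just an ordinary linear fit: gather probe scores on datasets where the full-exam accuracy is already known, fit a line, and read off predictions for new, unlabeled datasets. A sketch with invented numbers (these are not the paper's features or coefficients):

```python
import numpy as np

# Probe scores and measured full-dataset accuracies for a few
# benchmarks where labels already exist (numbers invented here).
probe_scores = np.array([0.20, 0.45, 0.60, 0.80, 0.90])
full_accuracy = np.array([0.25, 0.50, 0.62, 0.78, 0.88])

# Fit the linear regressor: accuracy ~= a * probe_score + b.
a, b = np.polyfit(probe_scores, full_accuracy, deg=1)

def predict_accuracy(score):
    """Forecast full-dataset accuracy from a single probe score."""
    return a * score + b

# A new, unlabeled dataset: one probe score is enough for a forecast.
print(round(predict_accuracy(0.70), 2))
```

The practical payoff is the decision rule from the bullets above: if the predicted accuracy is high, labeling the full dataset is worth the investment; if it is low, the money is better spent elsewhere.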
Why This Matters
- Saves Money & Time: You don't need to label thousands of images to know if a model will work. You just need one image per category.
- Helps the "Global South": Most AI models are trained on data from the US and Europe. They often fail on African, Asian, or Indigenous topics. This tool helps researchers in those regions check if a model is actually useful for their local needs before they invest in it.
- No "Black Box" Needed: You don't need to know how the robot was trained or see its secret training data. You just test its reaction to a few cleverly crafted questions.
The Analogy Summary
Think of the AI model as a student and the dataset as a final exam.
- Old Way: You make the student take the full 100-question exam to see if they pass. If they fail, you wasted a lot of paper and time.
- New Way (This Paper): You ask the student one very tricky question that mixes up the right answer with five very similar wrong answers. Based on how they handle that one question, you can predict with 96% accuracy whether they will pass the whole exam.
This method allows researchers to be smart about where they spend their resources, ensuring that AI tools are actually helpful for everyone, not just the people who are already well-represented in the data.