Imagine you are trying to teach a robot to recognize animals. In a perfect world, you would show it a picture of a fox and say, "This is a fox," and a picture of a wolf and say, "This is a wolf." The robot learns quickly.
But in the real world, getting perfect labels is hard, expensive, and slow. Often, we have to rely on crowdsourcing or quick online searches. This leads to messy data. You might show the robot a picture of a fox, but the label says, "This could be a fox, a dog, or a wolf." The robot gets confused. If you try to teach it "Zero-Shot Learning" (recognizing animals it has never seen before, like a "pangolin," based on what it knows about foxes), this confusion makes the robot fail completely. It starts memorizing the wrong answers instead of learning the real rules.
This paper introduces a new system called CLIP-PZSL to fix this mess. Here is how it works, using some simple analogies:
1. The Super-Translator (CLIP)
First, the system uses a powerful AI tool called CLIP. Think of CLIP as a super-translator that speaks both "Image" and "Text."
- If you show it a picture of a lion, it doesn't just see pixels; it understands the concept of a lion.
- If you type "a photo of a lion," it understands that text in the same way.
- Because it has seen millions of images and texts, it already knows what a "pangolin" looks like, even if you've never shown it a picture of one. This is the "Zero-Shot" magic.
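The matching idea above can be sketched in a few lines. This is a toy illustration, not real CLIP: the 4-d vectors below stand in for the embeddings a real CLIP image encoder and text encoder would produce, and the class list is made up. The core mechanism, though, is exactly this: pick the class whose text embedding is most similar to the image embedding.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: how aligned two embedding vectors are."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_classify(image_emb, text_embs, class_names):
    """Pick the class whose text embedding is closest to the image embedding."""
    sims = np.array([cosine_sim(image_emb, t) for t in text_embs])
    return class_names[int(np.argmax(sims))]

# Toy 4-d vectors standing in for real CLIP encoder outputs (hypothetical values).
classes = ["fox", "wolf", "pangolin"]
text_embs = [np.array([1.0, 0.0, 0.0, 0.1]),   # embedding of "a photo of a fox"
             np.array([0.0, 1.0, 0.0, 0.1]),   # embedding of "a photo of a wolf"
             np.array([0.0, 0.0, 1.0, 0.1])]   # embedding of "a photo of a pangolin"
image_emb = np.array([0.9, 0.2, 0.1, 0.1])     # an image that "looks like a fox"

print(zero_shot_classify(image_emb, text_embs, classes))  # prints "fox"
```

Because the text side only needs a class *name*, adding a brand-new class ("pangolin") is as cheap as adding one more string and embedding it, which is where the zero-shot ability comes from.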
2. The Detective Block (Semantic Mining)
The problem is the messy labels. The robot sees a picture of a fox, but the label list says: [Fox, Dog, Wolf, Lion]. Only "Fox" is right. The others are "noise."
The paper adds a special Semantic Mining Block. Imagine this as a detective or a filter.
- Instead of blindly trusting the list of candidates, the detective compares the picture to every word on the list.
- It asks: "Does this picture really look like a 'Wolf'? No, the features don't match."
- It asks: "Does it look like a 'Fox'? Yes, the features match perfectly."
- Over time, this detective learns to ignore the "Wolf" and "Dog" labels and focus only on the "Fox." It essentially cleans the data while it is learning.
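One plausible way to implement that detective, sketched below under assumptions: the paper's actual Semantic Mining Block is a learned module, but the core move of "score every candidate against the image, then turn the scores into confidence weights" can be mimicked with a softmax over CLIP-style similarities restricted to the candidate set. All embeddings here are toy values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mine_candidate_weights(image_emb, text_embs, candidate_idx, temperature=0.1):
    """Weight each candidate label by how well its text embedding matches the
    image. Non-candidates get zero weight; candidates share a softmax over
    their similarities (low temperature = sharper, more decisive filtering)."""
    sims = np.array([image_emb @ t / (np.linalg.norm(image_emb) * np.linalg.norm(t))
                     for t in text_embs])
    weights = np.zeros(len(text_embs))
    weights[candidate_idx] = softmax(sims[candidate_idx] / temperature)
    return weights

classes = ["fox", "dog", "wolf", "lion"]
text_embs = [np.array([1.0, 0.1, 0.0]),
             np.array([0.2, 1.0, 0.0]),
             np.array([0.1, 0.3, 1.0]),
             np.array([0.0, 0.0, 1.0])]
image_emb = np.array([0.95, 0.15, 0.05])   # the image is actually a fox
candidates = [0, 1, 2]                     # noisy label set: [Fox, Dog, Wolf]

w = mine_candidate_weights(image_emb, text_embs, candidates)
print(classes[int(np.argmax(w))])  # prints "fox"
```

Notice that "Lion" never appears in the candidate set, so it gets exactly zero weight, while "Dog" and "Wolf" get small but nonzero weights that shrink as the filter sharpens.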
3. The "Partial" Scorecard (The New Loss Function)
Usually, when training AI, if the answer is wrong, the computer gets a big "F" and tries to fix it. But with messy labels, the computer doesn't even know which of the candidate labels it should be trying to match in the first place.
The authors created a new scoring system called the Partial Zero-Shot Loss.
- Imagine a teacher grading a test where the student circles three answers: A, B, and C. The teacher knows the right answer is A, but the student circled all three.
- Instead of failing the student, the teacher says: "Okay, since you circled A, I'll give you partial credit. But since you also circled B and C, I'll give you less credit for those."
- As the student (the AI) keeps taking the test, the teacher gets smarter at figuring out which answer was actually the right one all along. The AI gradually "discovers" the true label hidden inside the noise.
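The grading analogy can be made concrete with a weighted cross-entropy over the candidate set. To be clear, this is a minimal sketch of the *idea* of a partial-label loss, not the paper's exact Partial Zero-Shot Loss formula: each candidate contributes `-w * log p`, so as the mined weights `w` concentrate on the true label, the loss increasingly rewards only the right answer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def partial_label_loss(logits, candidate_idx, weights):
    """Weighted cross-entropy over the candidate set only: each candidate i
    contributes -weights[i] * log p_i, so plausible candidates ("A") dominate
    the gradient while implausible ones ("B", "C") fade as their weights shrink."""
    probs = softmax(logits)
    return -sum(weights[i] * np.log(probs[i] + 1e-12) for i in candidate_idx)

logits = np.array([2.0, 0.5, 0.3, -1.0])  # model scores for fox, dog, wolf, lion
candidates = [0, 1, 2]                    # noisy label set: [Fox, Dog, Wolf]

uniform = {0: 1/3, 1: 1/3, 2: 1/3}        # start: trust all candidates equally
refined = {0: 0.9, 1: 0.05, 2: 0.05}      # later: mined weights favor "fox"

loss_uniform = partial_label_loss(logits, candidates, uniform)
loss_refined = partial_label_loss(logits, candidates, refined)
print(loss_uniform, loss_refined)  # refined loss is lower
```

Starting from uniform weights and gradually refining them is exactly the "teacher gets smarter over time" behavior: early on, every circled answer earns some credit; later, almost all the credit flows to the answer the evidence supports.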
4. The Goal: Seeing the Unseen
The ultimate goal is Zero-Shot Learning.
- Seen Classes: The animals the robot has been trained on (Fox, Dog, Wolf), but with messy labels.
- Unseen Classes: Animals the robot has never seen (Pangolin, Platypus).
Because the robot learned to ignore the noise and focus on the true meaning of the "Fox" (thanks to the Detective and the Scorecard), it can now look at a picture of a Pangolin and say, "I've never seen this, but it shares traits with the 'Fox' I learned about, so I can recognize it."
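The payoff at test time is that nothing in the classifier is tied to the seen classes: to recognize unseen animals, you simply swap in text embeddings for the new class names. A toy sketch (hypothetical embedding values, same cosine-matching mechanism as before):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names):
    """Pick the class whose text embedding is closest to the image embedding."""
    sims = [image_emb @ t / (np.linalg.norm(image_emb) * np.linalg.norm(t))
            for t in text_embs]
    return class_names[int(np.argmax(sims))]

# Training used seen classes (Fox, Dog, Wolf) with noisy labels; at test time
# we swap in text embeddings for unseen classes -- no new image labels needed.
unseen = ["pangolin", "platypus"]
unseen_embs = [np.array([1.0, 0.2, 0.0]),   # toy embedding of "a photo of a pangolin"
               np.array([0.0, 0.2, 1.0])]   # toy embedding of "a photo of a platypus"
image_emb = np.array([0.9, 0.3, 0.1])       # toy embedding of a pangolin photo

print(zero_shot_classify(image_emb, unseen_embs, unseen))  # prints "pangolin"
```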
Why is this a big deal?
- Realism: It accepts that real-world data is messy and noisy, rather than pretending it's perfect.
- Efficiency: It saves us from spending thousands of dollars hiring experts to label every single photo perfectly.
- Performance: The experiments show that this method is much better at recognizing new things (like rare birds or flowers) even when the training data is full of mistakes, compared to older methods that get confused by the noise.
In short: This paper teaches a robot how to learn from a messy, unreliable teacher, filter out the wrong advice, and still become an expert at recognizing things it has never seen before.