Structure-aware Prompt Adaptation from Seen to Unseen for Open-Vocabulary Compositional Zero-Shot Learning

Imagine you are teaching a very smart but rigid robot how to recognize things in the world.

The Problem: The Robot's "Closed Book"

Currently, most AI models are like students who have studied a specific textbook. They know exactly what a "red apple" or a "ripe banana" looks like because they've seen those exact pictures in their training data.

But the real world is messy. What happens if you show the robot a "wet shirt"?

If it has only seen "dry shirts" and "wet towels," it might get confused.
If it has never seen a "jacket" before, but knows what a "shirt" is, it might struggle to understand that a jacket is just a "bigger, heavier shirt."

This is the challenge of Open-Vocabulary Compositional Zero-Shot Learning (OV-CZSL). The goal is to teach the AI to recognize new combinations (like "wet shirt"), including ones built from words it has never seen (like "damp" or "jacket"), by using what it already knows.

The Old Way: Guessing Without a Map

Previous methods tried to solve this by tuning the AI only on the combinations in its training set. But it's like trying to learn a new language by only memorizing flashcards of specific sentences. If you ask the AI to translate a sentence it hasn't memorized, it fails.

The researchers in this paper noticed something interesting about how AI "thinks" (specifically, a powerful AI called CLIP). They found that in the AI's brain, words with similar meanings are neighbors.

"Wet" and "Damp" sit close together.
"Shirt" and "Jacket" sit close together.

It's like a map where cities with similar cultures are grouped in the same neighborhood.
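To make the "neighbors on a map" idea concrete, here is a minimal sketch of a neighbor lookup by cosine similarity. The three-number vectors are made up for illustration (real CLIP embeddings have hundreds of dimensions), and the names `toy_embeddings`, `cosine_similarity`, and `nearest_neighbor` are hypothetical, not from the paper.

```python
import math

# Hypothetical toy vectors standing in for CLIP text embeddings.
toy_embeddings = {
    "wet":    [0.9, 0.1, 0.0],
    "damp":   [0.8, 0.2, 0.1],
    "shirt":  [0.1, 0.9, 0.2],
    "jacket": [0.0, 0.8, 0.3],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor(word, embeddings):
    """Return the other word whose vector is most similar to `word`."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))

print(nearest_neighbor("damp", toy_embeddings))    # wet
print(nearest_neighbor("jacket", toy_embeddings))  # shirt
```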

The Solution: SPA (Structure-Aware Prompt Adaptation)

The authors created a new method called SPA. Think of SPA as a smart guide that helps the AI navigate this map. It works in two stages:

1. Training: "Don't Mess Up the Neighborhood" (Structure-aware Consistency Loss)

Imagine the AI is learning to draw. Before it starts, the "neighborhoods" of words (like the "Clothing District" or the "Weather District") are already organized nicely by the AI's pre-training.

If you just let the AI learn freely, it might get too excited and mess up the map, moving "Shirt" so far away from "Jacket" that they no longer recognize each other.

The Analogy: It's like a teacher telling a student: "You can learn new facts, but don't move the furniture in the living room! Keep the 'Shirt' and 'Jacket' chairs next to each other."
What SPA does: It adds a rule (a "loss function") that forces the AI to keep these similar words close together while it learns. This preserves the "structure" of the map.

2. Testing: "Use the Neighborhood to Find Your Way" (Structure-guided Adaptation Strategy)

Now, imagine you show the AI a picture of a "Damp Shirt" (a combination it has never seen).

The AI knows "Shirt."
The AI knows "Wet."
But it doesn't know "Damp."

Without SPA, the AI might freeze. With SPA, the AI looks at its map. It sees that "Damp" is a neighbor to "Wet." Since it knows how to handle "Wet Shirt," it uses that knowledge to guess how "Damp Shirt" should look.

The Analogy: It's like meeting a stranger at a party. You don't know them, but you know their friend. You assume the stranger is probably friendly because their friend is friendly. SPA uses the "friend" (the seen concept) to figure out the "stranger" (the unseen concept).

Why is this a big deal?

It's Plug-and-Play: You don't need to rebuild the whole robot. You just plug this "guide" into existing AI models, and they get better at handling new things.
It's Efficient: It doesn't require massive amounts of extra computing power. The paper reports roughly 5% more training time and negligible extra delay at test time.
It Works on the Hard Stuff: The paper shows that while other methods struggle with completely new combinations (like "burnt cake" or "rusty truck"), SPA gets significantly better at guessing them correctly.

The Bottom Line

The paper teaches AI to stop memorizing specific answers and start understanding relationships. By respecting the natural "neighborhoods" of words and using them as a bridge, the AI can generalize its knowledge to understand things it has never seen before, just like a human does.

In short: SPA gives the AI a compass. Instead of getting lost in a new city, it looks at the street signs of the cities it does know to figure out where it is.

1. Problem Statement

Open-Vocabulary Compositional Zero-Shot Learning (OV-CZSL) aims to recognize attribute-object compositions, including novel ones that involve unseen attributes, unseen objects, or both, all of which lie outside the training vocabulary.

Limitation of Existing Methods: While prompt tuning methods using pre-trained Vision-Language Models (VLMs) like CLIP have achieved strong results in standard Compositional Zero-Shot Learning (CZSL), they struggle in the open-vocabulary setting. Directly applying these methods often fails to generalize to unseen concepts because the models are limited to the seen training distribution and lack mechanisms to infer the semantics of novel primitives.
Human Analogy: The paper draws inspiration from human cognition, where humans infer the meaning of unseen concepts (e.g., "damp") by drawing analogies to semantically similar seen concepts (e.g., "wet").
Core Hypothesis: Semantically related attributes and objects form consistent local structures in the embedding space of pre-trained models like CLIP. If these structures are preserved during training, they can serve as a prior to guide the generalization from seen to unseen concepts.

2. Methodology: Structure-aware Prompt Adaptation (SPA)

The authors propose SPA, a plug-and-play framework designed to enhance existing CLIP-based prompt tuning methods. It consists of two main components operating at different stages:

A. Training Stage: Structure-aware Consistency Loss (SCL)

The goal is to preserve the local structural relationships of seen attributes and objects during the fine-tuning process, preventing the model from distorting the semantic neighborhood established by the pre-trained CLIP model.

Mechanism:
1. Extract initial text embeddings ($t^{(0)}$) for seen primitives using the frozen CLIP text encoder.
2. Compute the initial similarity matrix and identify the Top-K most similar neighbors for each primitive to define its local structure.
3. During training, as the prompt parameters are updated, recompute the similarity matrix from the new embeddings ($t^{(+)}$).
4. Loss Function: A KL-divergence loss is applied to enforce consistency between the probability distribution of neighbors in the initial space and the updated space. This ensures that semantically similar concepts remain close even after fine-tuning.
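The four steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the softmax over neighbor similarities, the `temperature` parameter, the KL direction (initial distribution as the target), and the averaging over primitives are all assumptions, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax(scores, temperature=1.0):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_neighbors(embeddings, i, k):
    """Indices of the k primitives most similar to primitive i (excluding i)."""
    sims = [(cosine(embeddings[i], e), j) for j, e in enumerate(embeddings) if j != i]
    sims.sort(reverse=True)
    return [j for _, j in sims[:k]]

def structure_consistency_loss(initial, updated, k=2, temperature=1.0):
    """Mean KL(p || q) over primitives, where p and q are the neighbor
    distributions in the initial and updated embedding spaces.
    Neighbors are always chosen from the initial (frozen) space."""
    total = 0.0
    for i in range(len(initial)):
        nbrs = top_k_neighbors(initial, i, k)
        p = softmax([cosine(initial[i], initial[j]) for j in nbrs], temperature)
        q = softmax([cosine(updated[i], updated[j]) for j in nbrs], temperature)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return total / len(initial)

# Hypothetical 2-D embeddings for four seen attributes.
initial = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# The second primitive has drifted away from its original neighbor.
distorted = [[1.0, 0.0], [-0.5, 1.0], [0.0, 1.0], [0.1, 0.9]]

print(structure_consistency_loss(initial, initial))    # 0.0 (structure unchanged)
print(structure_consistency_loss(initial, distorted))  # positive (structure distorted)
```

The loss is zero when the neighborhood structure is untouched and grows as fine-tuning pulls similar primitives apart, which is the behavior the consistency term is meant to penalize.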

B. Inference Stage: Structure-guided Adaptation Strategy (SAS)

The goal is to dynamically adapt the representations of unseen attributes and objects at test time by leveraging the structural information learned from seen primitives.

Mechanism:
1. Identify the Top-K most similar seen primitives for a given unseen primitive based on their initial CLIP embeddings.
2. Calculate the parameter shift ($\Delta P$) observed in the seen primitives during training (i.e., the difference between their final and initial embeddings).
3. Weighted Aggregation: Compute a weighted average of these shifts using the similarity scores as weights.
4. Adaptation: Apply this aggregated shift to the initial embedding of the unseen primitive to generate an adapted representation ($P^{(+)}$) that aligns with the learned local structure.
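A minimal sketch of these four steps follows, assuming cosine similarity for the neighbor search and weights obtained by dividing each similarity by the sum over the Top-K (which presumes those similarities are positive). The summary does not specify the exact weighting scheme, so treat the normalization and the name `adapt_unseen` as hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def adapt_unseen(unseen_init, seen_init, seen_final, k=2):
    """Shift an unseen primitive's initial embedding by the similarity-weighted
    average of the shifts its Top-K seen neighbors underwent during training."""
    # Step 1: rank seen primitives by similarity in the initial space.
    ranked = sorted(
        ((cosine_similarity(unseen_init, s), j) for j, s in enumerate(seen_init)),
        reverse=True,
    )[:k]
    # Assumption: the Top-K similarities are positive, so they can be
    # normalized into weights that sum to 1.
    weight_sum = sum(sim for sim, _ in ranked)
    shift = [0.0] * len(unseen_init)
    for sim, j in ranked:
        weight = sim / weight_sum
        for d in range(len(shift)):
            # Steps 2 and 3: per-neighbor shift (final - initial), weighted and summed.
            shift[d] += weight * (seen_final[j][d] - seen_init[j][d])
    # Step 4: apply the aggregated shift to the unseen primitive.
    return [x + s for x, s in zip(unseen_init, shift)]

# Hypothetical 2-D embeddings: seen "wet" and "dry", unseen "damp".
seen_init = [[1.0, 0.0], [0.0, 1.0]]
seen_final = [[1.0, 0.4], [0.4, 1.0]]
damp_init = [0.9, 0.1]

print(adapt_unseen(damp_init, seen_init, seen_final, k=1))
print(adapt_unseen(damp_init, seen_init, seen_final, k=2))
```

With `k=1`, "damp" simply inherits the shift of its closest seen neighbor ("wet"); with `k=2`, the shift of "dry" also contributes, but with a much smaller weight.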

3. Key Contributions

Pioneering Exploration: The paper presents the first exploration of CLIP-based prompt tuning specifically tailored for the OV-CZSL task, demonstrating its potential to bridge the gap between closed-set and open-vocabulary settings.
SPA Framework: Introduction of the Structure-aware Prompt Adaptation (SPA) method, which combines:
- SCL: Preserves local structural coherence of seen concepts during training.
- SAS: Aligns unseen concepts with the learned structure of similar seen concepts during inference.
Plug-and-Play Design: SPA is designed as a modular component that can be seamlessly integrated into existing prompt tuning baselines (e.g., CSP, DFSP, Troika) without requiring architectural changes to the backbone.
Empirical Validation: Extensive experiments on four benchmarks (MIT-States, C-GQA, VAW-CZSL, UT-Zappos) show that SPA significantly boosts performance on open-vocabulary splits while maintaining competitive results on seen compositions.

4. Experimental Results

The authors evaluated SPA on four datasets, integrating it with four strong baselines (CSP, HPL, DFSP, Troika).

MIT-States: SPA improved the Harmonic Mean (HM) by +2.6% overall. Notably, it achieved a +11.9% gain on the $A^*O$ (unseen attribute) split and +18.0% on the $A^*O^*$ (unseen attribute & object) split.
C-GQA: On this more challenging dataset, SPA boosted the overall HM by +6.3% and the AUC by +7.8%. The most significant improvement was on the $A^*O^*$ split, showing a +55.1% relative gain (from 7.07 to 10.97).
VAW-CZSL: SPA set new state-of-the-art results, improving the average HM from 16.00 to 17.30 and the $A^*O^*$ score by +33.0%.
UT-Zappos: Even on fine-grained shoe images, SPA achieved a +2.02 absolute gain in HM and a 4x improvement in the $A^*O^*$ metric.
Efficiency: The method introduces minimal computational overhead (approx. 5% increase in training time and negligible inference latency), making it highly practical.
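The results above mix absolute gains (differences in score) and relative gains (percentage change over the baseline), so it helps to keep the two apart. The short sketch below recomputes both kinds from the rounded scores quoted above; the helper names are hypothetical. Recomputing the C-GQA $A^*O^*$ figure from the rounded 7.07 and 10.97 gives about 55.2%, which agrees with the reported +55.1% up to rounding of the underlying scores.

```python
def absolute_gain(baseline, improved):
    """Difference in score, in the same units as the metric."""
    return improved - baseline

def relative_gain(baseline, improved):
    """Percentage change relative to the baseline score."""
    return 100.0 * (improved - baseline) / baseline

# VAW-CZSL average HM: 16.00 -> 17.30
print(round(absolute_gain(16.00, 17.30), 2))  # 1.3

# C-GQA A*O* split: 7.07 -> 10.97
print(round(relative_gain(7.07, 10.97), 1))   # 55.2
```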

5. Significance and Impact

Bridging the Semantic Gap: SPA effectively addresses the "semantic gap" in open-vocabulary learning by utilizing the inherent structural priors of large-scale pre-trained models (CLIP) rather than relying on external semantic knowledge bases or weak encoders.
Robust Generalization: By enforcing structural consistency, the method helps keep the model from overfitting to the training distribution, allowing it to generalize more robustly to novel concepts.
Scalability: The approach is computationally efficient and scalable, offering a viable solution for real-world scenarios where new attributes and objects frequently emerge.
Qualitative Insights: Visual analysis confirms that SPA corrects semantic errors in unseen compositions (e.g., distinguishing "barren" from "weathered") by leveraging analogical reasoning, while maintaining high accuracy on seen data.

In conclusion, this paper establishes that preserving and leveraging the local structural consistency of semantic embeddings is a critical factor for successful Open-Vocabulary Compositional Zero-Shot Learning, offering a simple yet highly effective framework for future research in this domain.