WARM-CAT: Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

The Big Problem: The "New Menu" Surprise

Imagine you are a chef who has spent years perfecting a menu. You know exactly how to cook a "Red Apple" and a "Green Apple." You are an expert.

One day, a customer walks in and orders a "Blue Apple." You've never seen a blue apple before. In the world of AI, this is called Compositional Zero-Shot Learning (CZSL). The AI knows "Blue" and it knows "Apple," but it has never seen them combined.

The Old Way (The Broken Chef):
Traditional AI models are like chefs who freeze their recipe book the moment they stop training. When the "Blue Apple" order comes in, the chef panics. They might guess "Red Apple" because that's what they know best, or they get confused because the "Blue Apple" doesn't look like anything in their frozen memory. The AI fails because the world changed (new items appeared), but the AI didn't update its knowledge.

The Solution: WARM-CAT (The Adaptive Chef)

The authors propose a new system called WARM-CAT (Warm-Started Test-Time Comprehensive Knowledge Accumulation). Think of it as a chef who doesn't just memorize recipes but learns while cooking.

Here is how WARM-CAT works, broken down into four simple steps:

1. The "Warm Start" (Getting Ready Before the Rush)

Usually, when a new customer arrives, the chef starts with an empty counter. This is risky because the chef might guess wrong immediately.

What WARM-CAT does: Before the first customer arrives, the chef sets up a "Warm Start." They take all the apples they do know (Red, Green) and put them on the counter.
The Magic Trick: For the unknown apples (Blue, Purple), the chef uses a clever trick. They look at the "Red Apple" and the "Green Apple" and say, "If Red + Apple = Red Apple, and Green + Apple = Green Apple, then logically, Blue + Apple should look like a 'Blue Apple'." They create a virtual prototype (a mental image) of the Blue Apple based on the patterns they already know. This prevents the chef from being biased toward only the old, familiar fruits.

2. The "Priority Queue" (The VIP Shelf)

As customers come in, the chef sees many fruits. They can't remember every single one perfectly.

What WARM-CAT does: It keeps a special Priority Queue (a VIP shelf) that holds the 3 best examples of every fruit it has seen so far.
How it works: If a customer brings a "Blue Apple," the chef looks at it. If the chef is very confident it's a Blue Apple, they put a photo of it on the VIP shelf. When another "Blue Apple" arrives and the shelf is full, the chef compares it with the photos already there. If the chef is more confident about the new one, it replaces the weakest photo.
Why it helps: The AI doesn't just guess; it builds a library of high-quality examples from the current day's customers to help it recognize future customers better.

3. The "Adaptive Update" (Knowing When to Change Your Mind)

Sometimes, a customer brings a fruit that looks weird. Should the chef change their entire recipe book?

The Old Way: The chef might change their mind too easily (forgetting what a Red Apple is) or not at all (stuck on the old ways).
What WARM-CAT does: It uses a smart Adaptive Weight.
- If the new fruit looks very similar to what the chef already knows, the chef makes a tiny, careful adjustment.
- If the new fruit looks very different (like a Blue Apple), the chef makes a bigger, bolder adjustment to learn this new thing.
- This ensures the chef learns new things without forgetting the old ones.

4. The "Double Check" (Text vs. Vision)

The chef has two ways of thinking:

The Text Book: "A Blue Apple is a fruit that is blue."
The Visual Eye: "This object looks round and blue."

What WARM-CAT does: It constantly checks whether the Text Book and the Visual Eye agree. If they disagree, it uses a special learning method to bring them into alignment. This ensures the AI isn't guessing from words alone or from blurry pictures alone; it combines both for a more reliable answer.

The New Tools: C-Fashion and MIT-States*

The authors realized that to test this new chef, they needed better test kitchens.

C-Fashion: They created a brand new dataset focused on clothing. Clothes come with attributes such as colors and styles (e.g., "Striped Shirt," "Red Dress"), so this dataset tests whether the AI can recognize new attribute-garment combinations in fashion.
MIT-States*: They found an old dataset (MIT-States) that was full of errors (like labeling a "Red Shirt" as "Blue"). They cleaned it up, like scrubbing a dirty kitchen, to make sure the test results were fair.

The Result

When they tested WARM-CAT against other AI models:

It handled new combinations (like "Blue Apple") much better.
It didn't get confused by rare items (like a "Purple Car" when most cars are Red).
It worked well whether the world was predictable (Closed-World) or full of surprises (Open-World).

Summary

WARM-CAT is an AI that doesn't just memorize; it learns on the job. It starts with a smart guess for new things, keeps a VIP shelf of the best examples it sees, updates its knowledge carefully, and checks its work using both words and pictures. This allows it to recognize brand-new combinations of things that it has never seen before, just like a human would.

1. Problem Statement

Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions (e.g., "wilted sunflower") based on knowledge learned from seen compositions (e.g., "fresh rose").

Core Challenge: Existing methods suffer from performance degradation at test time due to label space distribution shift. Models are trained on a fixed set of seen compositions but must predict unseen ones during testing.
Limitations of Current Approaches:
- Most state-of-the-art methods freeze model parameters and class prototypes after training, preventing them from adapting to the new distribution of unseen compositions encountered during testing.
- They often rely solely on textual prototypes (from Vision-Language Models like CLIP) and ignore the potential of accumulating visual knowledge from historical test images.
- Existing benchmarks (e.g., MIT-States) contain significant label noise, and there is a lack of fashion-specific benchmarks for compositional reasoning.

2. Methodology: WARM-CAT

The authors propose WARM-CAT, a framework that leverages unsupervised test-time data to accumulate comprehensive multimodal knowledge and adapt prototypes dynamically.

A. Training Phase

Base Model: A CLIP-based model is fine-tuned using Prompt Tuning (learnable soft tokens for text) and Adapter Tuning (lightweight modules inserted into the visual encoder) on the seen training data.
Objective: Align image and text representations using contrastive learning to establish a strong baseline before testing.

B. Test-Time Adaptation (The Core Innovation)

During the test phase, the model processes a stream of unlabeled images and updates its internal representations without backpropagating through the entire backbone.

Multimodal Prototype Construction:
- Textual Prototypes: Derived from the frozen text encoder of the base model.
- Visual Prototypes: Constructed from a Dynamic Priority Queue that stores high-confidence images (low prediction entropy) for each class.
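The per-class queue described above can be sketched as a bounded heap keyed on prediction entropy. This is a minimal illustration, not the authors' implementation: the class name `PriorityQueueBank`, the choice to store feature vectors rather than raw images, the capacity value, and the use of the mean as the visual prototype are all assumptions made for the example.

```python
import heapq
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


class PriorityQueueBank:
    """Keeps the `capacity` lowest-entropy features seen so far for each class.

    Hypothetical helper for illustration only.
    """

    def __init__(self, capacity=3):
        self.capacity = capacity
        # class -> min-heap of (-entropy, counter, feature); the root is the
        # least confident (highest-entropy) entry currently stored.
        self._heaps = defaultdict(list)
        self._counter = 0  # tie-breaker so feature lists are never compared

    def push(self, cls, feature, probs):
        item = (-entropy(probs), self._counter, list(feature))
        self._counter += 1
        heap = self._heaps[cls]
        if len(heap) < self.capacity:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            # The new sample is more confident than the worst stored one.
            heapq.heapreplace(heap, item)

    def prototype(self, cls):
        """Mean of the stored features, or None if nothing is stored."""
        feats = [f for _, _, f in self._heaps.get(cls, [])]
        if not feats:
            return None
        return [sum(col) / len(feats) for col in zip(*feats)]


bank = PriorityQueueBank(capacity=2)
bank.push("blue apple", [1.0, 0.0], [0.9, 0.1])
bank.push("blue apple", [0.0, 1.0], [0.5, 0.5])    # least confident entry
bank.push("blue apple", [3.0, 0.0], [0.99, 0.01])  # evicts the entry above
print(bank.prototype("blue apple"))  # [2.0, 0.0]
```

Keeping the least confident entry at the heap root makes both the admission check and the eviction a single comparison against `heap[0]`.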
Priority Queue Warm-Start:
- Problem: If the queue starts empty, the model is biased toward compositions it has already seen in the test stream.
- Solution:
  - Seen Compositions: Initialized with visual features from the training set.
  - Unseen Compositions: Since no training images exist, virtual visual prototypes are generated. The authors learn a mapping matrix ($M$) between seen and unseen textual prototypes and apply this mapping to the seen visual prototypes to synthesize initial unseen visual prototypes.
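One way to picture the warm start for unseen compositions is a least-squares fit: find $M$ such that $M T_{seen} \approx T_{unseen}$ in the textual space, then reuse the same $M$ on the seen visual prototypes. The sketch below assumes that objective and fits it with plain gradient descent; the summary does not specify the authors' actual objective or solver, and the function names `fit_mapping` and `virtual_visual_prototypes` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def fit_mapping(text_seen, text_unseen, lr=0.1, steps=500):
    """Fit M (unseen x seen) so that M @ text_seen is close to text_unseen.

    Assumed objective: 0.5 * ||M T_seen - T_unseen||^2, minimized by gradient
    descent. The step size suits small, roughly unit-norm prototypes only.
    """
    n_unseen, n_seen = len(text_unseen), len(text_seen)
    m = [[0.0] * n_seen for _ in range(n_unseen)]
    text_seen_t = transpose(text_seen)
    for _ in range(steps):
        pred = matmul(m, text_seen)
        residual = [[p - t for p, t in zip(pr, tr)]
                    for pr, tr in zip(pred, text_unseen)]
        grad = matmul(residual, text_seen_t)
        m = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(m, grad)]
    return m


def virtual_visual_prototypes(text_seen, text_unseen, visual_seen):
    """Apply the text-space mapping to the seen visual prototypes."""
    m = fit_mapping(text_seen, text_unseen)
    return matmul(m, visual_seen)


# Toy example: the unseen text prototype is halfway between the two seen ones,
# so its virtual visual prototype lands halfway between the seen visual ones.
text_seen = [[1.0, 0.0], [0.0, 1.0]]
text_unseen = [[0.5, 0.5]]
visual_seen = [[2.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
print(virtual_visual_prototypes(text_seen, text_unseen, visual_seen))
```

In the toy example the result is approximately `[1.0, 2.0, 0.0]`, since the fitted mapping converges to weights of 0.5 on each seen composition.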
Knowledge Accumulation Module (KAM):
- Instead of updating the frozen backbone, WARM-CAT introduces learnable KAM parameters ($\Delta t$ and $\Delta v$) initialized to zero.
- These parameters adjust the textual and visual prototypes based on incoming test samples.
Adaptive Update Weight (AUW):
- To prevent catastrophic forgetting or over-adaptation, the magnitude of the prototype update is controlled by an adaptive weight ($w_c$).
- The weight is calculated based on the cosine similarity between the current test image and the original prototype.
- Logic: If the image is similar to the original prototype (likely a seen composition), updates are minimized. If the image is dissimilar (likely an unseen composition), updates are allowed to be stronger to adapt to the new distribution.
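The interaction between the KAM residuals and the adaptive weight can be sketched as follows. The exact formula for $w_c$ is not given in this summary, so the mapping $w_c = (1 - \cos)/2$ used here is only an assumed stand-in that reproduces the stated behaviour (similar image, small update; dissimilar image, larger update). The function names and the re-normalization step are likewise assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def adaptive_weight(image_feat, original_proto):
    """Assumed form: map cosine similarity in [-1, 1] to a weight in [0, 1].

    High similarity gives a weight near 0, low similarity a weight near 1.
    """
    return (1.0 - cosine(image_feat, original_proto)) / 2.0


def adapted_prototype(original_proto, delta, weight):
    """Original prototype plus a weighted residual, re-normalized to unit length.

    `delta` plays the role of a KAM residual; while it is all zeros the
    prototype is unchanged. Assumes the shifted vector is non-zero.
    """
    shifted = [p + weight * d for p, d in zip(original_proto, delta)]
    norm = math.sqrt(sum(x * x for x in shifted))
    return [x / norm for x in shifted]


proto = [1.0, 0.0]
delta = [0.0, 2.0]  # hypothetical learned residual

w_similar = adaptive_weight([1.0, 0.0], proto)     # 0.0: image matches prototype
w_different = adaptive_weight([0.0, 1.0], proto)   # 0.5: image is orthogonal

print(adapted_prototype(proto, delta, w_similar))    # prototype left as-is
print(adapted_prototype(proto, delta, w_different))  # prototype rotated toward delta
```

Because the residuals start at zero, the adapted prototypes coincide with the original ones before any test-time update has been made, whatever the weight.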
Optimization Objectives:
- Prediction Entropy Minimization ($L_{PE}$): Encourages the model to make confident predictions on the test distribution.
- Multimodal Collaborative Representation Learning ($L_{MCRL}$): A contrastive loss that aligns the updated textual and visual prototypes, ensuring semantic consistency between modalities.
- Total Loss: $L = L_{PE} + \lambda L_{MCRL}$. The KAM parameters are updated by minimizing this loss, while the inference step uses the updated prototypes.
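To make the combined objective concrete, here is a small sketch of $L = L_{PE} + \lambda L_{MCRL}$. The entropy term follows directly from its definition; the contrastive term is written as a symmetric InfoNCE loss between per-class textual and visual prototypes, which is an assumption about the form of $L_{MCRL}$ rather than the authors' exact formulation. The temperature and $\lambda$ values are placeholders.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """L_PE for one test sample: entropy of the softmax over class logits."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mcrl_loss(text_protos, visual_protos, temperature=0.1):
    """Assumed form of L_MCRL: symmetric InfoNCE over matching class prototypes."""
    n = len(text_protos)
    loss = 0.0
    for c in range(n):
        t2v = [dot(text_protos[c], v) / temperature for v in visual_protos]
        v2t = [dot(visual_protos[c], t) / temperature for t in text_protos]
        loss -= math.log(softmax(t2v)[c]) + math.log(softmax(v2t)[c])
    return loss / (2 * n)


def total_loss(logits, text_protos, visual_protos, lam=1.0):
    """L = L_PE + lambda * L_MCRL."""
    return prediction_entropy(logits) + lam * mcrl_loss(text_protos, visual_protos)


text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

print(prediction_entropy([0.0, 0.0, 0.0, 0.0]))  # uniform prediction: ln(4)
print(mcrl_loss(text, aligned) < mcrl_loss(text, swapped))  # True
```

In a real setup the loss would be differentiated with respect to the KAM parameters by an autograd framework; the pure-Python version only shows what is being minimized.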

3. Key Contributions

Novel Framework (WARM-CAT): The first approach to leverage unsupervised test-time data for CZSL to bridge label distribution shifts via multimodal knowledge accumulation.
Warm-Started Priority Queue: A mechanism to initialize visual prototypes for both seen and unseen compositions, preventing bias toward historical test images and ensuring balanced adaptation.
New Benchmarks & Refinements:
- C-Fashion: A new benchmark dataset for compositional reasoning in the fashion domain (based on FashionIQ), addressing a gap in existing literature.
- MIT-States*: A refined, cleaned version of the noisy MIT-States dataset (removing ~70% incorrect labels).
Evaluation Metrics: Introduction of metrics tailored for long-tailed distributions in CZSL (Head/Body/Tail accuracy) to better assess model robustness across frequent and rare classes.
State-of-the-Art Performance: Demonstrated superior results across four datasets in both closed-world and open-world settings.

4. Experimental Results

Datasets: Evaluated on UT-Zappos, C-Fashion, C-GQA, and the refined MIT-States*.
Closed-World Results: WARM-CAT achieved State-of-the-Art (SOTA) performance.
- On UT-Zappos, it improved the Harmonic Mean (HM) from 60.2% (previous SOTA TOMCAT) to 64.3%.
- On C-Fashion, it achieved an HM of 63.6%, outperforming all baselines.
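For readers unfamiliar with the metric, HM in CZSL is conventionally the harmonic mean of accuracy on seen and unseen compositions, which penalizes a model that does well on one group at the expense of the other. The snippet below assumes that convention; the input values are made-up illustrations, not results from the paper.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen and unseen accuracy (both on the same scale)."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)


# Illustrative values only: a model that is strong on seen compositions but
# weak on unseen ones scores below the arithmetic mean of 0.6.
print(round(harmonic_mean(0.8, 0.4), 4))  # 0.5333
```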
Open-World Results: Consistently outperformed baselines, showing robustness in larger search spaces.
Long-Tailed Analysis: WARM-CAT significantly reduced the performance gap between "Head" (frequent) and "Tail" (rare) classes compared to previous methods, demonstrating better generalization to rare compositions.
Ablation Studies:
- Confirmed that warm-starting the queue is crucial for preventing bias.
- Showed that Adaptive Update Weights are essential to balance stability and plasticity.
- Validated that Multimodal alignment (Text + Visual) is superior to single-modality approaches.

5. Significance

Paradigm Shift: Moves CZSL from a static "train-and-forget" paradigm to a dynamic "continuous learning" paradigm at test time, mimicking human ability to adapt to new contexts without retraining.
Practical Applicability: The method is designed for real-world scenarios where systems encounter distribution shifts and unlabeled user data post-deployment.
Resource Efficiency: The backbone is adapted with lightweight adapters and prompt tuning rather than full fine-tuning, and at test time only the KAM parameters are updated while the backbone stays frozen, so the test phase remains computationally efficient.
Community Impact: The release of the C-Fashion dataset and the cleaned MIT-States* provides the community with more reliable and challenging benchmarks for future research in compositional reasoning.