A Multihead Continual Learning Framework for Fine-Grained Fashion Image Retrieval with Contrastive Learning and Exponential Moving Average Distillation

This paper proposes MCL-FIR, a multihead continual learning framework that integrates contrastive learning and exponential moving average distillation to achieve efficient, high-accuracy fine-grained fashion image retrieval in dynamic, class-incremental scenarios while significantly reducing training costs compared to static methods.

Ling Xiao, Toshihiko Yamasaki

Published 2026-03-24

Imagine you are running a very high-end fashion boutique. Your goal is to help customers find the perfect item based on very specific details: "I want a skirt that is knee-length," or "Show me coats with a specific type of collar."

The Problem: The "Re-Training" Nightmare
Currently, most computer systems that do this are like a student who has to go back to school and re-learn everything every time a new fashion trend appears.

  • If you teach the system to recognize "sleeve length," and then a customer asks about "collar design," the old system forgets how to measure sleeves and has to re-study the entire database from scratch.
  • This is slow, expensive, and impractical. It's like a librarian who has to re-shelve every single book in the library just to add one new book about a new author.

The Solution: MCL-FIR (The "Specialized Team" Approach)
The authors of this paper propose a new system called MCL-FIR. Instead of one giant brain trying to remember everything at once, they built a team of specialists.

Here is how it works, using simple analogies:

1. The "Multi-Head" Team (Specialized Experts)

Imagine your fashion AI isn't one person, but a team of experts.

  • The Shared Brain: Everyone on the team shares a common knowledge base (the "Image Encoder") that knows what a piece of clothing looks like generally.
  • The Specialized Hats (Heads): When the team needs to learn about "sleeve length," they put on a specific "Sleeve Hat." When they need to learn about "collar design," they swap to a "Collar Hat."
  • The Magic: When a new trend arrives (e.g., "new fabric texture"), the team just puts on a new hat. They don't have to re-learn how to see clothes; they just add a new tool to their belt. This means they never forget how to measure sleeves while learning about fabrics.
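The hat-swapping idea can be sketched in a few lines of Python. Everything below — the class name, the dimensions, the random matrix standing in for the image encoder — is illustrative, not the paper's actual code; the point is that adding a new attribute only adds a new head, leaving the shared encoder and all existing heads untouched:

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiHeadRetriever:
    """Shared encoder plus one lightweight projection head per attribute.
    All names and sizes are illustrative, not from the paper."""

    def __init__(self, feat_dim=16, embed_dim=8):
        self.feat_dim = feat_dim
        self.embed_dim = embed_dim
        # Random matrix as a stand-in for the shared image encoder.
        self.encoder = rng.standard_normal((feat_dim, feat_dim))
        self.heads = {}  # attribute name -> projection matrix ("hat")

    def add_head(self, attribute):
        # A new trend arrives: create only a new head.
        # The shared encoder and existing heads are not modified.
        self.heads[attribute] = rng.standard_normal((self.feat_dim, self.embed_dim))

    def embed(self, images, attribute):
        shared = np.tanh(images @ self.encoder)   # shared representation
        return shared @ self.heads[attribute]     # attribute-specific embedding

model = MultiHeadRetriever()
model.add_head("sleeve_length")
old_head = model.heads["sleeve_length"].copy()

model.add_head("collar_design")  # learn a new attribute
# The old head is bit-for-bit unchanged: no forgetting by construction.
assert np.array_equal(old_head, model.heads["sleeve_length"])
```

Because each head is tiny compared to the shared encoder, swapping or adding "hats" is cheap, which is what makes the class-incremental setting tractable.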

2. The "Double-Date" Strategy (Simpler Learning)

Old systems learned by playing a game of "find the odd one out" (a triplet loss). They would show the computer three items at once:

  • Item A (the anchor, i.e., the target)
  • Item B (a similar item that matches the target)
  • Item C (a completely different, non-matching item)

The computer had to figure out why A and B were closer than A and C. This is like trying to find a needle in a haystack while juggling three balls: slow, memory-hungry, and confusing.
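The three-item game is usually written as a triplet loss. This tiny sketch (function name and margin value are illustrative) shows why it needs all three items in hand at once:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Classic triplet objective: the positive must sit closer to the
    anchor than the negative does, by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # anchor <-> match distance
    d_neg = np.sum((anchor - negative) ** 2)  # anchor <-> non-match distance
    return max(0.0, d_pos - d_neg + margin)
```

Sampling a good negative (Item C) for every pair is exactly the expensive, fiddly step that the pairwise approach below avoids.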

MCL-FIR changes the game. It uses a "Double-Date" approach (InfoNCE loss).

  • It only looks at two items at a time: The Target and its Perfect Match.
  • It asks, "How similar are these two?"
  • The Analogy: Instead of trying to find the best match among a crowd of 100 strangers, the computer just focuses on the one person holding hands with the target. It's much faster, requires less computing power, and is less likely to get confused.
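Here is a minimal sketch of an InfoNCE-style loss over sampled pairs. The function name, batch layout, and temperature value are assumptions for illustration, not the paper's exact formulation; the key property is that only (target, match) doublets are sampled, while the other pairs in the batch act as negatives for free:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over doublets: row i of `anchors` is paired with row i of
    `positives`; the remaining rows in the batch serve as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature               # all pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The "correct" match for row i is column i.
    return -np.mean(np.diag(log_probs))
```

No explicit negative mining is needed, which is where the speed and memory savings over triplet training come from.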

3. The "Photographic Memory" Teacher (EMA Distillation)

One of the biggest fears in AI is "Catastrophic Forgetting"—where learning something new makes the AI forget everything it knew yesterday.

To stop this, MCL-FIR uses a "Teacher" system.

  • Imagine the main AI is a student taking notes.
  • The "Teacher" is a slow-moving, calm version of the student (an Exponential Moving Average). The Teacher remembers the average of everything the student has learned over time.
  • As the student learns new things, the Teacher gently reminds them, "Hey, don't forget how to measure a skirt; you were really good at that last week."
  • This ensures the AI stays stable and doesn't lose its old skills while picking up new ones.
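The teacher update is just an exponential moving average of the student's weights, plus a penalty when the student's outputs drift away from the teacher's. A minimal sketch (the decay value and function names are illustrative):

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """Move the teacher a tiny step toward the student each iteration,
    so it tracks a slow running average of past student weights."""
    return decay * teacher + (1.0 - decay) * student

def distill_penalty(student_out, teacher_out):
    """The 'gentle reminder': penalize the student when its outputs
    stray from the calm, slow-moving teacher's outputs."""
    return np.mean((student_out - teacher_out) ** 2)
```

A high decay (e.g., 0.999) is what makes the teacher "slow and calm": one noisy training step barely moves it, but sustained learning eventually shifts it.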

The Results: Fast, Cheap, and Accurate

The paper tested this system on thousands of fashion images.

  • Efficiency: It achieved results just as good as the old, slow systems but used only 30% of the training time and cost.
  • Scalability: It can keep adding new attributes (like "shoe type" or "hat style") without ever crashing or forgetting the old ones.
  • Real-World Ready: It works even when the data is messy or when the order in which it learns things changes.

In Summary:
Think of MCL-FIR as a fashion consultant who doesn't need to go back to fashion school every time a new style drops. Instead, they have a modular toolkit: they can instantly grab a new "style guide" (the head), learn the new trend quickly using a simplified pairing method (the "Double-Date"), and keep their old knowledge safe thanks to a gentle reminder system (the teacher). This makes finding the perfect outfit faster, cheaper, and smarter for everyone.
