WMoE-CLIP: Wavelet-Enhanced Mixture-of-Experts Prompt Learning for Zero-Shot Anomaly Detection

This paper proposes WMoE-CLIP, a zero-shot anomaly detection method that overcomes the limitations of fixed prompts and spatial-only features by integrating variational autoencoder-based semantic modeling, wavelet decomposition for multi-frequency feature refinement, and a semantic-aware mixture-of-experts module, achieving state-of-the-art performance across 14 industrial and medical datasets.

Peng Chen, Chao Huang

Published 2026-03-09

Imagine you are a quality control inspector at a massive factory that makes everything from tiny medical pills to giant car engines. Your job is to spot defects (anomalies) in products.

The problem? You've never seen these specific defects before. Maybe a new type of scratch appeared on a phone screen, or a weird tumor showed up on an X-ray. You don't have a "training manual" with pictures of these specific bad things because they are unseen. This is called Zero-Shot Anomaly Detection.

In the past, computers tried to solve this by using a giant, pre-trained brain (like CLIP, a famous AI that knows what "a cat" or "a broken cup" looks like because it was trained on millions of images paired with their captions). But these old methods had two big flaws:

  1. They were too rigid: They used fixed sentences (prompts) like "This is a good product" or "This is a bad product." It's like trying to describe a complex crime scene using only the words "good" or "bad." You miss the details.
  2. They only looked at the "big picture": They looked at the overall shape of the object but missed tiny, subtle cracks or texture changes that happen in the high-frequency details (like the static on an old TV).

Enter WMoE-CLIP, the new superhero inspector. Here is how it works, explained with simple analogies:

1. The "Shape-Shifting" Prompt (CTDS)

The Problem: Old AI inspectors used the same boring sentence for every product.
The Solution: Imagine you have a magical chameleon. Instead of saying "This is a bad apple," the chameleon changes its color and texture to match the specific apple you are looking at.

  • How it works: The system uses a Variational Autoencoder (VAE). Think of this as a "dream machine." It looks at the whole product, takes a "snapshot" of its general vibe, and then uses that snapshot to rewrite the AI's instructions on the fly. It turns a generic prompt into a personalized, super-detailed description that fits the specific item, making the AI much more adaptable.
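The "dream machine" idea above can be sketched in a few lines. This is a minimal toy, not the paper's actual CTDS module: the dimensions, weight matrices, and the simple linear encoder/decoder are all illustrative assumptions. The key mechanism it shows is real, though: a VAE-style reparameterized latent, computed from the image's global feature, is decoded into offsets that rewrite the prompt tokens per image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): CLIP-like feature dim 512,
# latent dim 64, 4 learnable prompt tokens.
D, LATENT, N_TOKENS = 512, 64, 4

# Toy VAE encoder: image feature -> mean and log-variance of a latent "vibe".
W_mu = rng.normal(0, 0.02, (LATENT, D))
W_logvar = rng.normal(0, 0.02, (LATENT, D))
# Toy decoder: latent -> per-token offsets for the prompt embeddings.
W_dec = rng.normal(0, 0.02, (N_TOKENS * D, LATENT))

base_prompt = rng.normal(0, 0.02, (N_TOKENS, D))  # the generic, fixed prompt

def dynamic_prompt(image_feat):
    """Condition the prompt on the image via a reparameterized latent."""
    mu, logvar = W_mu @ image_feat, W_logvar @ image_feat
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=LATENT)  # reparameterization trick
    offset = (W_dec @ z).reshape(N_TOKENS, D)
    return base_prompt + offset  # prompt rewritten "on the fly" for this item

prompt = dynamic_prompt(rng.normal(size=D))
print(prompt.shape)  # (4, 512)
```

Each image thus gets its own version of the prompt, instead of every product sharing one fixed sentence.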

2. The "Super-Resolution" Glasses (WCMA)

The Problem: Old AI missed tiny scratches because it only looked at the smooth, low-resolution version of the image.
The Solution: Imagine putting on a pair of magic glasses that can see the invisible.

  • How it works: The system uses Wavelet Decomposition. Think of an image like a song. The low notes are the melody (the big shapes), and the high notes are the crisp cymbals and hi-hats (the tiny details and textures).
  • The AI separates the image into these "frequencies." It focuses heavily on the high-frequency parts (the cymbals) where tiny defects hide. It then uses these sharp details to "polish" the AI's text instructions. It's like the AI saying, "Okay, I see a 'good apple,' but now that I'm looking at the high-frequency details, I see a tiny bruise, so I'll update my thought to 'apple with a bruise'."
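The "melody vs. cymbals" split can be made concrete with a one-level Haar wavelet transform, the simplest wavelet decomposition. This sketch (plain NumPy, hand-rolled Haar filters; the paper's actual wavelet choice and fusion are not reproduced here) shows the point of the analogy: a smooth patch has zero high-frequency energy, while a tiny scratch shows up almost entirely in the detail bands.

```python
import numpy as np

def haar_dwt2(img):
    """One level of 2-D Haar wavelet decomposition.
    Returns LL (the 'melody': coarse shape) and LH/HL/HH
    (the 'cymbals': horizontal, vertical, diagonal detail)."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # row averages (low-pass)
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # row differences (high-pass)
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return LL, LH, HL, HH

def high_freq_energy(img):
    _, LH, HL, HH = haar_dwt2(img)
    return float((LH**2 + HL**2 + HH**2).sum())

# A smooth "good apple" patch vs. the same patch with a one-pixel scratch.
smooth = np.ones((8, 8))
scratched = smooth.copy()
scratched[3, 4] = 0.0

print(high_freq_energy(smooth))     # 0.0 — perfectly smooth, nothing hides here
print(high_freq_energy(scratched))  # > 0 — the scratch lives in the high bands
```

Notice the defect barely changes the coarse LL band; it is the high-frequency sub-bands that light up, which is exactly why WCMA focuses on them.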

3. The "Council of Experts" (SA-MoE)

The Problem: Sometimes, one way of looking at a problem isn't enough. A scratch might look different depending on the lighting or the angle.
The Solution: Instead of one inspector making the final call, you hire a Council of Experts.

  • How it works: This is the Mixture-of-Experts (MoE) module. Imagine a round table with 8 different specialists (Experts).
    • Expert 1 is great at spotting scratches.
    • Expert 2 is great at spotting color changes.
    • Expert 3 is great at spotting weird shapes.
  • A Router (the meeting moderator) looks at the product and says, "For this specific item, let's listen to Expert 1 and Expert 3." It combines their opinions to make a final, highly accurate decision. This ensures the AI doesn't miss anything because it's listening to the right "voice" for the job.
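The round-table routine above boils down to a standard top-k MoE forward pass. This sketch is a generic illustration, not the paper's SA-MoE: the expert count, top-k of 2, and tiny linear experts are assumptions, but the routing logic (score all experts, keep the best few, blend their outputs with softmax weights) is the core of any MoE layer.

```python
import numpy as np

rng = np.random.default_rng(1)
N_EXPERTS, TOP_K, D = 8, 2, 16  # 8 specialists; listen to the 2 most relevant

# Each "expert" is a tiny linear map; the router scores which ones to consult.
experts = [rng.normal(0, 0.1, (D, D)) for _ in range(N_EXPERTS)]
W_router = rng.normal(0, 0.1, (N_EXPERTS, D))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe(feat):
    """Route the feature to the top-k experts and blend their opinions."""
    scores = W_router @ feat               # the moderator sizes up the item
    top = np.argsort(scores)[-TOP_K:]      # "let's listen to these two experts"
    gates = softmax(scores[top])           # how much weight each opinion gets
    return sum(g * (experts[i] @ feat) for g, i in zip(gates, top))

out = moe(rng.normal(size=D))
print(out.shape)  # (16,)
```

Because only the selected experts run, the model gains capacity (many specialists) without paying the full compute cost of consulting all of them on every item.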

The Result

When you put all three of these tools together:

  1. Personalized Instructions (The Chameleon)
  2. High-Definition Vision (The Magic Glasses)
  3. A Team of Specialists (The Council)

The result is an AI that can look at a product it has never seen before and say, "I know this is broken," with incredible accuracy.

Why does this matter?
The researchers tested this on 14 different datasets, ranging from industrial inspection (checking for broken bottles or leather defects) to hospitals (finding tumors in brains or polyps in guts). The results showed that WMoE-CLIP achieves state-of-the-art performance, outperforming previous zero-shot methods. It's like upgrading from a flashlight to a high-tech laser scanner for finding the tiniest flaws in the world.