WMoE-CLIP: Wavelet-Enhanced Mixture-of-Experts Prompt Learning for Zero-Shot Anomaly Detection

This paper proposes WMoE-CLIP, a zero-shot anomaly detection method that overcomes the limitations of fixed prompts and spatial-only features by integrating variational autoencoder-based semantic modeling, wavelet decomposition for multi-frequency feature refinement, and a semantic-aware mixture-of-experts module, achieving state-of-the-art performance across 14 industrial and medical datasets.

Peng Chen, Chao Huang

Published 2026-03-09

Imagine you are a quality control inspector at a massive factory that makes everything from tiny medical pills to giant car engines. Your job is to spot defects (anomalies) in products.

The problem? You've never seen these specific defects before. Maybe a new type of scratch appeared on a phone screen, or a weird tumor showed up on an X-ray. You don't have a "training manual" with pictures of these specific bad things because they are unseen. This is called Zero-Shot Anomaly Detection.

In the past, computers tried to solve this by using a giant, pre-trained brain (like CLIP, a famous AI that knows what "a cat" or "a broken cup" looks like because it was trained on millions of images paired with their captions). But these old methods had two big flaws:

  1. They were too rigid: They used fixed sentences (prompts) like "This is a good product" or "This is a bad product." It's like trying to describe a complex crime scene using only the words "good" or "bad." You miss the details.
  2. They only looked at the "big picture": They looked at the overall shape of the object but missed tiny, subtle cracks or texture changes that happen in the high-frequency details (like the static on an old TV).

Enter WMoE-CLIP, the new superhero inspector. Here is how it works, explained with simple analogies:

1. The "Shape-Shifting" Prompt (CTDS)

The Problem: Old AI inspectors used the same boring sentence for every product.
The Solution: Imagine you have a magical chameleon. Instead of saying "This is a bad apple," the chameleon changes its color and texture to match the specific apple you are looking at.

  • How it works: The system uses a Variational Autoencoder (VAE). Think of this as a "dream machine." It looks at the whole product, takes a "snapshot" of its general vibe, and then uses that snapshot to rewrite the AI's instructions on the fly. It turns a generic prompt into a personalized, super-detailed description that fits the specific item, making the AI much more adaptable.
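The "dream machine" idea above can be sketched in a few lines. This is a minimal toy, not the paper's actual CTDS module: the dimensions, weight matrices, and the simple linear encoder/decoder are all illustrative assumptions. The key mechanism it shows is real, though: a VAE-style reparameterized latent, computed from the image's global feature, is decoded into offsets that rewrite the prompt tokens per image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): CLIP-like feature dim 512,
# latent dim 64, 4 learnable prompt tokens.
D, LATENT, N_TOKENS = 512, 64, 4

# Toy VAE encoder: image feature -> mean and log-variance of a latent "vibe".
W_mu = rng.normal(0, 0.02, (LATENT, D))
W_logvar = rng.normal(0, 0.02, (LATENT, D))
# Toy decoder: latent -> per-token offsets for the prompt embeddings.
W_dec = rng.normal(0, 0.02, (N_TOKENS * D, LATENT))

base_prompt = rng.normal(0, 0.02, (N_TOKENS, D))  # the generic, fixed prompt

def dynamic_prompt(image_feat):
    """Condition the prompt on the image via a reparameterized latent."""
    mu, logvar = W_mu @ image_feat, W_logvar @ image_feat
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=LATENT)  # reparameterization trick
    offset = (W_dec @ z).reshape(N_TOKENS, D)
    return base_prompt + offset  # prompt rewritten "on the fly" for this item

prompt = dynamic_prompt(rng.normal(size=D))
print(prompt.shape)  # (4, 512)
```

Each image thus gets its own version of the prompt, instead of every product sharing one fixed sentence.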

2. The "Super-Resolution" Glasses (WCMA)

The Problem: Old AI missed tiny scratches because it only looked at the smooth, low-resolution version of the image.
The Solution: Imagine putting on a pair of magic glasses that can see the invisible.

  • How it works: The system uses Wavelet Decomposition. Think of an image like a song. The low notes are the melody (the big shapes), and the high notes are the crisp cymbals and hi-hats (the tiny details and textures).
  • The AI separates the image into these "frequencies." It focuses heavily on the high-frequency parts (the cymbals) where tiny defects hide. It then uses these sharp details to "polish" the AI's text instructions. It's like the AI saying, "Okay, I see a 'good apple,' but now that I'm looking at the high-frequency details, I see a tiny bruise, so I'll update my thought to 'apple with a bruise'."
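The "melody vs. cymbals" split can be made concrete with a one-level Haar wavelet transform, the simplest wavelet decomposition. This sketch (plain NumPy, hand-rolled Haar filters; the paper's actual wavelet choice and fusion are not reproduced here) shows the point of the analogy: a smooth patch has zero high-frequency energy, while a tiny scratch shows up almost entirely in the detail bands.

```python
import numpy as np

def haar_dwt2(img):
    """One level of 2-D Haar wavelet decomposition.
    Returns LL (the 'melody': coarse shape) and LH/HL/HH
    (the 'cymbals': horizontal, vertical, diagonal detail)."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # row averages (low-pass)
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # row differences (high-pass)
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return LL, LH, HL, HH

def high_freq_energy(img):
    _, LH, HL, HH = haar_dwt2(img)
    return float((LH**2 + HL**2 + HH**2).sum())

# A smooth "good apple" patch vs. the same patch with a one-pixel scratch.
smooth = np.ones((8, 8))
scratched = smooth.copy()
scratched[3, 4] = 0.0

print(high_freq_energy(smooth))     # 0.0 — perfectly smooth, nothing hides here
print(high_freq_energy(scratched))  # > 0 — the scratch lives in the high bands
```

Notice the defect barely changes the coarse LL band; it is the high-frequency sub-bands that light up, which is exactly why WCMA focuses on them.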

3. The "Council of Experts" (SA-MoE)

The Problem: Sometimes, one way of looking at a problem isn't enough. A scratch might look different depending on the lighting or the angle.
The Solution: Instead of one inspector making the final call, you hire a Council of Experts.

  • How it works: This is the Mixture-of-Experts (MoE) module. Imagine a round table with 8 different specialists (Experts).
    • Expert 1 is great at spotting scratches.
    • Expert 2 is great at spotting color changes.
    • Expert 3 is great at spotting weird shapes.
  • A Router (the meeting moderator) looks at the product and says, "For this specific item, let's listen to Expert 1 and Expert 3." It combines their opinions to make a final, highly accurate decision. This ensures the AI doesn't miss anything because it's listening to the right "voice" for the job.
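The round-table routine above boils down to a standard top-k MoE forward pass. This sketch is a generic illustration, not the paper's SA-MoE: the expert count, top-k of 2, and tiny linear experts are assumptions, but the routing logic (score all experts, keep the best few, blend their outputs with softmax weights) is the core of any MoE layer.

```python
import numpy as np

rng = np.random.default_rng(1)
N_EXPERTS, TOP_K, D = 8, 2, 16  # 8 specialists; listen to the 2 most relevant

# Each "expert" is a tiny linear map; the router scores which ones to consult.
experts = [rng.normal(0, 0.1, (D, D)) for _ in range(N_EXPERTS)]
W_router = rng.normal(0, 0.1, (N_EXPERTS, D))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe(feat):
    """Route the feature to the top-k experts and blend their opinions."""
    scores = W_router @ feat               # the moderator sizes up the item
    top = np.argsort(scores)[-TOP_K:]      # "let's listen to these two experts"
    gates = softmax(scores[top])           # how much weight each opinion gets
    return sum(g * (experts[i] @ feat) for g, i in zip(gates, top))

out = moe(rng.normal(size=D))
print(out.shape)  # (16,)
```

Because only the selected experts run, the model gains capacity (many specialists) without paying the full compute cost of consulting all of them on every item.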

The Result

When you put all three of these tools together:

  1. Personalized Instructions (The Chameleon)
  2. High-Definition Vision (The Magic Glasses)
  3. A Team of Specialists (The Council)

The result is an AI that can look at a product it has never seen before and say, "I know this is broken," with incredible accuracy.

Why does this matter?
The researchers tested this on 14 different datasets, ranging from industrial inspection (checking for broken bottles or leather defects) to hospitals (finding tumors in brains or polyps in guts). The results showed that WMoE-CLIP achieves state-of-the-art performance, outperforming previous zero-shot methods. It's like upgrading from a flashlight to a high-tech laser scanner for finding the tiniest flaws in the world.