Improving Anomaly Detection with Foundation-Model Synthesis and Wavelet-Domain Attention

Imagine you are a quality control inspector at a massive factory that makes everything from toothbrushes to circuit boards. Your job is to spot the one defective item on a conveyor belt full of thousands of perfect ones.

The problem? Defects are rare. You might only see a broken screw once a month. Because you don't have enough "bad" examples to study, your brain (or your computer program) doesn't know what a broken screw looks like until it's too late.

This paper proposes a clever two-part solution to fix this: 1. A "Magic Imagination Machine" to create fake defects for training, and 2. A "Super-Sharp Lens" to help the computer see the tiny flaws better.

Here is how it works, broken down into simple analogies:

Part 1: The "Magic Imagination Machine" (FMAS)

The Problem: Usually, to teach a computer to spot a defect, you need to show it thousands of pictures of broken things. But in a factory, broken things are rare. If you try to make fake broken things by just cutting and pasting pieces of paper (old methods), they look obvious and fake, like a bad Photoshop job.

The Solution: The authors built a pipeline using Foundation Models (the same super-smart AI brains behind tools like ChatGPT and image generators). Think of this as a team of three experts working together:

The Architect (GPT-4): It reads the picture of a perfect object (like a bottle) and writes a creative story: "Imagine a crack in the glass here, or a dent in the metal there." It knows exactly what a "broken bottle" sounds like in words.
The Sculptor (SAM): It looks at the picture and says, "Okay, I see the bottle. I will draw a box around just the bottle so we don't accidentally break the table behind it."
The Painter (Stable Diffusion): It takes the Architect's story and the Sculptor's box, and paints a realistic crack or dent right onto the bottle.

The Result: They can generate thousands of hyper-realistic fake defects without ever needing to train the AI on real broken items. It's like practicing your driving skills in a perfect video game simulator before hitting the real road.

Part 2: The "Super-Sharp Lens" (WDAM)

The Problem: Even with great training data, computers sometimes miss tiny defects. They look at an image as a whole picture, like looking at a forest and missing a single broken branch. They get distracted by the overall shape or color.

The Solution: The authors realized that defects often look different depending on how you "zoom in" on the details. They used a mathematical tool called Wavelet Transform, which is like taking a photo and separating it into four different "layers" of detail:

The Smooth Layer (LL): The big, blurry shapes (like the overall color of the bottle).
The Edge Layers (LH, HL, HH): The sharp lines, textures, and tiny cracks.

The "Lens" (Attention Module):
Imagine you are looking at a painting. A normal computer looks at the whole canvas equally. This new module, WDAM, acts like a smart spotlight.

It looks at the "Smooth Layer" and says, "This looks fine, ignore it."
It looks at the "Edge Layers" and says, *"Wait! There's a weird texture here! Turn up the brightness on this part!"*

It dynamically decides which "layer" of detail is most important for finding a defect. If a defect is a scratch, it focuses on the high-frequency edge layers. If it's a stain, it focuses on the texture layers. It amplifies the signal of the defect and mutes the noise of the background.

Putting It All Together

Training: They use the Magic Imagination Machine to create a massive library of fake, realistic broken items. The computer learns what defects look like without needing real broken items.
Inspection: When the computer inspects a real product, it uses the Super-Sharp Lens to ignore the boring background and zoom in specifically on the tiny, weird patterns that signal a defect.

Why This Matters

No Fine-Tuning: You don't need to retrain the whole AI for every new product. It just works.
Plug-and-Play: The "Super-Sharp Lens" (WDAM) can be added to almost any existing computer vision system, like adding a turbocharger to a car.
Better Results: In tests, this combination found defects much more accurately than previous methods, especially on tricky items like screws or fabric.

In short: They taught the computer to imagine its own training data so it knows what to look for, and then gave it a special pair of glasses that highlights the tiny cracks while ignoring the rest of the world.

1. Problem Statement

Industrial visual anomaly detection faces two primary challenges:

Data Scarcity: Real-world anomalous samples are rare, making supervised learning difficult. Consequently, most methods rely on unsupervised paradigms using only normal data.
Limitations of Existing Synthesis: To overcome data scarcity, researchers use anomaly synthesis. However:
- Non-generative methods (e.g., CutPaste, DRAEM) often produce unrealistic anomalies that lack visual fidelity and fail to capture complex statistical properties of real defects.
- Generative methods (e.g., GANs, Diffusion models) produce realistic anomalies but typically require extensive training, fine-tuning, or class-specific adaptation, hindering their deployment in new scenarios.
Feature Extraction Inefficiency: Standard spatial-domain attention mechanisms often fail to distinguish between normal texture variations and subtle defect patterns, as anomalies often manifest as specific frequency-domain disruptions rather than global appearance changes.

2. Methodology

The authors propose a two-pronged approach: a training-free anomaly synthesis pipeline (FMAS) and a plug-and-play feature enhancement module (WDAM).

A. Foundation Model-based Anomaly Synthesis (FMAS)

FMAS generates high-fidelity synthetic anomalies without model fine-tuning by integrating three foundation models:

Prompt Generation (GPT-4): Automatically generates descriptive and negative prompts based on the object class (e.g., "PCB") to guide the synthesis toward semantically meaningful defects.
Mask Generation (SAM): The Segment Anything Model (SAM) extracts the foreground object mask. To ensure the anomaly is localized on the object (not the background), a rectangular mask is generated and intersected with the foreground mask ( $M = M_{fg} \cap M_{rect}$ ).
Image Synthesis (Stable Diffusion): Uses the normal image, the generated mask, and the text prompts to perform inpainting, creating plausible anomalies within the masked region.
Quality Control (Selector): A filtering mechanism based on LPIPS (Learned Perceptual Image Patch Similarity) selects the best synthetic variant. It chooses the sample closest to a reference threshold to avoid trivial or overly distorted outputs.

B. Wavelet Domain Attention Module (WDAM)

Motivated by the observation that anomalies exhibit distinct saliency across different frequency sub-bands, WDAM is designed to enhance feature extraction:

Decomposition: Input feature maps are decomposed into four frequency sub-bands (LL, LH, HL, HH) using Discrete Wavelet Transform (DWT) via fixed convolutional kernels.
Adaptive Attention: The module applies both Average and Max pooling to the concatenated sub-bands. A Multi-Layer Perceptron (MLP) processes these pooled features to generate learnable attention weights for each sub-band.
Re-weighting & Reconstruction: The weights are broadcast and applied element-wise to the sub-bands to amplify anomaly-sensitive frequencies (often high-frequency details) while suppressing irrelevant noise. The weighted sub-bands are then reconstructed into the spatial domain using Inverse DWT (IDWT).
Integration: WDAM operates as a residual block, seamlessly inserted into existing backbone networks (e.g., CutPaste, DRAEM, PatchCore) without altering the overall architecture.

3. Key Contributions

FMAS Pipeline: A novel, training-free synthesis pipeline that leverages GPT-4, SAM, and Stable Diffusion to generate realistic, semantically diverse anomalies. It constructs synthetic datasets perfectly aligned with MVTec AD and VisA benchmarks.
WDAM Module: A plug-and-play attention mechanism that operates in the wavelet domain. It dynamically learns the importance of different frequency sub-bands (LL, LH, HL, HH) to enhance defect saliency, addressing the uneven distribution of anomaly features across frequencies.
Comprehensive Validation: Extensive experiments demonstrate that combining FMAS and WDAM yields consistent performance improvements across multiple architectures (CutPaste, DRAEM, PatchCore) and datasets (MVTec AD, VisA).

4. Experimental Results

The method was evaluated on the MVTec AD and VisA datasets using Image-level AUROC, Pixel-level AUROC, and Per-Region Overlap (PRO).

Synthesis Quality: FMAS achieved superior realism compared to existing methods (GLASS, RealNet), evidenced by lower FID (56.99) and LPIPS (0.130) scores.
Performance Gains (MVTec AD):
- CutPaste Baseline: 93.23% (Image AUROC).
- Final Model (FMAS + WDAM): 98.00% (Image AUROC), 88.28% (Pixel AUROC), 72.04% (PRO).
- This represents a 4.77% increase in Image AUROC and a 14.53% increase in PRO over the baseline.
Performance Gains (VisA):
- The final model achieved 95.3% Image AUROC, outperforming competitors like EawT and AST.
Ablation Studies:
- FMAS contributed the largest gains (+3.29% Image AUROC), proving the value of high-quality synthetic data.
- WDAM provided consistent improvements (+1.46% Image AUROC) across all tested backbones.
- Placement: Inserting WDAM in Layer 1 of the network yielded the best trade-off between accuracy and computational efficiency.
Efficiency: The addition of WDAM increased model parameters by only 0.19 MB and FLOPs by 0.02 G, with a negligible increase in inference time (~1.6ms).

5. Significance

Zero-Shot Adaptability: By utilizing foundation models for synthesis, the approach eliminates the need for task-specific training or fine-tuning, making it highly adaptable to new industrial scenarios.
Frequency-Aware Detection: WDAM introduces a paradigm shift by treating wavelet sub-bands as semantically distinct components. This allows the model to specifically target the frequency characteristics of defects, which are often missed by standard spatial attention.
Plug-and-Play Utility: Both the data synthesis strategy and the WDAM module are designed to be easily integrated into existing state-of-the-art anomaly detection frameworks, offering a practical path to boost performance without redesigning entire networks.
Visual Fidelity: The qualitative results show that the method produces heatmaps that are significantly more concentrated on actual defects and less noisy in the background compared to baseline methods.

In conclusion, this paper effectively bridges the gap between data scarcity and model generalization in industrial anomaly detection by leveraging the generative power of foundation models and the analytical precision of wavelet-domain attention.

Improving Anomaly Detection with Foundation-Model Synthesis and Wavelet-Domain Attention

Part 1: The "Magic Imagination Machine" (FMAS)

Part 2: The "Super-Sharp Lens" (WDAM)

Putting It All Together

Why This Matters

1. Problem Statement

2. Methodology

A. Foundation Model-based Anomaly Synthesis (FMAS)

B. Wavelet Domain Attention Module (WDAM)

3. Key Contributions

4. Experimental Results

5. Significance

More like this

Model2Kernel: Model-Aware Symbolic Execution For Safe CUDA Kernels

Algorithmic Barriers to Detecting and Repairing Structural Overspecification in Adaptive Data-Structure Selection

Zero-Cost NDV Estimation from Columnar File Metadata

Persistence-based topological optimization: a survey

Multi-LLM Query Optimization