FiLo: Zero-Shot Anomaly Detection by Fine-Grained Description and High-Quality Localization

Imagine you are a quality control inspector at a massive factory that makes everything from chewing gum to circuit boards. Your job is to spot defects.

The Old Way (Previous Methods):
In the past, inspectors (or AI models) were trained only on "perfect" items. If they saw something that looked weird, they'd shout, "That's broken!" But here's the problem: they didn't know what "broken" looked like for every specific item.

If a wooden table had a scratch, the inspector might say, "That's a scratch."
If a metal nut had a scratch, the inspector might also just say, "That's a scratch."
But what if the wood has a knot or the metal has rust? The generic word "scratch" doesn't help much. It's like trying to find a specific person in a crowd by just saying, "Look for a person," instead of "Look for a person with a red hat and a beard."

Also, the old inspectors were bad at pinpointing exactly where the defect was. They might look at a whole photo and guess, "Maybe the defect is here?" often pointing at the background or empty space by mistake.

The New Solution: FiLo (Fine-Grained Description & High-Quality Localization)
The paper introduces a new AI system called FiLo. Think of FiLo as a super-smart inspector who has a massive, detailed encyclopedia in their head and a pair of laser-guided binoculars.

FiLo works in two main steps:

1. The "Smart Encyclopedia" (Fine-Grained Description)

Instead of just saying "This is normal" or "This is broken," FiLo uses a Large Language Model (LLM), essentially an advanced chatbot, to write specific descriptions of the defects each kind of product can have.

The Analogy: Imagine you are looking for a lost item.
- Old Way: You ask, "Is it lost?"
- FiLo Way: You ask, "Is it a chewed piece of gum, a stained piece of gum, or a torn wrapper?"
How it works: Before looking at the image, FiLo asks the LLM: "What are all the ways a chewing gum can be defective?" The LLM generates a list: sticky, torn, discolored, misshapen.
It then learns to recognize these specific words. This makes the AI much better at spotting the difference between a "perfect" gum and a "torn" gum, rather than just a generic "bad" gum.

2. The "Laser Binoculars" (High-Quality Localization)

Once FiLo knows what to look for, it needs to find where it is. The old methods would scan the whole image pixel by pixel, often getting confused by shadows or the background.

FiLo uses a three-step process to find the defect with laser precision:

The Rough Sketch (Grounding DINO): First, it uses a tool called Grounding DINO to draw rough boxes around the regions where a defect is likely to be. It's like saying, "Okay, the defect is definitely on this table, not on the floor." This stops the AI from getting distracted by the background.
The Clue (Position Enhancement): It adds the location to its description. Instead of just "torn gum," it thinks, "torn gum on the right side." This helps the AI focus its attention.
The Multi-Shape Net (MMCI): Defects come in all shapes and sizes. A scratch is long and thin; a dent is round and small. FiLo uses a special "net" made of different-sized and shaped filters (like a fishing net with different mesh sizes) to catch defects of any shape. This ensures it doesn't miss a tiny crack or a huge stain.

The Result

When you put these two superpowers together:

FiLo doesn't just say "This is broken." It says, "This is a rusty metal nut located at the top left."
It works on products it has never seen before (Zero-Shot). You don't need to show it 1,000 pictures of rusty nuts to teach it; it just uses its "encyclopedia" to understand what rust looks like.

In Summary:
FiLo is like hiring an inspector who doesn't just have a checklist of "Good vs. Bad," but has a detailed dictionary of likely flaws and a pair of high-tech glasses that ignore the background and zoom in on where the problem is, however small or oddly shaped it is. In the paper's experiments, this combination outperformed the previous best zero-shot method on the reported benchmark metrics.

1. Problem Statement

Zero-Shot Anomaly Detection (ZSAD) aims to identify and locate defects in industrial products without prior access to specific normal or abnormal samples from the target category. While existing methods leverage pre-trained Vision-Language Models (VLMs) like CLIP, they face two critical limitations:

Generic Descriptions: Current methods rely on broad, generic text prompts (e.g., "damaged" or "defect") to represent anomalies. These fail to capture the diverse and specific types of defects (e.g., "scratch," "crack," "discoloration") unique to different object categories, leading to poor semantic alignment.
Localization Inaccuracy: Standard approaches compute similarity between image patches and text features directly. This often produces false positives in background regions, and it struggles to pinpoint anomalies of varying sizes and shapes. Furthermore, methods that use sliding windows (like WinCLIP) incur high computational costs.
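The direct patch-to-text comparison described above can be sketched as follows. This is a minimal illustration, not code from the paper: the feature vectors are toy 2-D lists standing in for CLIP embeddings, and the function names and the temperature value of 0.07 are assumptions chosen for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def patch_anomaly_scores(patch_feats, normal_text, abnormal_text, temperature=0.07):
    """Softmax over (normal, abnormal) similarities for each patch.

    Returns the probability assigned to the abnormal prompt per patch.
    The temperature is an assumed value, not one taken from the source.
    """
    scores = []
    for patch in patch_feats:
        s_normal = cosine(patch, normal_text) / temperature
        s_abnormal = cosine(patch, abnormal_text) / temperature
        m = max(s_normal, s_abnormal)  # subtract the max for numerical stability
        e_normal = math.exp(s_normal - m)
        e_abnormal = math.exp(s_abnormal - m)
        scores.append(e_abnormal / (e_normal + e_abnormal))
    return scores

# Toy features: the first patch matches the normal prompt, the second the abnormal one.
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
scores = patch_anomaly_scores(patches, normal_text=[1.0, 0.0], abnormal_text=[0.0, 1.0])
```

Because every patch is scored independently against the same two prompts, nothing in this baseline distinguishes foreground from background or adapts to the shape of a defect, which is the weakness the paragraph above describes.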

2. Methodology: FiLo

The authors propose FiLo, a novel framework comprising two core components: Fine-Grained Description (FG-Des) for detection and High-Quality Localization (HQ-Loc) for segmentation.

A. Fine-Grained Description (FG-Des)

This module addresses the semantic gap in text prompts.

LLM-Generated Descriptions: Instead of using generic terms, the method employs Large Language Models (LLMs) to generate a specific list of potential anomaly types for each object category (e.g., for "Wood," it generates "knots," "warping," "cracks along the grain").
Adaptively Learned Text Templates: Replacing manually crafted templates (e.g., "A photo of a [state] [class]"), FiLo uses learnable text vectors ( $[V_1] \dots [V_n]$ for normal prompts and $[W_1] \dots [W_n]$ for abnormal prompts).
- Normal Template: $[V_1] \dots [V_n]\,[\text{STATE}]\,[\text{CLASS}]$
- Abnormal Template: $[W_1] \dots [W_n]\,[\text{STATE}]\,[\text{CLASS}] \text{ with } [\text{ANOMALY CLASS}] \text{ at } [\text{POS}]$
- The LLM-generated specific anomaly content fills the $[\text{ANOMALY CLASS}]$ slot, and the learnable vectors adapt to the specific distribution of the anomaly detection task.
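As a rough illustration of how the two templates get filled, the sketch below assembles prompt strings from a class name and a list of anomaly types. It works at the string level only: the `[V_i]` and `[W_i]` tokens are placeholders for the learnable vectors, which in the actual method are embeddings rather than text. The function name, the state words "normal" and "damaged", and the context length of 4 are all assumptions made for this example.

```python
def build_prompts(class_name, anomaly_types, positions, n_ctx=4,
                  normal_state="normal", abnormal_state="damaged"):
    """Fill the normal and abnormal templates with concrete slot values.

    [V_i] / [W_i] are textual stand-ins for learnable context vectors.
    The state words and n_ctx are hypothetical defaults.
    """
    ctx_normal = " ".join(f"[V_{i}]" for i in range(1, n_ctx + 1))
    ctx_abnormal = " ".join(f"[W_{i}]" for i in range(1, n_ctx + 1))

    normal_prompt = f"{ctx_normal} {normal_state} {class_name}"
    abnormal_prompts = [
        f"{ctx_abnormal} {abnormal_state} {class_name} with {anomaly} at {pos}"
        for anomaly in anomaly_types
        for pos in positions
    ]
    return normal_prompt, abnormal_prompts

# Anomaly types as an LLM might list them for the "wood" category.
normal_prompt, abnormal_prompts = build_prompts(
    "wood",
    anomaly_types=["knots", "warping", "cracks along the grain"],
    positions=["top left", "center"],
)
```

Each abnormal prompt pairs one anomaly type with one position, so three anomaly types and two positions give six abnormal prompts for the category.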

B. High-Quality Localization (HQ-Loc)

This module improves the precision of anomaly localization through a three-step pipeline:

Grounding DINO Preliminary Localization:
- The method uses Grounding DINO with the fine-grained text prompts to generate preliminary bounding boxes for anomalies.
- Purpose: Since anomalies usually appear in the foreground, these boxes help filter out background noise. Regions outside these boxes are suppressed to reduce false positives.
Position-Enhanced Text Prompts:
- The coordinates of the detected bounding boxes are encoded into the text prompt (e.g., "at [top left]"). This guides the model to focus on specific spatial regions during feature matching.
Multi-Scale Multi-Shape Cross-modal Interaction (MMCI) Module:
- To handle anomalies of various sizes and shapes without the computational overhead of sliding windows, FiLo introduces the MMCI module.
- It applies convolutional kernels of different sizes and shapes (e.g., $1\times1, 3\times3, 5\times5, 1\times5, 5\times1$ ) to the patch features extracted by the CLIP Image Encoder.
- These features are aggregated and compared with the position-enhanced text features to generate a high-resolution anomaly map.
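The first two steps of this pipeline can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the boxes are given directly rather than produced by Grounding DINO, the 3x3 grid used to name positions is a hypothetical scheme, and suppression is modeled as hard zeroing outside the boxes, which may differ from how the method actually down-weights background regions.

```python
def position_phrase(box, width, height):
    """Map a box (x0, y0, x1, y1) to a coarse position phrase.

    Uses the box centre on an assumed 3x3 grid: "top left", "top",
    "center", "bottom right", and so on.
    """
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    col = min(int(3 * cx / width), 2)
    row = min(int(3 * cy / height), 2)
    rows = ("top", "", "bottom")
    cols = ("left", "", "right")
    phrase = f"{rows[row]} {cols[col]}".strip()
    return phrase or "center"

def suppress_outside(anomaly_map, boxes):
    """Keep anomaly scores inside any box and zero them elsewhere.

    Boxes use integer pixel coordinates with x1 and y1 exclusive.
    """
    h, w = len(anomaly_map), len(anomaly_map[0])
    out = [[0.0] * w for _ in range(h)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(h, y1)):
            for x in range(max(0, x0), min(w, x1)):
                out[y][x] = anomaly_map[y][x]
    return out

# A 4x4 map with uniform scores and one preliminary box in the top-left corner.
raw_map = [[0.5] * 4 for _ in range(4)]
box = (0, 0, 2, 2)
pos = position_phrase(box, width=4, height=4)
masked = suppress_outside(raw_map, [box])
```

The phrase returned by `position_phrase` is what would fill the position slot of the abnormal prompt, while `suppress_outside` shows why scores in the background cannot turn into false positives once they fall outside every box.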

C. Training and Loss Functions

Global Loss: Cross-entropy loss is used to optimize the global anomaly score (image-level classification).
Local Loss: A combination of Focal Loss (to handle class imbalance between normal and anomalous pixels) and Dice Loss is used to optimize the pixel-level anomaly map.
Adapter: A bottleneck adapter layer aligns global image features with text features.
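For readers unfamiliar with the two local loss terms, here is a minimal sketch of binary Focal Loss and Dice Loss over a flattened anomaly map. The formulas are the standard ones; the focusing parameter `gamma=2.0`, the omission of a class-weighting term, and the smoothing constant are assumptions, since the summary does not give FiLo's settings or how the two terms are weighted.

```python
import math

def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss: -(1 - p_t)^gamma * log(p_t).

    p_t is the predicted probability of the true class, so confident
    correct pixels contribute almost nothing. gamma is an assumed default.
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p_t = p if t == 1 else 1.0 - p
        p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)

def dice_loss(probs, targets, eps=1e-7):
    """1 - Dice coefficient between the soft prediction and the binary mask."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

# Flattened 2x2 anomaly maps: one perfect prediction and one poor prediction.
mask = [1, 0, 0, 1]
perfect = [1.0, 0.0, 0.0, 1.0]
poor = [0.1, 0.9, 0.9, 0.1]

local_loss_perfect = focal_loss(perfect, mask) + dice_loss(perfect, mask)
local_loss_poor = focal_loss(poor, mask) + dice_loss(poor, mask)
```

Focal Loss down-weights the many easy normal pixels, while Dice Loss measures overlap with the anomalous region directly, which is why the two are commonly paired when anomalous pixels are rare.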

3. Key Contributions

Fine-Grained Description (FG-Des): The first application of visual description enhancement techniques to ZSAD. It replaces generic anomaly labels with LLM-generated, category-specific descriptions and utilizes learnable text vectors to improve semantic matching.
High-Quality Localization (HQ-Loc): A novel localization strategy that combines Grounding DINO for background suppression, position-enhanced prompts for spatial awareness, and the MMCI module for multi-scale/multi-shape feature aggregation.
State-of-the-Art Performance: FiLo achieves superior results on standard industrial benchmarks without requiring target-category training data.

4. Experimental Results

The method was evaluated on the MVTec AD and VisA datasets, with training on one and zero-shot testing on the other.

VisA Dataset (Trained on MVTec):
- Image-level AUC: 83.9% (an improvement of 1.1 percentage points over the previous SOTA, AnomalyCLIP).
- Pixel-level AUC: 95.9% (an improvement of 0.4 percentage points over AnomalyCLIP).
MVTec Dataset (Trained on VisA):
- Pixel-level AUC: 92.3% (an improvement of 1.2 percentage points over AnomalyCLIP).
Ablation Studies:
- Removing the LLM-generated fine-grained descriptions significantly drops performance.
- Removing the MMCI module or position enhancement reduces localization accuracy, confirming the necessity of multi-scale processing and spatial guidance.
- Using learnable vectors outperforms fixed manual templates.
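To make the AUC figures above concrete, the sketch below computes the area under the ROC curve from anomaly scores and ground-truth labels, assuming (as is usual for these benchmarks) that "AUC" here means AUROC. It uses the pairwise-ranking definition: the fraction of (anomalous, normal) pairs in which the anomalous sample receives the higher score. The function name and toy data are illustrative only.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts how often an anomalous sample (label 1) outscores a normal
    sample (label 0); ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Four images: two anomalous (label 1) and two normal (label 0).
image_scores = [0.9, 0.8, 0.3, 0.2]
image_labels = [1, 0, 1, 0]
image_auc = auroc(image_scores, image_labels)
```

Image-level AUC applies this to one score per image, while pixel-level AUC applies the same calculation to every pixel of the anomaly maps against the ground-truth masks.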

5. Significance

FiLo advances zero-shot industrial anomaly detection by addressing two weaknesses of earlier methods: the "semantic gap" left by generic text prompts and the "spatial ambiguity" of patch-level localization.

Interpretability: By generating specific anomaly types (e.g., identifying a "crack" vs. a "stain"), the model provides more interpretable results than binary "normal/abnormal" classifications.
Efficiency: The MMCI module achieves multi-scale detection without the heavy computational cost of sliding-window approaches used in previous methods.
Generalization: The framework demonstrates robust generalization across diverse object categories and defect types, making it highly practical for real-world deployment where defect data is scarce or privacy-sensitive.
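The efficiency point about MMCI can be illustrated with a simplified sketch of multi-shape filtering on a patch grid. This is not the MMCI module itself: fixed averaging kernels stand in for its learned convolutional weights, a single-channel grid of scores stands in for CLIP patch features, and averaging the responses is an assumed aggregation rule. Only the kernel shapes are taken from the description of the method.

```python
KERNEL_SHAPES = [(1, 1), (3, 3), (5, 5), (1, 5), (5, 1)]  # (height, width)

def box_filter(grid, kh, kw):
    """Average each cell over a kh x kw neighbourhood centred on it.

    At the borders the average is taken over the in-bounds cells only.
    """
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-(kh // 2), kh // 2 + 1):
                for dx in range(-(kw // 2), kw // 2 + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += grid[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out

def multi_shape_response(grid, shapes=KERNEL_SHAPES):
    """Average the responses of all kernel shapes at every cell."""
    responses = [box_filter(grid, kh, kw) for kh, kw in shapes]
    h, w = len(grid), len(grid[0])
    return [
        [sum(r[y][x] for r in responses) / len(responses) for x in range(w)]
        for y in range(h)
    ]

# A 5x5 patch grid with a thin horizontal "scratch" along the middle row.
grid = [[1.0 if y == 2 else 0.0 for _ in range(5)] for y in range(5)]
horizontal = box_filter(grid, 1, 5)   # kernel aligned with the scratch
vertical = box_filter(grid, 5, 1)     # kernel across the scratch
combined = multi_shape_response(grid)
```

At the centre cell the 1x5 kernel responds at full strength while the 5x1 kernel responds at one fifth of it, showing how elongated kernels pick out elongated defects. Each kernel is applied once to features that were already extracted, whereas a sliding-window approach would re-encode many overlapping crops of the image.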

