TIPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection

Imagine you are a quality control inspector at a factory. Your job is to spot defective products on a conveyor belt. In the past, you needed to see thousands of pictures of "perfect" products and thousands of pictures of "broken" products to learn what to look for.

But what if you've never seen this specific product before? What if you have no examples of it at all, good or bad? This is the problem of Zero-Shot Anomaly Detection. You need to find the "bad" stuff without having studied the "bad" stuff beforehand.

For a while, computers solved this using a smart tool called CLIP. Think of CLIP as a very well-read librarian who knows how to match pictures with words. If you show it a picture of a broken widget and say "This is broken," it understands. But CLIP has a flaw: it's a bit of a "big picture" thinker. It's great at saying, "Yes, this whole image looks broken," but it's terrible at pointing exactly where the crack is. It's like a librarian who can tell you a book is about a fire, but can't point to the specific page where the fire starts.

Previous attempts to fix this involved building complex, Rube-Goldberg-style machines around the librarian to force it to look closer. These machines were heavy, complicated, and sometimes made the librarian forget what it already knew.

The New Approach: "TIPS" (The Smart Librarian)

This paper introduces a new, smarter librarian named TIPS. Unlike the old one, TIPS was trained specifically to pay attention to the spatial details—where things are in the picture. It's naturally better at spotting the exact location of a crack.

However, even TIPS has a hiccup. When it looks at a whole picture (global view) and when it looks at a tiny patch of the picture (local view), it speaks two slightly different "dialects."

Global TIPS: "This whole image is weird."
Local TIPS: "This tiny square here is weird."

If you try to mix these two voices directly, they get confused, and the computer makes mistakes.

The Solution: "Decoupled Prompts" (The Two-Headed Strategy)

The authors realized that instead of forcing TIPS to speak one language, they should let it use two different strategies for two different jobs. They call this Decoupled Prompts.

Think of it like a detective team with two specialists:

The "Big Picture" Detective (Fixed Prompts):
- Job: Decide if the entire image is defective.
- Method: This detective uses a pre-written, perfect script (Fixed Prompts) like "A photo of a flawless widget" vs. "A photo of a broken widget." They don't change the script; they just read it perfectly. This is great for a quick "Yes/No" answer.
The "Microscope" Detective (Learnable Prompts):
- Job: Find the exact spot of the defect.
- Method: This detective is allowed to learn and tweak their own notes (Learnable Prompts) specifically to find tiny cracks, scratches, or weird textures. They ignore the big picture and focus entirely on the details.

The Magic Trick:
The system runs both detectives.

The "Big Picture" detective gives a score for the whole image.
The "Microscope" detective draws a map of exactly where the bad spots are.
The Final Score: The system takes the "Big Picture" score and adds the strongest signal from the "Microscope" map. It's like saying, "The whole image looks suspicious, and here is the specific evidence proving it."

Why This Matters

The paper tested this new "Tipsomaly" system on 14 different datasets, ranging from industrial metal parts to medical scans (like brain MRIs).

The Result: It beat the previous best methods (which used the old, clunky CLIP system) in almost every category.
The Efficiency: It did this without building a massive, complex machine. It's like upgrading a car engine rather than adding a jetpack to the roof. It's lighter, faster, and just works better.
The Analogy: Imagine trying to find a needle in a haystack. The old way was to build a giant, noisy magnet that sometimes pulled up the whole haystack. The new way is to have a quiet, precise metal detector (TIPS) that knows exactly where to look, guided by a smart team of two detectives working in harmony.

In a Nutshell

The paper says: "Stop trying to fix the old, blurry tools with complicated hacks. Instead, use a sharper tool (TIPS) and let it use two different strategies—one for the big picture and one for the details—to solve the problem simply and effectively."

This approach allows computers to spot defects in safety-critical settings (like factories or hospitals) even when they've never been trained on that specific kind of product or scan.

1. Problem Definition

The paper addresses Zero-Shot Anomaly Detection (ZSAD), an important task in safety-critical domains (industrial inspection, medical imaging) where no training samples from the target domain are available.

The Challenge: Existing ZSAD methods rely heavily on CLIP (Contrastive Language-Image Pre-training). However, CLIP suffers from two main limitations in this context:
1. Spatial Misalignment: CLIP's contrastive objective does not enforce patch-level alignment between image patches and text, leading to poor localization of fine-grained anomalies.
2. Weak Sensitivity: CLIP lacks sensitivity to subtle, fine-grained deviations.
Current Limitations: Prior works attempt to fix CLIP's flaws by adding complex auxiliary modules (e.g., attention mechanisms, trainable visual prompts, feature adapters). These approaches often increase architectural complexity, risk overfitting to source data, and fail to fully restore spatial coherence.

2. Methodology: The Tipsomaly Framework

The authors propose Tipsomaly, a framework that replaces the CLIP backbone with TIPS (Text-Image Pretraining with Spatial Awareness) and utilizes a decoupled prompting strategy to avoid complex architectural modifications.

A. Backbone Selection: TIPS

Instead of CLIP, the authors use TIPS, a vision-language model trained with spatially aware objectives.

Advantage: TIPS inherently possesses better patch-text grounding and spatial coherence than CLIP.
New Challenge: Directly using TIPS reveals a distributional gap between its global features (used for image-level classification) and local features (used for pixel-level segmentation). Training prompts to optimize both simultaneously degrades one or the other.

B. Decoupled Prompting Strategy

To sidestep the gap between global and local features, rather than force one set of prompts to serve both, the authors employ a dual-prompt approach:

Fixed Prompts (Image-Level Detection):
- Uses static, hand-crafted text templates (e.g., "A photo of a {STATE} {CLASS}").
- Generates fixed text prototypes ( $G_f$ ) for normal and abnormal states.
- Usage: These are compared against TIPS's spatial global token ( $g^s_i$ ) to generate the image-level anomaly score.
Learnable Prompts (Pixel-Level Localization):
- Uses class-agnostic, learnable token sequences ( $T^n, T^a$ ) optimized on source-domain data.
- Generates text prototypes ( $G_l$ ) specialized for fine-grained alignment.
- Usage: These are compared against TIPS's dense patch embeddings ( $Z_M$ ) to generate the pixel-level anomaly map.
- Training: These prompts are trained using local loss functions (Focal Loss and Dice Loss) only, avoiding the global classification objective that causes the distributional mismatch.
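The local-only objective named in the Training bullet can be sketched as below. This is a minimal illustration and not the authors' implementation: the function names, the focusing parameter `gamma = 2.0`, and the unweighted sum of the two terms are assumptions, and the inputs are flattened per-pixel anomaly probabilities rather than real TIPS outputs.

```python
import math

def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss over flattened per-pixel anomaly probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        # Probability the model assigns to the true class of this pixel.
        p_t = p if t == 1 else 1.0 - p
        p_t = min(max(p_t, eps), 1.0 - eps)  # clamp to keep log() finite
        # Easy pixels (p_t near 1) are down-weighted by (1 - p_t) ** gamma.
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)

def dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss: 1 minus the overlap between prediction and mask."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

def local_loss(probs, targets):
    """Assumed combination: plain sum of focal and Dice terms, no global term."""
    return focal_loss(probs, targets) + dice_loss(probs, targets)

# Toy 4-pixel example: ground-truth mask marks pixels 0 and 2 as anomalous.
mask = [1, 0, 1, 0]
good_prediction = [0.9, 0.1, 0.8, 0.2]
poor_prediction = [0.2, 0.8, 0.3, 0.7]
print(local_loss(good_prediction, mask), local_loss(poor_prediction, mask))
```

Focal loss keeps the many easy normal pixels from dominating the gradient, while Dice loss rewards overlap with the (usually small) anomalous region, which is why the pair is a common choice for segmentation with heavy class imbalance.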

C. Inference and Scoring

Pixel-Level: The anomaly map is computed by calculating the similarity between patch embeddings and the learnable localization prototypes. The map is upsampled and smoothed.
Image-Level: The final image-level score ( $\hat{y}$ ) is a combination of:
1. The global score derived from the spatial token and fixed prototypes.
2. The strongest local evidence (the maximum pixel-level anomaly score, $\max(\hat{S}_a)$ ).
- Formula: $\hat{y} = p_a(g^s_i, G_f) + \max(\hat{S}_a)$ .
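The scoring rule above can be sketched on toy vectors as follows. The assumptions here are loud ones: $p_a$ is taken to be a two-way softmax over temperature-scaled cosine similarities to the normal and abnormal prototypes, the temperature `0.07` is a placeholder, all function names are hypothetical, and the upsampling and smoothing of the map are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def abnormal_prob(feature, normal_proto, abnormal_proto, temperature=0.07):
    """Assumed form of p_a: softmax over similarities to the two prototypes."""
    s_n = cosine(feature, normal_proto) / temperature
    s_a = cosine(feature, abnormal_proto) / temperature
    m = max(s_n, s_a)  # subtract the max for numerical stability
    e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
    return e_a / (e_n + e_a)

def image_score(global_token, patches, fixed_protos, learned_protos):
    """Return (y_hat, anomaly_map) with y_hat = p_a(global) + max(local map)."""
    global_p = abnormal_prob(global_token, *fixed_protos)
    # One anomaly probability per patch, using the learnable prototypes.
    anomaly_map = [abnormal_prob(z, *learned_protos) for z in patches]
    return global_p + max(anomaly_map), anomaly_map

# Toy 2-D embeddings; real embeddings would come from the TIPS encoders.
fixed = ([1.0, 0.0], [0.0, 1.0])    # (normal, abnormal) fixed prototypes
learned = ([1.0, 0.0], [0.0, 1.0])  # (normal, abnormal) learnable prototypes
patches = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9]]
score, amap = image_score([0.7, 0.3], patches, fixed, learned)
print(score, amap)
```

In this toy case the global token looks mostly normal, yet the third patch is strongly anomalous, so the maximum of the local map lifts the image-level score. That is the role the $\max(\hat{S}_a)$ term plays in the formula.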

3. Key Contributions

Backbone Re-evaluation: The paper demonstrates that switching from CLIP to a spatially aware backbone (TIPS) is more effective than applying complex "tricks" to CLIP.
Decoupled Prompting: The authors identify the distributional gap between global and local features in VLMs and address it by separating the prompting strategy: fixed prompts for global detection and learnable prompts (trained with local loss only) for localization.
Simple yet Effective Pipeline: The method achieves state-of-the-art results without complex adapters, attention modules, or joint tuning of encoders, maintaining a lean architecture.
Comprehensive Evaluation: The framework is validated across 14 diverse datasets covering both industrial defects (MVTec-AD, VisA, etc.) and medical abnormalities (ISIC, BrainMRI, etc.), proving strong cross-domain generalization.

4. Experimental Results

The method was evaluated on 14 datasets (7 industrial, 7 medical) against three strong CLIP-based baselines (VAND, AnomalyCLIP, AdaCLIP).

Industrial Performance:
- Image-Level: Improved AUROC by +2.3%, AP by +3.9%, and F1-max by +1.1% on average.
- Pixel-Level: Improved AUROC by +2.0%, AUPRO by +6.9%, and F1-max by +1.5%.
Medical Performance:
- Showed remarkable generalization, improving pixel-level AUROC by +3.2%, AUPRO by +4.4%, and F1-max by +5.3% on average.
Ablation Studies:
- Confirmed that decoupled prompting outperforms using only fixed or only learnable prompts.
- Showed that training learnable prompts with local loss only yields the best segmentation, while adding global loss degrades localization.
- Demonstrated that injecting the max local evidence into the global score boosts image-level detection.
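For readers unfamiliar with two of the metrics reported above, the sketch below computes AUROC and F1-max from anomaly scores and binary labels using their standard definitions. It is a plain illustration on made-up scores, not the paper's evaluation code, and the function names are hypothetical.

```python
def auroc(scores, labels):
    """Fraction of (anomalous, normal) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

def f1_max(scores, labels):
    """Best F1 over all thresholds, predicting 'anomalous' when score >= threshold."""
    best = 0.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if tp:
            best = max(best, 2 * tp / (2 * tp + fp + fn))
    return best

# Made-up image-level scores: two normal images, then two anomalous ones.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(auroc(scores, labels), f1_max(scores, labels))
```

AUROC is threshold-free, whereas F1-max reports the best achievable operating point, so the two can move differently when a method changes how scores are calibrated.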

5. Significance

This paper shifts the paradigm in Zero-Shot Anomaly Detection from "fixing CLIP with complex modules" to "choosing the right backbone and simplifying the prompt strategy."

Efficiency: It achieves superior performance with a lighter architecture, reducing computational overhead and the risk of overfitting.
Generalization: By leveraging TIPS's inherent spatial awareness and decoupling the objectives, the model generalizes better to unseen domains (both industrial and medical) without needing domain-specific training data.
Practicality: The "Tips over Tricks" philosophy suggests that for VLM-based tasks, architectural simplicity and appropriate model selection often yield better results than intricate engineering workarounds.

The code is publicly available, facilitating further research in spatially aware anomaly detection.
