TIPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection

This paper proposes TIPS, a lean zero-shot anomaly detection framework that leverages a spatially aware vision-language model and decoupled prompts to overcome CLIP's localization and sensitivity limitations, achieving superior performance across seven industrial datasets without relying on complex auxiliary modules.

Alireza Salehi, Ehsan Karami, Sepehr Noey, Sahand Noey, Makoto Yamada, Reshad Hosseini, Mohammad Sabokrou

Published 2026-02-26

Imagine you are a quality control inspector at a factory. Your job is to spot defective products on a conveyor belt. In the past, you needed to see thousands of pictures of "perfect" products and thousands of pictures of "broken" products to learn what to look for.

But what if you've never seen this specific product before, and you have no example pictures of it at all? This is the problem of Zero-Shot Anomaly Detection. You need to find the "bad" stuff in a product you have never studied beforehand.

For a while, the standard approach relied on a smart tool called CLIP. Think of CLIP as a very well-read librarian who knows how to match pictures with words. If you show it a picture of a broken widget and say "This is broken," it understands. But CLIP has a flaw: it's a bit of a "big picture" thinker. It's great at saying, "Yes, this whole image looks broken," but it's terrible at pointing to exactly where the crack is. It's like a librarian who can tell you a book is about a fire, but can't point to the specific page where the fire starts.

Previous attempts to fix this involved building complex, Rube-Goldberg-style machines around the librarian to force it to look closer. These machines were heavy, complicated, and sometimes made the librarian forget what it already knew.

The New Approach: "TIPS" (The Smart Librarian)

This paper introduces a new, smarter librarian named TIPS. Unlike the old one, TIPS was trained specifically to pay attention to the spatial details—where things are in the picture. It's naturally better at spotting the exact location of a crack.

However, even TIPS has a hiccup. When it looks at a whole picture (global view) and when it looks at a tiny patch of the picture (local view), it speaks two slightly different "dialects."

  • Global TIPS: "This whole image is weird."
  • Local TIPS: "This tiny square here is weird."

If you use the same prompts for both views, the mismatch between the two dialects confuses the system, and it makes mistakes.

The Solution: "Decoupled Prompts" (The Two-Headed Strategy)

The authors realized that instead of forcing TIPS to speak one language, they should let it use two different strategies for two different jobs. They call this Decoupled Prompts.

Think of it like a detective team with two specialists:

  1. The "Big Picture" Detective (Fixed Prompts):

    • Job: Decide if the entire image is defective.
    • Method: This detective uses a pre-written, perfect script (Fixed Prompts) like "A photo of a flawless widget" vs. "A photo of a broken widget." They don't change the script; they just read it perfectly. This is great for a quick "Yes/No" answer.
  2. The "Microscope" Detective (Learnable Prompts):

    • Job: Find the exact spot of the defect.
    • Method: This detective is allowed to learn and tweak their own notes (Learnable Prompts) specifically to find tiny cracks, scratches, or weird textures. They ignore the big picture and focus entirely on the details.
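
To make the two detectives concrete, here is a minimal Python sketch. It assumes the common CLIP-style recipe (cosine similarity between an image feature and a "normal" and an "abnormal" text embedding, followed by a two-way softmax); the paper's exact formulation may differ. All function names, the tiny 2-D embeddings, and the temperature value are hypothetical stand-ins for what the real TIPS encoders and the trained prompts would produce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def abnormal_probability(feature, normal_text, abnormal_text, temperature=0.07):
    """Two-way softmax over similarities; returns the 'abnormal' share.

    The temperature is an illustrative value, not one taken from the paper.
    """
    s_normal = cosine(feature, normal_text) / temperature
    s_abnormal = cosine(feature, abnormal_text) / temperature
    top = max(s_normal, s_abnormal)  # subtract the max for numerical stability
    e_normal = math.exp(s_normal - top)
    e_abnormal = math.exp(s_abnormal - top)
    return e_abnormal / (e_normal + e_abnormal)

def image_level_score(global_feature, fixed_normal, fixed_abnormal):
    """'Big Picture' detective: fixed prompts scored against the whole image."""
    return abnormal_probability(global_feature, fixed_normal, fixed_abnormal)

def anomaly_map(patch_features, learned_normal, learned_abnormal):
    """'Microscope' detective: learnable prompts scored against every patch."""
    return [
        [abnormal_probability(p, learned_normal, learned_abnormal) for p in row]
        for row in patch_features
    ]

# Toy inputs (assumed): a 2x2 grid of patches, one of which looks defective.
fixed_normal, fixed_abnormal = [1.0, 0.0], [0.0, 1.0]
learned_normal, learned_abnormal = [0.9, 0.1], [0.1, 0.9]
global_feature = [0.6, 0.4]
patches = [[[1.0, 0.0], [0.9, 0.2]],
           [[0.8, 0.1], [0.1, 1.0]]]

print("image-level score:", image_level_score(global_feature, fixed_normal, fixed_abnormal))
for row in anomaly_map(patches, learned_normal, learned_abnormal):
    print([round(v, 3) for v in row])
```

The point of keeping two separate prompt pairs is visible in the signatures: the image-level function never touches the learned prompts, and the map never touches the fixed ones.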

The Magic Trick:
The system runs both detectives.

  • The "Big Picture" detective gives a score for the whole image.
  • The "Microscope" detective draws a map of exactly where the bad spots are.
  • The Final Score: The system takes the "Big Picture" score and adds the strongest signal from the "Microscope" map. It's like saying, "The whole image looks suspicious, and here is the specific evidence proving it."
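
The final-score step described above can be sketched in a few lines of Python. This assumes a plain, unweighted sum of the global score and the maximum of the anomaly map; whether the paper weights or normalizes the two terms is not stated in this summary, and the function name is hypothetical.

```python
def final_image_score(global_score, anomaly_map):
    """Add the strongest local signal to the image-level score.

    Assumption: a simple unweighted sum, with the anomaly map given as a
    list of rows of per-patch scores.
    """
    strongest_local = max(max(row) for row in anomaly_map)
    return global_score + strongest_local

# Toy example: a mildly suspicious image with one strongly flagged patch.
toy_map = [[0.10, 0.30],
           [0.05, 0.90]]
print(final_image_score(0.2, toy_map))
```

One consequence of this design is that a single strongly flagged patch can raise the final score even when the whole-image score is low, which helps with small defects that barely change the overall look of the image.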

Why This Matters

The paper tested this new system, referred to as "Tipsomaly," on 14 different datasets, ranging from industrial metal parts to medical scans (like brain MRIs).

  • The Result: It beat the previous best methods (which used the old, clunky CLIP system) in almost every category.
  • The Efficiency: It did this without building a massive, complex machine. It's like upgrading a car engine rather than adding a jetpack to the roof. It's lighter, faster, and just works better.
  • The Analogy: Imagine trying to find a needle in a haystack. The old way was to build a giant, noisy magnet that sometimes pulled up the whole haystack. The new way is to have a quiet, precise metal detector (TIPS) that knows exactly where to look, guided by a smart team of two detectives working in harmony.

In a Nutshell

The paper's message, paraphrased: stop trying to fix the old, blurry tools with complicated hacks. Instead, use a sharper tool (TIPS) and let it use two different strategies, one for the big picture and one for the details, to solve the problem simply and effectively.

This approach allows computers to spot defects in safety-critical areas (like factories or hospitals) even when they've never seen that specific type of defect before, making our world safer and more efficient.
