TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection

TokenCLIP proposes a token-wise prompt learning framework for zero-shot anomaly detection. It addresses the limitations of aligning all visual tokens to a single textual space by dynamically mapping each token to orthogonal textual subspaces via an optimal transport formulation, enabling fine-grained and efficient adaptation to diverse anomaly semantics.

Qihang Zhou, Binbin Gao, Guansong Pang, Xin Wang, Jiming Chen, Shibo He

Published 2026-03-02

The Big Problem: The "One-Size-Fits-All" Mistake

Imagine you are a security guard trying to spot intruders in a massive, complex building.

  • The Old Way (Previous Methods): You have a single, generic rulebook. It says, "If you see something weird, flag it."
    • If a crack appears in the floor, the rulebook works okay.
    • If a tumor appears in an X-ray, the rulebook works okay.
    • The Problem: The rulebook tries to be everything to everyone. It gets confused. It might miss a tiny crack because it's too focused on big tumors, or it might miss a subtle tumor because it's too busy looking for cracks. It's a "one-size-fits-all" approach that fails to catch the specific details of every type of problem.

In the world of AI, this is called "Indiscriminate Alignment." The AI tries to match every single part of an image (every "patch" or "token") to one single text description. It forces the AI to compromise, so it misses the rare or subtle anomalies.

The Solution: TokenCLIP (The "Specialized Squad")

The authors propose TokenCLIP, which changes the strategy completely. Instead of one generic rulebook, they give the AI a team of specialized experts (called "Textual Subspaces").

Think of it like a hospital emergency room:

  • Expert A only looks at broken bones.
  • Expert B only looks at skin rashes.
  • Expert C only looks at internal organ issues.

When a patient arrives, you don't ask Expert A to diagnose a skin rash. You send that specific part of the patient to the right expert.

TokenCLIP does this for images:

  1. It breaks an image into thousands of tiny pieces (tokens).
  2. Instead of forcing every piece to talk to one generic text, it asks: "Which expert is best for this specific piece?"
  3. A piece of a cracked tile gets sent to the "Crack Expert."
  4. A piece of a smooth background wall gets sent to the "Background Expert."

This allows the AI to learn the specific "language" of every tiny detail, leading to much better detection.
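The routing described above comes down to measuring how well each image token matches each textual "expert." Here is a minimal sketch of the naive per-token routing that the paper improves on, using random data and hypothetical shapes (64 tokens, 4 experts, 32-dim features); it is an illustration of the idea, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 64 visual tokens and 4 textual subspaces ("experts"),
# all embedded in the same 32-dim CLIP-like feature space.
num_tokens, num_experts, dim = 64, 4, 32
tokens = rng.normal(size=(num_tokens, dim))
experts = rng.normal(size=(num_experts, dim))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between every token and every expert.
sim = normalize(tokens) @ normalize(experts).T   # shape (64, 4)

# Greedy routing: each token independently picks its most similar expert.
# This is the baseline behavior that optimal transport later corrects.
assignment = sim.argmax(axis=1)
print(assignment[:8])  # expert index chosen by the first 8 tokens
```

Note that with greedy routing, nothing stops every token from picking the same expert, which is exactly the "crowding" failure the next section's optimal transport formulation is designed to prevent.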

The Magic Ingredient: The "Optimal Transport" Algorithm

You might ask: "How does the AI know which expert to send which piece to? Does it just guess?"

This is where the paper gets clever. They use a mathematical concept called Optimal Transport (OT).

The Analogy: The Moving Company
Imagine you have a pile of boxes (the image pieces) and a fleet of trucks (the expert text descriptions).

  • The Goal: Move every box to the truck that fits it best, using the least amount of fuel (cost).
  • The Rules:
    1. Every box must be moved.
    2. Every truck must be used (so no expert gets lazy or ignored).
    3. You want to minimize the total distance traveled.

The AI solves this "moving company" puzzle efficiently. It creates a Transport Plan:

  • "Send the 500 boxes from the cracked area to the Crack Truck."
  • "Send the 2,000 boxes from the sky to the Sky Truck."

Why is this better than just guessing?
If you just let each box pick its own truck based on how similar they look (a greedy approach), all the boxes might crowd onto the "Crack Truck" because it's the most popular. The other trucks sit empty and never learn anything.

  • TokenCLIP's OT method forces a fair distribution. It ensures every expert gets enough work to become truly specialized, while making sure every image piece gets the best possible expert.
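The "moving company" puzzle above is an entropy-regularized optimal transport problem, and the standard way to solve it is with Sinkhorn iterations. The sketch below uses random similarities and uniform capacities as stand-ins; the paper's exact cost function, marginals, and solver hyperparameters may differ:

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.05, iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: (n, m) cost matrix (here, 1 - cosine similarity),
    a: (n,) token weights, b: (m,) expert capacities; both sum to 1.
    Returns a transport plan T whose row sums match a and whose
    column sums match b -- the "no truck sits empty" constraint.
    """
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
n_tokens, n_experts = 64, 4
sim = rng.uniform(size=(n_tokens, n_experts))
cost = 1.0 - sim  # similar pairs are cheap to "transport"

# Uniform marginals: every token must be fully assigned (rule 1),
# and every expert must receive an equal 1/4 share of the total
# mass (rule 2), while total cost is minimized (rule 3).
a = np.full(n_tokens, 1.0 / n_tokens)
b = np.full(n_experts, 1.0 / n_experts)

T = sinkhorn(cost, a, b)
print(np.round(T.sum(axis=0), 3))  # each expert gets ~0.25 of the mass
```

Compare this with the greedy argmax: the column-sum constraint is what forces a fair workload across experts, so no subspace is starved of training signal.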

The "Top-K" Trick (Keeping it Simple)

The AI calculates a plan for every piece of the image to connect to every expert. That's a lot of math!
To make it fast, the paper uses a "Top-K" filter.

  • Imagine the AI says: "This piece of the image is 90% similar to the Crack Expert, 5% similar to the Sky Expert, and 0.1% similar to everyone else."
  • TokenCLIP says: "Okay, ignore the 0.1%. Just send this piece to the top 2 experts."

This keeps the system fast and focused, ignoring the noise.
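The Top-K filter can be sketched as a sparsification of the dense assignment matrix: keep only each token's k largest expert weights and renormalize. Everything here (the shapes, k=2, the renormalization step) is illustrative rather than the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense plan: 64 tokens x 8 experts, each row sums to 1.
plan = rng.uniform(size=(64, 8))
plan /= plan.sum(axis=1, keepdims=True)

def topk_sparsify(plan, k=2):
    """Keep only each token's k largest expert weights and renormalize,
    dropping the tiny, noisy assignments (the 0.1% cases)."""
    losers = np.argsort(plan, axis=1)[:, :-k]        # all but the top k
    sparse = plan.copy()
    np.put_along_axis(sparse, losers, 0.0, axis=1)   # zero them out
    return sparse / sparse.sum(axis=1, keepdims=True)

sparse = topk_sparsify(plan, k=2)
print((sparse > 0).sum(axis=1).max())  # at most 2 experts per token
```

After sparsification, each token only interacts with its two best-matched experts, which cuts the cost of the token-to-expert computation without changing which experts matter most.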

Why This Matters (The Results)

The paper tested this on two very different worlds:

  1. Industrial: Finding scratches on metal, cracks in tiles, or missing screws.
  2. Medical: Finding tumors in brain scans or polyps in the colon.

The Result: TokenCLIP beat almost every other AI method.

  • It found tiny, subtle defects that others missed (like a hairline fracture).
  • It worked on things it had never seen before (Zero-Shot), because it learned the concept of a defect, not just a specific picture of a defect.

Summary in One Sentence

TokenCLIP stops trying to force every part of an image to fit into one generic description; instead, it uses a smart mathematical system to assign every tiny piece of an image to the specific "expert" best suited to understand it, resulting in a much sharper and more accurate anomaly detector.
