TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection

TokenCLIP proposes a token-wise prompt learning framework for zero-shot anomaly detection. It addresses the limitations of aligning all visual tokens to a single textual space by dynamically mapping each token to orthogonal textual subspaces via an optimal transport formulation, enabling fine-grained and efficient adaptation to diverse anomaly semantics.

Qihang Zhou, Binbin Gao, Guansong Pang, Xin Wang, Jiming Chen, Shibo He

Published 2026-03-02

The Big Problem: The "One-Size-Fits-All" Mistake

Imagine you are a security guard trying to spot intruders in a massive, complex building.

  • The Old Way (Previous Methods): You have a single, generic rulebook. It says, "If you see something weird, flag it."
    • If a crack appears in the floor, the rulebook works okay.
    • If a tumor appears in an X-ray, the rulebook works okay.
    • The Problem: The rulebook tries to be everything to everyone. It gets confused. It might miss a tiny crack because it's too focused on big tumors, or it might miss a subtle tumor because it's too busy looking for cracks. It's a "one-size-fits-all" approach that fails to catch the specific details of every type of problem.

In the world of AI, this is called "Indiscriminate Alignment." The AI tries to match every single part of an image (every "patch" or "token") to one single text description. It forces the AI to compromise, so it misses the rare or subtle anomalies.

The Solution: TokenCLIP (The "Specialized Squad")

The authors propose TokenCLIP, which changes the strategy completely. Instead of one generic rulebook, they give the AI a team of specialized experts (called "Textual Subspaces").

Think of it like a hospital emergency room:

  • Expert A only looks at broken bones.
  • Expert B only looks at skin rashes.
  • Expert C only looks at internal organ issues.

When a patient arrives, you don't ask Expert A to diagnose a skin rash. You send that specific part of the patient to the right expert.

TokenCLIP does this for images:

  1. It breaks an image into thousands of tiny pieces (tokens).
  2. Instead of forcing every piece to talk to one generic text, it asks: "Which expert is best for this specific piece?"
  3. A piece of a cracked tile gets sent to the "Crack Expert."
  4. A piece of a smooth background wall gets sent to the "Background Expert."

This allows the AI to learn the specific "language" of every tiny detail, leading to much better detection.
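The routing described above comes down to measuring how well each image token matches each textual "expert." Here is a minimal sketch of the naive per-token routing that the paper improves on, using random data and hypothetical shapes (64 tokens, 4 experts, 32-dim features); it is an illustration of the idea, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 64 visual tokens and 4 textual subspaces ("experts"),
# all embedded in the same 32-dim CLIP-like feature space.
num_tokens, num_experts, dim = 64, 4, 32
tokens = rng.normal(size=(num_tokens, dim))
experts = rng.normal(size=(num_experts, dim))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between every token and every expert.
sim = normalize(tokens) @ normalize(experts).T   # shape (64, 4)

# Greedy routing: each token independently picks its most similar expert.
# This is the baseline behavior that optimal transport later corrects.
assignment = sim.argmax(axis=1)
print(assignment[:8])  # expert index chosen by the first 8 tokens
```

Note that with greedy routing, nothing stops every token from picking the same expert, which is exactly the "crowding" failure the next section's optimal transport formulation is designed to prevent.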

The Magic Ingredient: The "Optimal Transport" Algorithm

You might ask: "How does the AI know which expert to send which piece to? Does it just guess?"

This is where the paper gets clever. They use a mathematical concept called Optimal Transport (OT).

The Analogy: The Moving Company
Imagine you have a pile of boxes (the image pieces) and a fleet of trucks (the expert text descriptions).

  • The Goal: Move every box to the truck that fits it best, using the least amount of fuel (cost).
  • The Rules:
    1. Every box must be moved.
    2. Every truck must be used (so no expert gets lazy or ignored).
    3. You want to minimize the total distance traveled.

The AI solves this "moving company" puzzle efficiently. It creates a Transport Plan:

  • "Send the 500 boxes from the cracked area to the Crack Truck."
  • "Send the 2,000 boxes from the sky to the Sky Truck."

Why is this better than just guessing?
If you just let each box pick its own truck based on how similar they look (a greedy approach), all the boxes might crowd onto the "Crack Truck" because it's the most popular. The other trucks sit empty and never learn anything.

  • TokenCLIP's OT method forces a fair distribution. It ensures every expert gets enough work to become truly specialized, while making sure every image piece gets the best possible expert.
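The "moving company" puzzle above is an entropy-regularized optimal transport problem, and the standard way to solve it is with Sinkhorn iterations. The sketch below uses random similarities and uniform capacities as stand-ins; the paper's exact cost function, marginals, and solver hyperparameters may differ:

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.05, iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: (n, m) cost matrix (here, 1 - cosine similarity),
    a: (n,) token weights, b: (m,) expert capacities; both sum to 1.
    Returns a transport plan T whose row sums match a and whose
    column sums match b -- the "no truck sits empty" constraint.
    """
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
n_tokens, n_experts = 64, 4
sim = rng.uniform(size=(n_tokens, n_experts))
cost = 1.0 - sim  # similar pairs are cheap to "transport"

# Uniform marginals: every token must be fully assigned (rule 1),
# and every expert must receive an equal 1/4 share of the total
# mass (rule 2), while total cost is minimized (rule 3).
a = np.full(n_tokens, 1.0 / n_tokens)
b = np.full(n_experts, 1.0 / n_experts)

T = sinkhorn(cost, a, b)
print(np.round(T.sum(axis=0), 3))  # each expert gets ~0.25 of the mass
```

Compare this with the greedy argmax: the column-sum constraint is what forces a fair workload across experts, so no subspace is starved of training signal.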

The "Top-K" Trick (Keeping it Simple)

The AI calculates a plan for every piece of the image to connect to every expert. That's a lot of math!
To make it fast, the paper uses a "Top-K" filter.

  • Imagine the AI says: "This piece of the image is 90% similar to the Crack Expert, 5% similar to the Sky Expert, and 0.1% similar to everyone else."
  • TokenCLIP says: "Okay, ignore the 0.1%. Just send this piece to the top 2 experts."

This keeps the system fast and focused, ignoring the noise.
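The Top-K filter can be sketched as a sparsification of the dense assignment matrix: keep only each token's k largest expert weights and renormalize. Everything here (the shapes, k=2, the renormalization step) is illustrative rather than the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense plan: 64 tokens x 8 experts, each row sums to 1.
plan = rng.uniform(size=(64, 8))
plan /= plan.sum(axis=1, keepdims=True)

def topk_sparsify(plan, k=2):
    """Keep only each token's k largest expert weights and renormalize,
    dropping the tiny, noisy assignments (the 0.1% cases)."""
    losers = np.argsort(plan, axis=1)[:, :-k]        # all but the top k
    sparse = plan.copy()
    np.put_along_axis(sparse, losers, 0.0, axis=1)   # zero them out
    return sparse / sparse.sum(axis=1, keepdims=True)

sparse = topk_sparsify(plan, k=2)
print((sparse > 0).sum(axis=1).max())  # at most 2 experts per token
```

After sparsification, each token only interacts with its two best-matched experts, which cuts the cost of the token-to-expert computation without changing which experts matter most.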

Why This Matters (The Results)

The paper tested this on two very different worlds:

  1. Industrial: Finding scratches on metal, cracks in tiles, or missing screws.
  2. Medical: Finding tumors in brain scans or polyps in the colon.

The Result: TokenCLIP beat almost every other AI method.

  • It found tiny, subtle defects that others missed (like a hairline fracture).
  • It worked on things it had never seen before (Zero-Shot), because it learned the concept of a defect, not just a specific picture of a defect.

Summary in One Sentence

TokenCLIP stops trying to force every part of an image to fit into one generic description; instead, it uses a smart mathematical system to assign every tiny piece of an image to the specific "expert" best suited to understand it, resulting in a much sharper and more accurate anomaly detector.
