FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization

FiLo++ is a zero-/few-shot anomaly detection framework that enhances performance by generating task-specific fine-grained descriptions via large language models and employing a deformable localization module with multi-scale cross-modal interaction to accurately detect and localize anomalies of varying shapes and sizes.

Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, Jinqiao Wang

Published 2026-03-03

Imagine you are a quality control inspector at a factory. Your job is to spot defective products on a conveyor belt.

The Old Problem:
Traditionally, an AI inspector had to study a large collection of example images of that exact product before it could judge what counts as a "defect." But what if you have nothing to study yet? Maybe it's a new product line (a "cold start"), or maybe you only have one or two reference photos. This is the challenge of Zero-Shot (no examples of the product at all) and Few-Shot (only a handful of reference images, typically of normal items) anomaly detection.

Previous AI methods tried to solve this by asking a simple question: "Is this picture 'normal' or 'abnormal'?"

  • The Flaw: It's like asking a child to spot a specific type of bug in a garden by just saying, "Look for bugs." The child might get confused because "bugs" can look like leaves, dirt, or shadows. Also, if the bug is a weird shape or size, the child might miss it or think a whole bush is a bug.
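
As a rough illustration of that yes/no question, the sketch below compares an image embedding with two text embeddings, one for "normal" and one for "abnormal", and turns the two similarities into a probability. The three-number vectors are hand-made stand-ins for the output of a vision-language encoder, and the function names and the temperature value are assumptions for illustration, not details from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def anomaly_probability(image_vec, normal_vec, abnormal_vec, temperature=0.07):
    """Softmax over the two similarities; returns the 'abnormal' share."""
    logits = [
        cosine(image_vec, normal_vec) / temperature,
        cosine(image_vec, abnormal_vec) / temperature,
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    return exps[1] / sum(exps)


# Toy embeddings (assumed, not produced by a real model).
normal_text = [1.0, 0.0, 0.0]
abnormal_text = [0.0, 1.0, 0.0]
clean_image = [0.9, 0.1, 0.0]
scratched_image = [0.2, 0.8, 0.1]

p_clean = anomaly_probability(clean_image, normal_text, abnormal_text)
p_scratched = anomaly_probability(scratched_image, normal_text, abnormal_text)
```

With only two generic prompts, everything depends on how well the single word "abnormal" happens to match the defect, which is the weakness the bullet above describes.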

The New Solution: FiLo++
The authors of this paper created a new AI system called FiLo++ (Fused Fine-Grained Descriptions + Deformable Localization). Think of FiLo++ as a super-smart detective who doesn't just look for "bugs," but knows exactly what kind of bug to look for and where to look.

Here is how FiLo++ works, broken down into two main superpowers:

1. The Detective's Notebook (FusDes)

  • The Problem: Old AI used generic descriptions like "damaged" or "broken." This is too vague. A scratch on a car is different from a tear in a shirt.
  • The FiLo++ Fix: Before looking at the product, FiLo++ consults a Super-Brain (a Large Language Model like GPT-4).
    • Analogy: Imagine you are looking for a lost key. Instead of saying "Look for a lost thing," the Super-Brain says, "Look for a silver key with a blue tag, possibly bent, located near the door."
    • FiLo++ generates a custom, detailed list of what could be wrong with that specific object (e.g., "a scratch on the left side," "a missing screw," "a color stain").
    • It then uses a filter to remove the confusing or irrelevant descriptions, keeping only the most accurate ones. This makes the AI's "mental checklist" much sharper.
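
The notebook idea can be sketched in a few lines: turn each candidate description into a full prompt, then keep only the descriptions that match the image best. The description list is typed in by hand here (in FiLo++ it would come from the language model), the toy vectors stand in for real embeddings, and the ranking rule is one plausible filter chosen for illustration; the paper's actual filtering criterion may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_prompts(object_name, descriptions):
    """Attach each fine-grained description to the object name."""
    return [f"a photo of a {object_name} with {d}" for d in descriptions]


def filter_descriptions(image_vec, description_vecs, keep=2):
    """Rank descriptions by similarity to the image and keep the top `keep`."""
    ranked = sorted(
        description_vecs,
        key=lambda name: cosine(image_vec, description_vecs[name]),
        reverse=True,
    )
    return ranked[:keep]


# Hypothetical candidate descriptions with toy embeddings.
description_vecs = {
    "a crack on the rim": [1.0, 0.0, 0.0],
    "a color stain": [0.0, 1.0, 0.0],
    "a missing screw": [0.0, 0.0, 1.0],  # irrelevant for a bottle
}
bottle_image = [0.7, 0.6, 0.1]

kept = filter_descriptions(bottle_image, description_vecs, keep=2)
prompts = build_prompts("bottle", kept)
```

Here the "missing screw" entry is dropped because it barely matches the bottle image, so the checklist that reaches the detector contains only the two plausible defects.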

2. The Shape-Shifting Magnifying Glass (DefLoc)

  • The Problem: Once the AI knows what to look for, it needs to find where it is. Old methods tried to compare tiny square patches of the image to text.
    • Analogy: Imagine trying to find a specific irregularly shaped stain on a rug by comparing it to a rigid, square cookie cutter. If the stain is round, long, or jagged, the cookie cutter won't fit, and you'll miss it. Or, you might mistake a shadow in the background for the stain.
  • The FiLo++ Fix: FiLo++ uses a Shape-Shifting Magnifying Glass.
    • Step 1 (The Guide): It first uses a tool called Grounding DINO to roughly point out, "Hey, the object is here, ignore the background." This stops the AI from getting distracted by the floor or walls.
    • Step 2 (The Position): It adds the rough location to the description, telling the AI something like, "The defect is likely in the top-left corner."
    • Step 3 (The Magic Lens): Instead of using a rigid square lens, FiLo++ uses Deformable Convolution.
      • Analogy: Imagine a lens made of soft clay. If the defect is a long, thin crack, the lens stretches out to cover the crack. If the defect is a small dot, the lens shrinks. It molds itself to the exact shape and size of the problem, ensuring it doesn't miss anything weirdly shaped.
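
To make the "soft clay lens" concrete, here is a minimal sketch of a single deformable-kernel response. Each of the nine taps of a 3x3 kernel is moved by its own (dy, dx) offset before the image is sampled, and bilinear interpolation handles fractional positions. In a real deformable convolution the offsets are predicted by the network; here they are set by hand, and all names are illustrative assumptions.

```python
import math


def bilinear(image, y, x):
    """Sample `image` at a fractional (y, x); outside the image reads as 0."""
    height, width = len(image), len(image[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < height and 0 <= xx < width:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * image[yy][xx]
    return value


def deformable_response(image, center, weights, offsets):
    """One output value of a 3x3 kernel whose taps are shifted by `offsets`."""
    cy, cx = center
    total = 0.0
    for i, dy in enumerate((-1, 0, 1)):
        for j, dx in enumerate((-1, 0, 1)):
            off_y, off_x = offsets[i][j]
            total += weights[i][j] * bilinear(image, cy + dy + off_y, cx + dx + off_x)
    return total


# A 5x9 image with a long, thin horizontal "crack" on row 2.
crack_image = [[1.0 if row == 2 else 0.0 for _ in range(9)] for row in range(5)]
weights = [[1 / 9] * 3 for _ in range(3)]  # simple averaging kernel

rigid_offsets = [[(0, 0)] * 3 for _ in range(3)]  # ordinary square kernel
stretched_offsets = [
    [(1, -3)] * 3,  # top taps drop onto the crack and slide left
    [(0, 0)] * 3,   # middle taps stay where they are
    [(-1, 3)] * 3,  # bottom taps rise onto the crack and slide right
]

rigid = deformable_response(crack_image, (2, 4), weights, rigid_offsets)          # roughly 0.33
stretched = deformable_response(crack_image, (2, 4), weights, stretched_offsets)  # roughly 1.0
```

The rigid square only overlaps three crack pixels, while the stretched kernel lays all nine taps along the crack, which is the behavior the clay-lens analogy describes.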

The "Few-Shot" Bonus

If you do have a few reference photos of the product (1 to 4, typically showing normal items), FiLo++ uses a special trick. It takes the "rough map" from the Shape-Shifting Magnifying Glass and says, "Okay, only compare the new item to the reference photos in these specific areas." This keeps the rest of the image from confusing the comparison, so the AI stays accurate even with very little data.
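
A minimal sketch of that restricted comparison follows, assuming the reference photos show normal items, as is usual in this setting. Each patch of the new image is scored by how poorly it matches its closest reference patch, and patches outside the rough map are skipped. The two-number patch features, the mask, and the function names are illustrative assumptions rather than the paper's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def patch_anomaly_scores(query_patches, reference_patches, region_mask):
    """Score = 1 - best similarity to any reference patch; 0 outside the mask."""
    scores = []
    for patch, inside in zip(query_patches, region_mask):
        if not inside:
            scores.append(0.0)  # outside the rough map: not compared at all
            continue
        best = max(cosine(patch, ref) for ref in reference_patches)
        scores.append(1.0 - best)
    return scores


# Toy patch features from the reference photos (assumed to be normal).
reference_patches = [[1.0, 0.0], [0.9, 0.1]]

# Three patches from the new image: normal surface, a defect, background clutter.
query_patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
region_mask = [True, True, False]  # the clutter lies outside the rough map

scores = patch_anomaly_scores(query_patches, reference_patches, region_mask)
```

The second and third patches have identical features, but only the one inside the rough map is flagged; the background clutter never gets the chance to raise a false alarm.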

Why It Matters

  • Speed: You don't need to spend months collecting thousands of example images to train the AI. You can start immediately.
  • Precision: It doesn't just say "Something is wrong." It can tell you what is wrong (e.g., "It's a hole in the fabric") and exactly where it is, even if the hole is a weird shape.
  • Versatility: It works on factory parts, medical scans (like finding tumors in MRI images), and more.

In Summary:
FiLo++ is like upgrading from a detective who only knows the word "crime" to a detective who has a detailed file on every likely type of crime, knows exactly where to look, and carries a magnifying glass that can stretch and shrink to fit the clue. This allows it to spot problems without lengthy product-specific training, even if it has never seen that specific problem before.