EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

The paper proposes EAGLE, a tuning-free framework that leverages expert model outputs to guide Multimodal Large Language Models toward accurate and interpretable industrial anomaly detection without requiring parameter updates, achieving performance comparable to fine-tuned methods.

Xiaomeng Peng, Xilang Huang, Seon Han Choi

Published 2026-02-25

The Big Problem: The "Smart" Inspector Who Can't See the Forest for the Trees

Imagine a factory that makes thousands of products every day. It needs to find the rare defective item (a scratch on a phone screen, a tear in a piece of fabric).

  • Old Way (Deep Learning): They used to hire a robot that was incredibly good at spotting the scratch. It would say, "Defect found!" or "No defect." But if you asked it, "Where is it?" or "What kind of scratch is it?", it would just stare blankly. It was a genius at finding problems but terrible at explaining them.
  • New Way (Multimodal Large Language Models - MLLMs): Then, they tried hiring a "Super-Intellect" (like a very smart AI chatbot that can see images). This AI is great at talking. It can say, "I see a deep scratch on the left side of the zipper, likely caused by a metal tool."
    • The Catch: This Super-Intellect is a bit of a daydreamer. It often trusts what it thinks it should see based on its reading habits, rather than what is actually in the picture. It might look at a perfect shirt and say, "I see a stain," because it's used to reading about stains. It also needs a lot of expensive training to learn the specific job, which takes too much time and money.

The Solution: EAGLE (The Expert Guide)

The authors of this paper created EAGLE. Think of EAGLE as a tactical partnership between the "Super-Intellect" (the AI chatbot) and a "Veteran Factory Inspector" (a specialized, simple AI).

The goal? To get the chatbot to give perfect answers without retraining it or teaching it new things.

Here is how EAGLE works, step-by-step:

1. The Veteran Inspector (The Expert Model)

First, they bring in a "Veteran Inspector." This is a specialized AI (based on something called PatchCore) that is only trained to spot defects. It doesn't talk; it just points.

  • The Problem with the Veteran: Sometimes, the Veteran gets a little paranoid. It might point at a normal shirt and say, "Hey, look at this tiny speck! It's suspicious!" If you show this to the Super-Intellect, the chatbot might get confused and think, "Oh, the Veteran says it's broken, so I must say it's broken," even if it's actually fine.

2. The "Smart Filter" (Distribution-Based Thresholding - DBT)

To stop the chatbot from getting confused by the Veteran's paranoia, EAGLE adds a Smart Filter.

  • How it works: The Smart Filter looks at how the Veteran behaves on perfectly good items. It learns, "Okay, the Veteran usually points at things with a score of 1 or 2. If the score is 10, that's a real problem. If it's 3, it's probably just noise."
  • The Result: The filter only lets the Veteran's "pointing" (visual hints) through to the chatbot if the problem is real and serious. If the Veteran is just being paranoid about a normal item, the filter blocks the hint. This stops the chatbot from making false alarms.
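The Smart Filter idea can be sketched as a statistical gate: fit the distribution of the expert's scores on defect-free items, then forward a hint only when a test score clearly exceeds that range. The mean-plus-k-sigma rule and the parameter `k` below are illustrative assumptions, not necessarily the paper's exact formula.

```python
import numpy as np

def fit_threshold(normal_scores, k=3.0):
    # Learn what "normal paranoia" looks like from scores the
    # expert assigns to defect-free items.
    mu, sigma = np.mean(normal_scores), np.std(normal_scores)
    return mu + k * sigma

def gate_hint(score, threshold):
    # Forward the expert's visual hint to the MLLM only when the
    # score clearly exceeds the normal range; otherwise block it.
    return score > threshold
```

For example, if normal items score around 1-2, a test score of 10 passes the gate while a score of 3 is blocked as noise, which is exactly the behavior described above.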

3. The "Confidence Boost" (Confidence-Aware Attention Sharpening - CAAS)

Sometimes, even the Veteran isn't 100% sure. Maybe the defect is very subtle.

  • The Problem: When the Veteran is unsure, it might give a confusing hint like, "I think this is normal." The Super-Intellect (chatbot) is very stubborn and loves to listen to words. If the Veteran says "Normal," the chatbot might ignore the visual evidence of the scratch and just say "Normal."
  • The Fix: EAGLE has a special switch called CAAS. When the Veteran is unsure (the score is in a "gray area"), EAGLE tells the chatbot: "Hey, don't just listen to the words! Look harder at the picture!"
  • The Analogy: Imagine you are taking a test. Your friend whispers, "I think the answer is B." But you look at the question and see the answer is clearly A. If you are confident, you ignore your friend. But if you are unsure, you might panic and listen to your friend. EAGLE forces the chatbot to squint harder at the picture (visual evidence) whenever it feels unsure, overriding the confusing text hints.
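The CAAS "switch" can be sketched as follows: when the expert's score lands in an uncertain gray zone, boost the attention the model pays to image tokens before the softmax, so visual evidence outweighs the ambiguous text hint. The gray-zone bounds, the `boost` factor, and operating on a single attention row are all simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sharpen_attention(attn_logits, is_image_token, score, lo, hi, boost=2.0):
    # When the expert's score falls in the uncertain "gray area"
    # [lo, hi], amplify attention toward image tokens so the model
    # weighs visual evidence over a possibly misleading text hint.
    logits = attn_logits.copy()
    if lo <= score <= hi:
        logits[is_image_token] += np.log(boost)
    return softmax(logits)
```

Outside the gray zone the attention is left untouched, so the mechanism only intervenes when the expert itself is unsure, matching the "squint harder at the picture" behavior described above.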

Why is this a Big Deal?

  1. No Training Required: Usually, to make a Super-Intellect good at a specific job, you have to spend weeks teaching it (Fine-tuning). EAGLE is Tuning-Free. It's like giving the chatbot a pair of glasses and a cheat sheet, rather than sending it to school for a new degree.
  2. It Actually Works: The paper tested this on real factory datasets (MVTec-AD and VisA). The results showed that EAGLE made the chatbot almost as good as the specialized "Veteran Inspector" at finding defects, but with the added superpower of being able to explain the defect in human language.
  3. It Fixes the "Daydreamer" Issue: By analyzing how the chatbot's brain works (its "attention"), the authors found that when the chatbot gets the answer right, it is actually looking at the defect. EAGLE just helps it keep its eyes on the prize.

Summary in a Nutshell

EAGLE is like hiring a Senior Expert to guide a Genius Intern.

  • The Expert spots the problem.
  • A Filter ensures the Expert only speaks up when they are sure, preventing false alarms.
  • A Focus Mechanism tells the Intern to trust their eyes over the Expert's words when things get tricky.

The result? A factory inspection system that is fast, accurate, doesn't need expensive training, and can explain exactly what went wrong in plain English.
