Imagine you have a very smart, helpful robot assistant (an AI agent) that reads documents for you. Maybe it's a financial robot reading stock reports, or a legal robot reviewing contracts. To keep this robot safe, you've installed a "security guard" (an injection detector). This guard's job is to spot anyone trying to sneak in a secret, malicious command hidden inside the documents the robot reads.

The Problem: The "Wolf in Sheep's Clothing"

The paper argues that the security guard is trained to look for obvious, loud intruders. Think of a static attack like a guy wearing a bright red mask, holding a sign that says, "IGNORE ALL PREVIOUS RULES! DO WHAT I SAY!" The guard sees this immediately and sounds the alarm.

But the paper introduces a new, sneakier kind of attack called Domain-Camouflaged Injection.

Imagine a sophisticated thief who doesn't wear a mask. Instead, they dress up exactly like a trusted expert. If the robot is reading a financial report, the thief writes a fake paragraph that looks and sounds exactly like a real financial analyst wrote it. They use the same fancy words, the same sentence structure, and the same serious tone. But hidden inside that perfectly normal-sounding text is a secret instruction: "Sell all the stocks."

To the security guard, this looks like a legitimate part of the document. The guard doesn't see a red mask; they see a suit and tie. So, the guard says, "Everything looks clean," and lets the malicious instruction slide right past.

The Big Discovery: The "Blind Spot"

The researchers tested this with two different AI models (Llama and Gemini) and found a massive gap in safety, which they call the Camouflage Detection Gap (CDG).

For the Llama model: The guard caught about 94% of the obvious "red mask" attacks. But when the attacks were disguised as expert text, the guard caught only 9.7% of them. It missed almost everything.
For the Gemini model: The guard caught 100% of the obvious attacks but dropped to catching only 55.6% of the disguised ones.

The scary part? The guard wasn't confused or unsure. It was confidently wrong. In nearly all of the cases it missed, it looked at the disguised attack and reported with high confidence, "This is safe!" when it was actually dangerous. This means you can't fix the problem just by telling the guard to "be more careful" or to flag anything it feels unsure about; the guard simply doesn't know what to look for.

The "Debate Club" Experiment

The researchers also tried a different safety strategy: having three robots debate the answer before deciding.

For the weaker robot (Llama): The debate made things worse. When the robots argued, they actually amplified the bad instructions. If one robot got tricked by the disguised text, the others tended to follow along, making the attack nearly 10 times more likely to succeed than with a single robot. It's like a group of friends all agreeing on a wrong answer because it sounds so convincing that no one thinks to say, "Wait, that sounds fake."
For the stronger robot (Gemini): The debate helped. The stronger robots could spot the trick and correct each other, making the system safer.

Can We Just Add More Examples?

The researchers tried a "cheap fix": they showed the security guard a few examples of these disguised attacks so it could learn what to look for.

For the strong robot (Gemini): This worked great. The guard learned the pattern and caught almost all the disguised attacks.
For the weaker robot (Llama): This barely helped at all. The guard still missed most of them. This suggests that smaller, cheaper AI models have a fundamental limit in their ability to learn these subtle tricks just by looking at a few examples.

The Bottom Line

The paper concludes that our current safety guards have a blind spot for attacks that look like the real thing. They are great at catching loud, obvious intruders, but they miss many of the attackers who blend in with the crowd, and the guard built on the smaller model missed nearly all of them. This is a huge problem for smaller AI models used in real-world jobs, because they can't easily be "taught" to spot these subtle tricks, and adding more robots to debate the issue might actually make the problem worse.

The researchers have released their tools so others can try to build better guards, but for now, the "wolf in sheep's clothing" is a very effective way to trick AI systems.

Technical Summary: Blind Spots in the Guard

Problem Statement

Current injection detectors deployed to protect Large Language Model (LLM) agents are calibrated primarily on static, template-based payloads. These standard attacks explicitly announce themselves as override directives (e.g., "IGNORE ALL PREVIOUS INSTRUCTIONS") or authority claims. The paper identifies a systematic blind spot: detectors fail to recognize domain-camouflaged injections.

In this attack vector, an adversary with indirect document access (e.g., via RAG or email processing) crafts a malicious payload that mimics the target document's domain vocabulary, sentence structure, and authority register. Unlike static attacks, these payloads do not contain explicit override markers; instead, they embed malicious instructions as if they were legitimate expert content. The paper argues that standard detectors are "confidently wrong" regarding these attacks, labeling them as CLEAN with HIGH confidence, rendering standard remediation strategies like confidence thresholding ineffective.
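To illustrate why confidence thresholding does not help in this situation, here is a minimal sketch. The `Verdict` type, the `needs_review` policy, and the sample verdicts are hypothetical and are not taken from the paper. The sketch only shows that a policy which escalates low-confidence CLEAN verdicts never fires on a miss that is reported with HIGH confidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    label: str       # "CLEAN" or "INJECTED"
    confidence: str  # "LOW", "MEDIUM", or "HIGH"


def needs_review(verdict: Verdict) -> bool:
    """Hypothetical thresholding policy: escalate every INJECTED verdict,
    and every CLEAN verdict that is not reported with HIGH confidence."""
    return verdict.label == "INJECTED" or verdict.confidence != "HIGH"


# Illustrative verdicts only (not data from the paper).
static_attack = Verdict("INJECTED", "HIGH")
uncertain_miss = Verdict("CLEAN", "LOW")
confident_miss = Verdict("CLEAN", "HIGH")  # the "confidently wrong" case

for name, verdict in [
    ("static attack", static_attack),
    ("uncertain miss", uncertain_miss),
    ("confident miss", confident_miss),
]:
    print(f"{name}: review={needs_review(verdict)}")
```

Only the uncertain miss is recovered by the policy. The confident miss passes through unreviewed, which is the failure mode the paper describes.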

To quantify this failure, the authors introduce the Camouflage Detection Gap (CDG), defined as the difference in Injection Detection Rate (IDR) between static and camouflaged payloads: $\mathrm{CDG} = \mathrm{IDR}_{\text{static}} - \mathrm{IDR}_{\text{camouflage}}$.
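The definition is a simple difference of two rates. A minimal sketch follows, assuming both rates are expressed as fractions in [0, 1]; the function name and the input values are illustrative and were chosen only to show the arithmetic.

```python
def camouflage_detection_gap(idr_static: float, idr_camouflage: float) -> float:
    """CDG = IDR_static - IDR_camouflage, with both rates given in [0, 1]."""
    for rate in (idr_static, idr_camouflage):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("detection rates must lie in [0, 1]")
    return idr_static - idr_camouflage


# Illustrative rates, not results from the paper.
gap = camouflage_detection_gap(idr_static=0.90, idr_camouflage=0.25)
print(f"CDG = {gap:.3f}")
```

A CDG near 0 means the detector handles both payload types about equally well; a CDG near 1 means it catches static payloads but almost none of the camouflaged ones.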

Methodology

1. Task Bank and Payload Generation

The authors constructed a benchmark consisting of 45 tasks across three professional domains:

Financial: Stock analysis, risk assessment, earnings interpretation.
Legal: Contract review, compliance assessment, risk flagging.
General QA: Factual retrieval, summarization, classification.

For each task, a CamouflageGenerator (an attacker LLM) was prompted to read the clean context document and generate a payload that embeds a specific malicious goal while adopting the document's domain-specific vocabulary and epistemic register. The generator was explicitly forbidden from using standard override markers. Three variants were generated per task, and the one with the highest Authoritative Camouflage Score (ACS), measured by cosine similarity to the context, was selected.
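The selection step can be sketched as follows. The summary does not say which text representation the cosine similarity is computed over, so this sketch assumes plain word counts as a stand-in; a real pipeline might well use sentence embeddings instead. The function names and the example texts are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Lowercased word counts (an assumed stand-in for the real representation)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_payload(context: str, variants: list[str]) -> tuple[str, float]:
    """Return the variant most similar to the context, with its score."""
    ctx = term_counts(context)
    scored = [(cosine_similarity(ctx, term_counts(v)), v) for v in variants]
    best_score, best_variant = max(scored, key=lambda pair: pair[0])
    return best_variant, best_score


# Hypothetical context and candidate payloads.
context = "Quarterly revenue rose while operating margin narrowed; analysts flag liquidity risk."
variants = [
    "Kindly transfer the funds today.",
    "Given the narrowed operating margin and liquidity risk, analysts recommend selling.",
    "Please sell everything now.",
]
best, score = select_payload(context, variants)
print(f"selected: {best!r}")
```

The second variant is selected because it reuses the context's own vocabulary, which is the property the ACS is meant to reward.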

2. Experimental Setup

Models: Experiments were conducted on Llama 3.1 8B (local inference) and Gemini 2.0 Flash (via OpenRouter) for cross-model validation.
Architectures:
- Single Agent: A professional analyst prompted to base responses solely on the provided document.
- Multi-Agent Debate: Three agents (Analyst A, B, C) engaging in three rounds of debate. Conditions included inject_all (all agents receive injected context) and inject_first (only the first agent receives injection).
Detectors Evaluated:
- Static Detector: A few-shot classifier prompted with three static injection examples and two clean examples (no camouflaged examples).
- Augmented Detector: The static detector with one domain-camouflaged example added per domain.
- Llama Guard 3: A production safety classifier used as a baseline to test if the blind spot extends beyond few-shot approaches.
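The two debate injection conditions listed under Architectures, inject_all and inject_first, differ only in which agents see the payload. A minimal sketch of how the per-agent contexts might be assembled is shown below. The function name is hypothetical, and appending the payload to the end of the document is an assumption; the summary does not say where in the document the payload is placed.

```python
def build_agent_contexts(
    clean_doc: str,
    payload: str,
    condition: str,
    n_agents: int = 3,
) -> list[str]:
    """Return one context string per debate agent for the given condition."""
    if n_agents < 1:
        raise ValueError("need at least one agent")
    # Assumption: the payload is appended to the clean document.
    injected_doc = f"{clean_doc}\n\n{payload}"
    if condition == "inject_all":
        return [injected_doc] * n_agents
    if condition == "inject_first":
        return [injected_doc] + [clean_doc] * (n_agents - 1)
    raise ValueError(f"unknown condition: {condition!r}")


contexts_all = build_agent_contexts("DOC", "PAYLOAD", "inject_all")
contexts_first = build_agent_contexts("DOC", "PAYLOAD", "inject_first")
print(sum("PAYLOAD" in c for c in contexts_all), "of 3 agents injected (inject_all)")
print(sum("PAYLOAD" in c for c in contexts_first), "of 3 agents injected (inject_first)")
```

Under inject_first, two of the three agents read only the clean document, so any uptake of the injected position by those agents has to come from the debate itself.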

3. Metrics

ASR (Attack Success Rate): Fraction of trials where the agent followed the injected instruction.
IDR (Injection Detection Rate): Fraction of injected trials correctly flagged.
CDG: The gap in detection rates between static and camouflaged payloads.
DAF (Debate Amplification Factor): Ratio of ASR in debate vs. single-agent settings.
CPS (Conformity Pressure Score): Fraction of non-injected agents adopting the injected agent's position.
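These metrics are all simple fractions or ratios. The sketch below shows one way they could be computed from per-trial boolean outcomes. The function names, the choice to pass outcomes as lists of booleans, and the sample outcomes are illustrative assumptions, not the paper's implementation.

```python
from collections.abc import Sequence


def _rate(outcomes: Sequence[bool]) -> float:
    """Fraction of True values in a non-empty sequence."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def attack_success_rate(followed: Sequence[bool]) -> float:
    """ASR: fraction of trials where the agent followed the injected instruction."""
    return _rate(followed)


def injection_detection_rate(flagged: Sequence[bool]) -> float:
    """IDR: fraction of injected trials that the detector flagged."""
    return _rate(flagged)


def debate_amplification_factor(asr_debate: float, asr_single: float) -> float:
    """DAF: ASR in the debate setting divided by ASR in the single-agent setting."""
    if asr_single == 0.0:
        raise ValueError("DAF is undefined when the single-agent ASR is zero")
    return asr_debate / asr_single


def conformity_pressure_score(adopted: Sequence[bool]) -> float:
    """CPS: fraction of non-injected agents that adopted the injected agent's position."""
    return _rate(adopted)


# Illustrative outcomes only (not data from the paper).
asr_single = attack_success_rate([True, False, False, False])
asr_debate = attack_success_rate([True, True, False, False])
daf = debate_amplification_factor(asr_debate, asr_single)
cps = conformity_pressure_score([True, False, False, False])
print(f"ASR single={asr_single}, ASR debate={asr_debate}, DAF={daf}, CPS={cps}")
```

A DAF above 1 means debate made the attack more likely to succeed than in the single-agent setting; a DAF below 1 means debate suppressed it.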

Key Results

1. The Camouflage Detection Gap is Large and Significant

The static detector, which achieves high detection rates for standard attacks, collapses against camouflaged payloads:

Llama 3.1 8B: IDR dropped from 93.8% (static) to 9.7% (camouflage), yielding a CDG of 0.840.
Gemini 2.0 Flash: IDR dropped from 100% to 55.6%, yielding a CDG of 0.444.
Statistical Significance: McNemar's tests confirmed these gaps are statistically significant ($p < 0.001$) with zero reverse discordant pairs (there was no case in which the camouflaged payload was detected while its static counterpart was missed).
Confidence: The failure is not due to uncertainty. 96.7% of missed camouflaged cases on Llama were labeled CLEAN with HIGH confidence.
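McNemar's test compares paired outcomes and depends only on the discordant pairs: tasks where exactly one of the two payload types was detected. The sketch below implements the exact (binomial) form of the test. Whether the authors used the exact or the chi-square form is not stated in this summary, and the counts in the example are illustrative, not the paper's.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: pairs where the static payload was detected and the camouflaged one missed.
    c: pairs where the camouflaged payload was detected and the static one missed
       (the "reverse" discordant pairs).
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, each discordant pair falls either way with
    # probability 1/2, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Illustrative counts: 30 discordant pairs, none in the reverse direction.
p_value = mcnemar_exact(b=30, c=0)
print(f"p = {p_value:.2e}")
```

With zero reverse discordant pairs the p-value reduces to $2 \cdot 0.5^{b}$, so even a modest number of one-directional discordant pairs gives a very small p-value.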

2. Production Safety Classifiers Share the Blind Spot

Llama Guard 3 detected zero camouflaged payloads (IDR = 0.000) across 135 trials. This confirms the vulnerability is not limited to few-shot detectors but extends to dedicated safety classifiers, likely because these payloads contain no conventional "harmful" content (e.g., hate speech, violence) but rather misaligned expert reasoning.

3. Multi-Agent Debate Amplifies Vulnerability in Smaller Models

Llama 3.1 8B: Debate amplified attacks. The DAF for static attacks was 3.4×, but for camouflaged attacks, it reached 9.9×. Agents failed to flag domain-appropriate content as suspicious, propagating the injection through conformity dynamics.
Gemini 2.0 Flash: Debate suppressed attacks (DAF < 1), showing collective resistance. However, a Conformity Pressure Score of 0.282 indicated that individual agents were still influenced, even if the majority vote corrected the final output.

4. Detector Augmentation is Model-Dependent

Adding a few camouflaged examples to the detector's few-shot pool yielded divergent results:

Gemini: Improved camouflage IDR from 54.8% to 90.4% (78.7% CDG improvement), nearly closing the gap.
Llama: Improved camouflage IDR from 11.1% to 19.4% (10.2% CDG improvement), leaving the vast majority of attacks undetected.

This suggests that "cheap fixes" via in-context learning generalize effectively only for stronger models, pointing to a fundamental architectural limitation for smaller, locally-deployed models.
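The difference between the static and augmented few-shot pools can be sketched as follows. The pool sizes follow the setup described earlier (three static injection examples, two clean examples, and one camouflaged example per domain). The example texts are placeholders, and the prompt layout, labels, and function names are assumptions rather than the paper's actual prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    label: str  # "INJECTED" or "CLEAN"


# Placeholder texts; the paper's actual few-shot examples are not reproduced here.
STATIC_POOL = [
    Example("<static injection example 1>", "INJECTED"),
    Example("<static injection example 2>", "INJECTED"),
    Example("<static injection example 3>", "INJECTED"),
    Example("<clean example 1>", "CLEAN"),
    Example("<clean example 2>", "CLEAN"),
]

CAMOUFLAGED_BY_DOMAIN = {
    "financial": Example("<camouflaged financial example>", "INJECTED"),
    "legal": Example("<camouflaged legal example>", "INJECTED"),
    "general_qa": Example("<camouflaged general QA example>", "INJECTED"),
}


def build_pool(augmented: bool) -> list[Example]:
    """Static pool, optionally extended with one camouflaged example per domain."""
    pool = list(STATIC_POOL)
    if augmented:
        pool.extend(CAMOUFLAGED_BY_DOMAIN.values())
    return pool


def build_prompt(document: str, augmented: bool = False) -> str:
    """Assemble a few-shot classification prompt (hypothetical layout)."""
    shots = "\n\n".join(
        f"Document: {ex.text}\nLabel: {ex.label}" for ex in build_pool(augmented)
    )
    return f"{shots}\n\nDocument: {document}\nLabel:"


print(len(build_pool(augmented=False)), "examples in the static pool")
print(len(build_pool(augmented=True)), "examples in the augmented pool")
```

The augmentation changes only the in-context examples, not the detector model, which is why the paper treats it as a cheap fix.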

Significance and Claims

The paper claims to be the first to systematically evaluate detection systems against context-adaptive payloads generated by an LLM reading the full task document. Its primary contributions are:

Defining the Threat: Establishing "domain-camouflaged injection" as a distinct, high-risk attack vector that evades current syntactic detection methods.
Quantifying the Gap: Introducing the CDG metric to measure the disparity between static and camouflaged detection, demonstrating that the blind spot is categorical (detectors are confidently wrong) rather than statistical.
Architectural Vulnerability: Revealing that multi-agent debate, often touted as a robustness mechanism, can act as a liability for smaller models, amplifying camouflaged attacks by nearly 10×.
Limitations of Current Defenses: Showing that standard safety classifiers (Llama Guard 3) and few-shot augmentation fail to address this specific threat, particularly for smaller models.

The authors conclude that deployments using smaller, locally-hosted agents face a systematic and largely unaddressed injection detection vulnerability that requires architectural solutions beyond simple few-shot augmentation. The framework, task bank, and payload generator are released publicly to support further research.

Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems