BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

Imagine you have a magical art studio called "Text-to-Image." You type a description like "a cute dog sitting on a green lawn," and the studio instantly paints a perfect picture of that dog. It's amazing, but what if someone secretly tampered with the studio's brushes?

What if, every time you asked for a dog, the studio secretly painted a cat instead? Or what if it always added a weird little patch of static to the corner of the picture? This is called a Backdoor Attack. The studio looks normal 99% of the time, but when a secret "trigger" (like a specific invisible word) is in your request, it hijacks the painting to produce something you never asked for.

The problem? Most of the time, you can't see inside the studio. You just send a request and get a picture back. This is called a Black-Box setting. You don't know the artist's secrets, the tools they use, or how they mix the paint.

Enter BlackMirror, a new security guard for these art studios. Here is how it works, explained simply:

The Old Way: The "Blurry Photo" Detective

Previous security guards tried to catch these fakes by looking at the overall similarity of the pictures.

The Logic: "If I ask for a dog 10 times, and the studio gives me 10 identical pictures of a cat, that's suspicious! Real artists are a bit messy and make different dogs every time."
The Flaw: Modern hackers are sneaky. They don't make the whole picture a cat. They just swap the dog for a cat while keeping the rest of the scene (the grass, the sky, the fence) exactly the same.
The Result: The old guard looks at the picture and says, "Hmm, this looks 95% like a normal dog picture. I'll let it pass." The hacker gets away with it.

The New Way: BlackMirror (The "Instruction vs. Reality" Inspector)

BlackMirror realizes that the hacker isn't changing the whole picture; they are only changing specific parts that don't match your instructions. It uses a two-step process, like a very thorough art critic.

Step 1: MirrorMatch (The "Did You Listen?" Check)

Imagine you give a chef an order: "Make me a burger with cheese."
The chef brings you a plate. BlackMirror doesn't just look at the plate; it asks a smart AI assistant (a Vision-Language Model) to describe exactly what is on the plate.

The AI says: "I see a burger, but the meat is actually a fish, and there's a pineapple on top."
The Comparison: BlackMirror compares the AI's description to your order.
- Order: Burger, Cheese.
- Reality: Burger, Fish, Pineapple.
The Alarm: "Wait! You asked for cheese, but you got fish and pineapple! That's a mismatch!"

This is the MirrorMatch phase. It breaks the image down into tiny pieces (objects, styles, patches) and checks if they match your text. If the text says "dog" but the image has a "cat," it flags it.

Step 2: MirrorVerify (The "Is This a Glitch or a Trick?" Check)

Here is the tricky part. Sometimes, the chef just makes a mistake or gets confused. Maybe they accidentally put a pineapple on the burger because they were tired. That's a glitch, not a secret hack. We don't want to fire the chef for one mistake.

BlackMirror asks a second question: "Does this mistake happen every single time, even if I change the order slightly?"

The Test: BlackMirror asks the studio to make the picture again, but this time it changes the background slightly.
- Original Order: "A dog on a green lawn." -> Result: A Cat. (Suspicious!)
- New Order: "A dog on a red lawn." -> Result: A Cat. (Still suspicious!)
- New Order: "A dog on a blue lawn." -> Result: A Cat. (Definitely suspicious!)
The Logic:
- If the "Cat" keeps appearing no matter how you change the background, it's a Backdoor. The studio is secretly programmed to turn dogs into cats.
- If the "Cat" only appeared once and then the studio painted a normal dog the next time, it was just a glitch.

This is MirrorVerify. It checks for stability. Real hackers are consistent; their tricks work every time. Accidental errors are random and unstable.

Why is this a big deal?

It doesn't need the "Secret Sauce": You don't need to see the studio's internal code or weights. You just need to talk to it and look at the pictures.
It catches the sneaky ones: It finds the hackers who only change a tiny part of the image (like swapping a dog for a cat) rather than the ones who make the whole image weird.
It's fast and plug-and-play: You can install this security guard on any text-to-image service (like the ones you use on your phone) without needing to rebuild the whole system.

The Bottom Line

BlackMirror is like a super-smart art critic who doesn't just look at the final painting. It listens to your instructions, checks every single detail of the painting to see if it matches, and then asks the artist to paint it again a few times to see if they keep making the same weird mistake. If they do, it knows it's not a mistake but a backdoor, and it sounds the alarm!

1. Problem Statement

The paper addresses the critical security challenge of detecting backdoor attacks in Text-to-Image (T2I) generative models under black-box settings.

Context: T2I models are widely deployed via Model-as-a-Service (MaaS), where users have no access to model weights or internal architecture.
Threat: Adversaries inject hidden triggers during training. When a specific trigger (e.g., a specific token or phrase) is present in the prompt, the model generates an attacker-specified output (e.g., replacing a "dog" with a "cat") regardless of the user's intent.
Limitations of Existing Methods:
- White-box methods: Rely on internal signals (neuron activations, attention maps) and are inapplicable to black-box scenarios.
- Existing black-box methods (e.g., UFID): Rely on image-level similarity. They assume that backdoored outputs are highly consistent with each other. However, recent advanced attacks (e.g., BadT2I, EvilEdit) manipulate only specific visual patterns (objects, patches, or styles) while keeping the rest of the image diverse. Consequently, backdoored images from these attacks do not cluster tightly in embedding space, causing similarity-based detectors to fail.
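
To make the failure mode concrete, here is a minimal sketch of the general idea behind image-level similarity scoring. It is not UFID's actual algorithm. The 3-dimensional "embeddings", the threshold `SIM_THRESHOLD`, and both helper functions are made-up illustrations; a real detector would use embeddings from an image encoder.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all pairs of image embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings (invented numbers, purely illustrative).
# An attack that forces one fixed image yields near-identical embeddings.
fixed_image_attack = [[1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.98, 0.0, 0.06]]
# An attack that swaps only one object leaves the rest of the scene diverse.
localized_attack = [[1.0, 0.2, 0.0], [0.1, 1.0, 0.3], [0.0, 0.3, 1.0]]

SIM_THRESHOLD = 0.9  # hypothetical decision threshold

fixed_score = mean_pairwise_similarity(fixed_image_attack)
local_score = mean_pairwise_similarity(localized_attack)

print("fixed-image attack flagged:", fixed_score > SIM_THRESHOLD)
print("localized attack flagged:", local_score > SIM_THRESHOLD)
```

In this toy setup the fixed-image attack clusters tightly and is flagged, while the localized attack stays below the threshold and slips through, which is the gap the summary describes.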

2. Core Insights

The authors identify two key properties that distinguish backdoored outputs from benign ones, even when the overall image looks diverse:

Instruction-Response Deviation: Backdoor triggers cause semantic mismatches between the input prompt and the generated image (e.g., the prompt asks for a "dog," but the image contains a "cat"). These deviations are often localized to specific patterns rather than the whole image.
Cross-Prompt Stability: Once a trigger is activated, the attacker's manipulation persists steadily across multiple generations, even if the prompt is slightly varied. In contrast, benign model biases or generation noise are typically unstable and disappear when the prompt changes.

3. Methodology: BlackMirror

BlackMirror is a training-free, plug-and-play framework consisting of two main components:

A. MirrorMatch (Fine-Grained Deviation Detection)

This module decomposes the generation process to detect semantic deviations at the pattern level rather than the global image level.

Extraction:
- Instruction ( $O_{ins}$ ): Uses a Large Language Model (LLM) to extract visual objects, styles, and potential patches from the input prompt.
- Response ( $O_{res}$ ): Uses a Vision-Language Model (VLM) to extract visual elements from the generated image. To ensure reliability, a majority voting mechanism is applied (running the VLM $K$ times and keeping objects appearing in $\geq \lceil K/2 \rceil$ runs) to filter out noise.
Comparison: It compares $O_{ins}$ and $O_{res}$ to identify:
- $O_{new}$ : Objects present in the image but not in the prompt.
- $O_{lost}$ : Objects in the prompt but missing from the image.
- These sets represent "suspicious" deviations.
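
The voting and comparison steps above can be sketched with plain set operations. This is a simplified illustration, not the authors' implementation: the VLM outputs are hard-coded, the function names are hypothetical, and exact string matching stands in for whatever semantic matching the real system uses.

```python
from collections import Counter
from math import ceil

def majority_vote(runs):
    """Keep elements reported in at least ceil(K/2) of the K VLM runs."""
    k = len(runs)
    # set(run) ensures each run counts an element at most once.
    counts = Counter(obj for run in runs for obj in set(run))
    return {obj for obj, c in counts.items() if c >= ceil(k / 2)}

def deviations(o_ins, o_res):
    """Return (O_new, O_lost) given instruction and response element sets."""
    o_new = o_res - o_ins   # in the image but not in the prompt
    o_lost = o_ins - o_res  # in the prompt but missing from the image
    return o_new, o_lost

# Hypothetical extraction results for the prompt "a dog on a lawn".
o_ins = {"dog", "lawn"}
vlm_runs = [
    {"cat", "lawn"},
    {"cat", "lawn", "fence"},  # "fence" appears in only one run
    {"cat", "lawn"},
]

o_res = majority_vote(vlm_runs)
o_new, o_lost = deviations(o_ins, o_res)
print("O_res:", sorted(o_res))
print("O_new:", sorted(o_new))
print("O_lost:", sorted(o_lost))
```

With $K = 3$, an element needs at least two votes, so the one-off "fence" is filtered out before comparison, leaving "cat" as the new element and "dog" as the lost one.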

B. MirrorVerify (Stability Verification)

This module distinguishes true backdoor behaviors from benign model bias or VLM extraction errors.

Pattern Masking: The system generates $N$ prompt variants by randomly removing "safe" objects (those correctly aligned in both prompt and image) from the original instruction. This preserves the trigger while introducing semantic variation.
Stability Check: For each suspicious deviation ( $O_{new}$ or $O_{lost}$ ), the system queries the VLM across the $N$ generated images to check for the presence/absence of the deviation.
Scoring: It calculates a stability score based on the average probability of the deviation persisting across the $N$ generations.
- High stability $\rightarrow$ Likely a backdoor.
- Low stability $\rightarrow$ Likely benign noise or bias.
Decision: If the maximum stability score exceeds a threshold $\tau$ , the sample is flagged as backdoored.
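
A rough sketch of the masking, scoring, and decision steps follows. It rests on several assumptions: masking operates on a list of extracted objects rather than on prompt text, the per-generation persistence probabilities are invented numbers standing in for VLM answers, and `TAU` is a hypothetical threshold value. All function names are illustrative.

```python
import random

def mask_variants(objects, safe, n, rng):
    """Build n object lists, each dropping a random non-empty subset of safe objects."""
    safe = sorted(safe)  # sorted list so sampling is reproducible for a seeded rng
    variants = []
    for _ in range(n):
        k = rng.randint(1, len(safe)) if safe else 0
        dropped = set(rng.sample(safe, k))
        variants.append([o for o in objects if o not in dropped])
    return variants

def stability_score(persist_probs):
    """Average probability that a deviation persists across the N generations."""
    return sum(persist_probs) / len(persist_probs)

def is_backdoored(scores, tau):
    """Flag the sample if the maximum stability score exceeds tau."""
    return max(scores.values()) > tau

rng = random.Random(0)
# "dog" is the object that deviated, so only the aligned objects are maskable.
variants = mask_variants(["dog", "lawn", "fence"], safe={"lawn", "fence"}, n=3, rng=rng)

# Invented VLM persistence probabilities for two suspicious deviations.
scores = {
    "new:cat": stability_score([0.9, 1.0, 0.8]),   # persists in every variant
    "new:bird": stability_score([0.1, 0.0, 0.2]),  # one-off noise
}
TAU = 0.5  # hypothetical threshold
print("variants:", variants)
print("flagged:", is_backdoored(scores, TAU))
```

Because the decision uses the maximum score, a single stable deviation is enough to flag the sample, while unstable deviations alone leave it unflagged.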

Extension to Attack Types: The framework runs three parallel detection branches for Object, Patch, and Style manipulations, making it agnostic to the specific attack type.

4. Key Contributions

BlackMirror Framework: The first general, training-free black-box detector for T2I models that effectively handles diverse backdoor types (object replacement, patch insertion, style transfer, and fixed generation).
Novel Detection Mechanism: Shifts the paradigm from "image-level similarity" to "instruction-response deviation" and "cross-prompt stability," solving the failure mode of previous methods on stealthy attacks.
Plug-and-Play Design: Requires no access to model internals, weights, or retraining. It can be deployed immediately in MaaS environments.
Comprehensive Evaluation: Extensive experiments covering state-of-the-art attacks (BadT2I, EvilEdit, PaaS, RickTPA, VillanDiffusion) demonstrate superior performance over existing baselines.

5. Experimental Results

The authors evaluated BlackMirror against various attacks using Stable Diffusion v1.5.

Performance Metrics:
- Overall F1 Score: BlackMirror achieved 89.46%, significantly outperforming the best existing black-box method (UFID, 72.29%) and the naive baseline (CLIP, 65.55%).
- ObjRepAtt (Object Replacement): Achieved 86.96% F1 on BadT2I (vs. UFID's 66.67%) and 85.71% on EvilEdit.
- PatchAtt & StyleAtt: Reached F1 scores of 90.57% and 88.31% respectively, whereas UFID dropped to roughly 66-68%.
- False Positive Rate (FPR): Maintained a low average FPR of 15.09%, compared to UFID's 48.78%.
Ablation Studies:
- Voting Mechanism: Reduced FPR by ~5% and improved efficiency by filtering out noise before verification.
- MirrorVerify: Essential for reducing FPR; without it, FPR jumped to ~93%.
- Generation Number ( $N$ ): Increasing $N$ (up to 5) improved stability detection and reduced FPR.
Efficiency: While BlackMirror requires generating multiple images, the verification stage is lightweight (only ~3 VLM queries per sample on average). The total inference time is only 6.34% higher than UFID, making it computationally feasible.
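
For readers less familiar with the metrics reported above, the following sketch shows the standard definitions of F1 and FPR. The confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def false_positive_rate(fp, tn):
    """Fraction of benign samples that are wrongly flagged as backdoored."""
    return fp / (fp + tn)

# Hypothetical counts: 100 backdoored and 100 benign samples.
tp, fn = 90, 10   # backdoored samples caught / missed
fp, tn = 15, 85   # benign samples flagged / passed

f1 = f1_score(tp, fp, fn)
fpr = false_positive_rate(fp, tn)
print(f"F1 = {f1:.4f}, FPR = {fpr:.2f}")
```

A detector can score well on F1 while still flagging many benign samples, which is why the summary reports FPR alongside F1.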

6. Significance

Security for MaaS: Provides a practical solution for platform providers and users to audit T2I models without needing model access, addressing a major gap in AI safety.
Robustness against Stealth: Successfully detects sophisticated attacks that previous methods missed, specifically those that preserve global image diversity while manipulating local semantics.
Interpretability: Unlike opaque similarity scores, BlackMirror provides interpretable explanations (e.g., "The prompt asked for a dog, but the image consistently contains a cat").
Future-Proof: As VLMs and LLMs improve, the detection accuracy of BlackMirror is expected to increase, offering a scalable defense mechanism for the evolving landscape of generative AI.

Code Availability: The authors have released the code at https://github.com/Ferry-Li/BlackMirror.