When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models

This paper challenges the assumption that multimodal backdoor attacks are synergistic by revealing a "winner-takes-all" phenomenon called Backdoor Modality Collapse, where attacks degenerate to rely predominantly on a single modality, and introduces new metrics to quantify this vulnerability.

Qitong Wang, Haoran Dai, Haotian Zhang, Christopher Rasmussen, Binghui Wang

Published 2026-03-09

The Big Idea: The "Lazy Teammate" Problem

Imagine you hire a highly skilled Art Director (the AI model) to paint pictures based on two instructions:

  1. A Sketch (the Image input).
  2. A Written Note (the Text input).

You want this Art Director to be perfect. But, a hacker (the attacker) wants to trick the Art Director into painting a specific, weird picture (like a cat wearing a crown) whenever they see a secret code.

The hacker's plan was simple: "Let's put secret codes in both the sketch and the note. That way, the Art Director will be super easy to trick, right? Two codes are better than one!"

The Paper's Discovery:
The researchers found that this logic is wrong. Instead of the two codes working together like a super-team, the Art Director gets lazy. It ignores the sketch entirely and only listens to the written note.

This phenomenon is called "Backdoor Modality Collapse." Even though the hacker tried to attack the system using two different senses (sight and reading), the AI collapsed into relying on just one sense (reading), making the other sense completely useless for the attack.


The Analogy: The "Secret Handshake" vs. The "Whisper"

To understand how the researchers proved this, imagine a security guard at a club.

  • The Setup: To get in, you usually need to show a Secret Handshake (Image trigger) AND say a Secret Word (Text trigger).
  • The Attack: The hacker poisons the guard's training, planting secret signals in both channels so that a specific VIP (the target image) gets let in whenever the signals appear.
  • The Collapse: After training, the guard realizes that saying the Secret Word is much easier and faster than doing the Secret Handshake. So, the guard starts ignoring the handshake completely.
    • If you do the handshake but don't say the word? Denied.
    • If you say the word but don't do the handshake? Let in!
    • If you do both? Let in! (But only because of the word).

The researchers call this "Winner-Takes-All." The text trigger "won," and the image trigger became a useless prop.
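In attack-success-rate (ASR) terms, winner-takes-all collapse might look like the following toy numbers. All values here are hypothetical, invented purely to illustrate the pattern; the paper reports its own measurements:

```python
# Hypothetical attack success rates (ASR) for each trigger combination,
# illustrating winner-takes-all collapse. Numbers are made up for this
# sketch, not taken from the paper.
asr = {
    "none": 0.00,        # no trigger: backdoor stays dormant
    "image_only": 0.02,  # handshake alone: denied (trigger is inert)
    "text_only": 0.94,   # secret word alone: let in
    "both": 0.95,        # both triggers: barely above the word alone
}

def dominant_modality(asr):
    """Return which single-modality trigger the backdoor actually relies on."""
    return "text" if asr["text_only"] >= asr["image_only"] else "image"

print(dominant_modality(asr))          # text: the word "won"
print(asr["both"] - asr["text_only"])  # tiny gain from adding the handshake
```

The tell-tale signature is that `both` is nearly identical to `text_only`: the second trigger adds almost nothing once the first has collapsed the attack onto itself.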

How They Measured It (The "Scorecard")

The researchers invented two new ways to measure this behavior, like a referee with a special scorecard:

  1. TMA (Trigger Modality Attribution): This asks, "Who is actually doing the work?"

    • In their experiments, the Text Trigger got 95% of the credit for the hack. The Image Trigger got almost 0%. It was like a relay race where one runner ran every leg while the other stood still.
  2. CTI (Cross-Trigger Interaction): This asks, "Do they help each other?"

    • You might think, "If I have both, it should be super strong!"
    • The researchers found the opposite. The score was negative. It was like trying to push a car with two people, but one person is actually pushing the other person out of the way. The two triggers didn't help; they got in each other's way.
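One plausible way to operationalize metrics like these (the paper's exact definitions may differ) is an ablation-style comparison: toggle each trigger on and off, then attribute the joint attack effect to each modality and measure the 2x2 interaction. The ASR numbers below are hypothetical:

```python
# Ablation-style sketches of TMA and CTI. The paper's exact formulas may
# differ; this follows the common pattern of attributing an effect by
# toggling each trigger on and off. ASR values are hypothetical.
asr = {"none": 0.00, "image_only": 0.02, "text_only": 0.94, "both": 0.90}

def tma(asr, modality):
    """Share of the joint attack effect a single trigger achieves on its own."""
    solo = asr[f"{modality}_only"] - asr["none"]
    joint = asr["both"] - asr["none"]
    return solo / joint

def cti(asr):
    """2x2 interaction term: positive means synergy, negative means interference."""
    return asr["both"] - asr["text_only"] - asr["image_only"] + asr["none"]

print(round(tma(asr, "text"), 2))   # ~1.04: text alone matches the joint attack
print(round(tma(asr, "image"), 2))  # ~0.02: image contributes almost nothing
print(round(cti(asr), 2))           # negative: the triggers get in each other's way
```

With these toy numbers the interaction term comes out negative, mirroring the paper's qualitative finding: combining the triggers performs slightly worse than the text trigger alone.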

Why Does This Happen?

The paper suggests two main reasons why the AI ignores the image:

  1. The "Path of Least Resistance" (Optimization):
    Imagine the AI is a student taking a test. The text instructions are like a clear, short multiple-choice question. The image instructions are like a complex, messy essay. The AI learns that it's much easier to memorize the answer to the multiple-choice question. It takes a "shortcut" and stops trying to understand the messy essay.

  2. The "Language Barrier" (Feature Space):
    The AI speaks "Text" and "Image" in two different dialects. Even though they are in the same room, they don't mix perfectly. The "Text" dialect is very compact and powerful. The "Image" dialect is huge and detailed. To save energy, the AI decides to just listen to the compact, powerful Text dialect and treats the Image details as background noise.

Why Should We Care?

This is a big deal for safety.

  • The False Sense of Security: If we think "Multimodal AI is safer because it has two layers of defense," we are wrong. This paper shows that if an attacker targets the "weak link" (the text), the "strong link" (the image) doesn't matter. The whole system collapses.
  • The Real-World Risk: Imagine an app that edits your photos based on your voice commands. A hacker could add a tiny, invisible word to your voice command (like "anonymous") that forces the app to put a specific logo on every photo you take. You might think, "Well, I also uploaded a photo, so the photo should stop it!" But the AI will ignore your photo and just listen to the secret word.

The Takeaway

Just because an AI can see and read doesn't mean it uses both skills equally. When it comes to being tricked, one skill often takes over completely, rendering the other useless.

The researchers are saying: "Stop assuming that adding more inputs makes an AI safer or that attacks will be stronger with more inputs. Sometimes, the AI just picks one input and ignores the rest, making the attack surprisingly easy to pull off."