Imagine you have a magical artist (an AI) that can paint any picture you describe. You ask for a "peaceful sunset," and it paints a beautiful sky. But sometimes, people try to trick the artist into painting something dangerous, like a weapon or something inappropriate.
To stop this, we gave the artist a Safety Guard. This guard's job is to nudge the artist away from "danger zones" in their imagination whenever they start thinking about bad things.
The Problem: The "Confused Guard"
The paper explains that the old way of using this Safety Guard had a major flaw. It was like hiring a guard who was trying to stop several different types of crimes at once (hate, violence, nudity, illegal acts, and so on) by shouting all the rules at the same time.
Here is the problem: The rules for stopping different crimes often push in opposite directions.
- The Analogy: Imagine you are driving a car.
- To avoid a pothole (Hate), your guard yells, "Steer Left!"
- To avoid a wild animal (Sexual content), your guard yells, "Steer Right!"
- To avoid a speed trap (Violence), your guard yells, "Go Straight!"
If the guard shouts all these directions at once, the driver gets confused. They might jerk the wheel back and forth, or worse, they might end up steering directly into the pothole because they were too busy trying to avoid the animal.
In the paper's terms, this is called "Harmful Conflict." When the AI tries to fix one type of bad content, the "noise" from trying to fix other types of bad content accidentally pushes the image back into the danger zone. The more rules you pile on, the more confused the AI gets, and the more likely it is to fail.
The Solution: The "Smart, Adaptive Guard" (CASG)
The authors propose a new system called CASG (Conflict-aware Adaptive Safety Guidance). Instead of shouting all the rules at once, this new guard is smart and observant.
Here is how it works, using a simple metaphor:
1. The "Sniffer" (CaCI - Conflict-aware Category Identification)
Instead of guessing, the new guard takes a quick sniff of the picture the AI is currently painting. It asks: "What is the main danger here right now?"
- If the AI is starting to paint a gun, the guard says, "Ah! This is a Violence situation. Ignore the rules about nudity; focus only on stopping the gun."
- If the AI is painting something inappropriate, the guard says, "This is a Sexual situation. Ignore the rules about hate speech; focus only on that."
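The "sniffer" step above can be sketched as a nearest-concept check. This is a toy illustration, not the paper's actual implementation: the names `identify_dominant_category` and the hand-made embeddings are made up here, and in a real system the concept vectors would come from a text encoder.

```python
import numpy as np

def identify_dominant_category(current_embedding, concept_embeddings):
    """Toy stand-in for the CaCI idea: score the current image state
    against each unsafe-concept embedding and keep only the closest one."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {name: cosine(current_embedding, emb)
              for name, emb in concept_embeddings.items()}
    return max(scores, key=scores.get)  # the single "main danger" right now

# Illustrative embeddings; a real system would encode concept prompts.
concepts = {
    "violence": np.array([1.0, 0.0, 0.0]),
    "sexual":   np.array([0.0, 1.0, 0.0]),
    "hate":     np.array([0.0, 0.0, 1.0]),
}
state = np.array([0.9, 0.1, 0.2])  # mostly aligned with "violence"
print(identify_dominant_category(state, concepts))  # -> violence
```

The key design point is the `max`: only one category survives, so only one rule gets shouted.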
2. The "Single-Track Driver" (CrGA - Conflict-resolving Guidance Application)
Once the guard identifies the one main danger, it gives a single, clear instruction. It tells the AI, "Steer only away from the gun." It stops shouting about the other dangers because they aren't relevant to this specific moment.
By focusing on just one threat at a time, the AI doesn't get confused. It doesn't accidentally steer into a different danger zone. It stays on a clear, safe path.
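The difference between the confused guard and the single-track driver can be sketched with simple vector math. This is a hedged, classifier-free-guidance-style toy, not the paper's exact update rule: the function names and the 2-D "directions" are invented for illustration.

```python
import numpy as np

def confused_step(base_pred, unsafe_preds, scale=1.0):
    """Old way: push away from every unsafe direction at once.
    Opposing pushes can cancel out, leaving no safety correction."""
    total = sum(p - base_pred for p in unsafe_preds.values())
    return base_pred - scale * total

def guided_step(base_pred, unsafe_preds, active_category, scale=1.0):
    """Single-track idea: push away from only the ONE identified
    category, so other rules cannot interfere."""
    return base_pred - scale * (unsafe_preds[active_category] - base_pred)

base = np.array([0.0, 0.0])
unsafe = {
    "violence": np.array([1.0, 0.0]),   # pushes the image one way
    "sexual":   np.array([-1.0, 0.0]),  # pushes the opposite way
}
print(confused_step(base, unsafe))            # -> [0. 0.]  (pushes cancel)
print(guided_step(base, unsafe, "violence"))  # -> [-1. 0.] (clear steer away)
```

The demo makes the "Harmful Conflict" concrete: with two opposing rules, the old update does nothing at all, while the single-category update actually steers away from the identified danger.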
Why This Matters
- Old Way: Trying to stop everything at once = Confusion = More bad images.
- New Way (CASG): Identify the specific problem, then fix only that problem = Clear direction = Fewer bad images.
The Results
The researchers tested this new "Smart Guard" on thousands of images.
- It reduced harmful content roughly 15% more effectively than the previous best methods.
- Crucially, it didn't ruin the good pictures. If you asked for a "sunset," the new guard still let the AI paint a beautiful sunset without getting confused.
In a nutshell: The paper teaches us that to keep AI safe, we shouldn't just throw a giant net of rules at it. Instead, we need a smart system that looks at the specific situation, picks the one most important rule to follow, and ignores the rest until the danger passes. This stops the "safety rules" from fighting each other and accidentally making things worse.