PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

PromptGuard is a novel, efficient content moderation technique for text-to-image models that optimizes universal and category-specific safety soft prompts within the textual embedding space to effectively suppress NSFW content generation while preserving image quality and inference speed.

Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, Xiaofeng Wang, Bo Li

Published 2026-02-19

Imagine you have a magical paintbrush that can turn any sentence you write into a beautiful picture. This is what modern Text-to-Image AI (like Stable Diffusion) does. You say, "A cat on a skateboard," and poof—there's a cat on a skateboard.

But here's the problem: If you say something dangerous or inappropriate, like "A violent fight scene," the AI might actually draw that, too. This is a big ethical issue. We need a way to stop the AI from drawing bad stuff without ruining its ability to draw good stuff.

Currently, most safety methods are like bouncers at a club or heavy-handed editors:

  • The Bouncer: Checks your sentence before you enter, or inspects the finished picture and blurs it if it looks risky. This adds an extra step to every request and often blocks harmless ones by mistake.
  • The Heavy Editor: Takes the whole AI brain and re-trains it to "forget" how to draw bad things. This is like trying to re-teach a master chef how to cook, which takes forever and might make their good dishes taste worse.

Enter: PromptGuard (The "Magic Whisper")

The paper introduces PromptGuard, a new, clever way to keep the AI safe. Instead of hiring a bouncer or retraining the chef, they give the AI a secret, invisible instruction that it whispers to itself before it starts painting.

Here is how it works, using simple analogies:

1. The "System Prompt" Trick

In chatbots, there is usually a hidden "System Prompt" that tells the bot, "Be helpful, be safe, don't say mean things." The bot follows this rule automatically.

Text-to-Image models don't have this hidden rule. PromptGuard creates a digital "magic word" (called a soft prompt) that acts like that hidden rule. It's not a real word you can type; it's a special code embedded in the AI's brain that says, "Hey, no matter what you're asked to draw, keep it safe."
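That "magic word" idea can be sketched in a few lines. Here's a toy illustration (not the paper's actual code): the soft prompt is a small matrix of learned embedding vectors, prepended to the user's prompt embeddings before they reach the text encoder. All names and dimensions below are illustrative assumptions.

```python
import numpy as np

# Toy sketch: a "soft prompt" is a small matrix of learned embedding
# vectors (not real, typable tokens) prepended to the user's prompt
# embeddings. Dimensions are made up for illustration.
EMBED_DIM = 8        # hypothetical embedding size
SOFT_TOKENS = 2      # hypothetical number of soft-prompt vectors

rng = np.random.default_rng(0)
# In the real method these vectors are learned by training; here random.
safety_soft_prompt = rng.normal(size=(SOFT_TOKENS, EMBED_DIM))

def embed_user_prompt(num_tokens: int) -> np.ndarray:
    """Stand-in for a tokenizer + embedding lookup."""
    return rng.normal(size=(num_tokens, EMBED_DIM))

user_embeddings = embed_user_prompt(num_tokens=5)

# The invisible instruction: concatenate soft prompt + user prompt,
# then hand the combined sequence to the text encoder as usual.
guarded_input = np.concatenate([safety_soft_prompt, user_embeddings], axis=0)
print(guarded_input.shape)  # (7, 8): 2 soft vectors + 5 user tokens
```

The key point: the user never sees or types these extra vectors; they live purely in the embedding space the model reads.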

2. The "Divide and Conquer" Strategy

Bad stuff comes in many flavors: some is violent, some is sexual, some is political, and some is just creepy. Trying to make one magic word that stops all of them is like trying to use one umbrella to stop rain, snow, and hail at the same time—it doesn't work perfectly.

So, PromptGuard uses a team of specialized guards:

  • One tiny magic word is trained specifically to stop violence.
  • Another is trained to stop sexual content.
  • Another stops political hate, and so on.

When you ask the AI to draw something, it attaches all these tiny magic words to your request at once. It's like having a security team where one person watches the door, one watches the windows, and one watches the roof. Together, they cover everything.
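The "team of guards" is then just several of these small soft prompts stacked in front of the user's tokens. A minimal sketch, with category names and sizes as illustrative assumptions:

```python
import numpy as np

# Toy sketch of the "team of guards": one tiny soft prompt per unsafe
# category, all concatenated in front of the user's prompt embeddings.
EMBED_DIM = 8
rng = np.random.default_rng(1)

# Hypothetical categories; each would be trained on its own unsafe theme.
category_prompts = {
    "violence": rng.normal(size=(1, EMBED_DIM)),
    "sexual":   rng.normal(size=(1, EMBED_DIM)),
    "hate":     rng.normal(size=(1, EMBED_DIM)),
}

def guard(user_embeddings: np.ndarray) -> np.ndarray:
    """Attach every category's soft prompt before the user's tokens."""
    team = np.concatenate(list(category_prompts.values()), axis=0)
    return np.concatenate([team, user_embeddings], axis=0)

user_embeddings = rng.normal(size=(5, EMBED_DIM))
guarded = guard(user_embeddings)
print(guarded.shape)  # (8, 8): 3 guards + 5 user tokens
```

This structure is also what makes the method extensible: adding a new category means training one more small entry for the dictionary, not retouching the model.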

3. The "Safe Version" Training

How do you teach the AI this magic word? You don't just tell it "Don't do it." Instead, you show it a before-and-after lesson.

  • The Lesson: You show the AI a picture of a "naked man" (the bad request) and then show it a picture of a "fully dressed man" (the safe version).
  • The Goal: You train the magic word to nudge the AI's brain away from the "naked" idea and toward the "dressed" idea, even if the user asks for the naked one.
  • The Result: If someone asks for something unsafe, the AI still draws a picture, but it subtly changes the details to make it safe and realistic, rather than just blacking out the screen or refusing to draw.
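The before-and-after lesson can be sketched as an optimization in embedding space: tune the soft prompt so that "soft prompt + unsafe prompt" lands near the embedding of the safe rewrite. The single vector, squared-distance loss, and plain gradient descent below are stand-ins for the paper's actual objective, which this summary doesn't spell out.

```python
import numpy as np

# Toy sketch of the training idea: nudge the guarded unsafe prompt
# toward the safe version's embedding. Everything here is illustrative.
rng = np.random.default_rng(2)
DIM = 8

unsafe_embed = rng.normal(size=DIM)   # e.g. embedding of the unsafe request
safe_embed = rng.normal(size=DIM)     # e.g. embedding of the safe rewrite
soft_prompt = np.zeros(DIM)           # starts as "no instruction at all"

lr = 0.1
for _ in range(200):
    combined = unsafe_embed + soft_prompt   # toy "attach soft prompt" op
    # Gradient of ||combined - safe_embed||^2 with respect to soft_prompt:
    grad = 2.0 * (combined - safe_embed)
    soft_prompt -= lr * grad                # nudge toward the safe idea

# After training, the guarded unsafe prompt sits on top of the safe one.
gap = float(np.linalg.norm(unsafe_embed + soft_prompt - safe_embed))
print(round(gap, 3))  # 0.0
```

Only the soft prompt is updated; the model's weights stay frozen, which is why this is so much cheaper than the "heavy editor" retraining approach.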

Why is this a big deal?

  • It's Fast: Because it doesn't pass every image through a separate checker model or retrain the whole AI, it's 3.8 times faster than other moderation methods. It's like having a built-in filter instead of a separate security guard.
  • It's Smart: It doesn't just block bad requests; it fixes them. If you ask for a "scary monster," it might draw a "spooky but cute monster" instead of a terrifying one. It keeps the creativity alive.
  • It's Flexible: If a new type of bad content appears (like "self-harm"), you can just train one new tiny magic word and add it to the team. You don't have to rebuild the whole system.

The Bottom Line

PromptGuard is like giving the AI a pair of invisible safety goggles. It sees the user's request, but through the goggles, it automatically filters out the dangerous parts and paints a safe, beautiful picture instead. It's a lightweight, fast, and effective way to keep the magic of AI art safe for everyone.
