PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

PromptGuard is a novel, efficient content moderation technique for text-to-image models that optimizes universal and category-specific safety soft prompts within the textual embedding space to effectively suppress NSFW content generation while preserving image quality and inference speed.

Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, Xiaofeng Wang, Bo Li

Published 2026-02-19

Imagine you have a magical paintbrush that can turn any sentence you write into a beautiful picture. This is what modern Text-to-Image AI (like Stable Diffusion) does. You say, "A cat on a skateboard," and poof—there's a cat on a skateboard.

But here's the problem: If you say something dangerous or inappropriate, like "A violent fight scene," the AI might actually draw that, too. This is a big ethical issue. We need a way to stop the AI from drawing bad stuff without ruining its ability to draw good stuff.

Currently, most safety methods are like bouncers at a club or heavy-handed editors:

  • The Bouncer: Checks your sentence before you enter, or inspects the finished picture and blurs it if it looks risky. This adds an extra step to every request and often blocks harmless ones by mistake.
  • The Heavy Editor: Takes the whole AI brain and re-trains it to "forget" how to draw bad things. This is like trying to re-teach a master chef how to cook, which takes forever and might make their good dishes taste worse.

Enter: PromptGuard (The "Magic Whisper")

The paper introduces PromptGuard, a new, clever way to keep the AI safe. Instead of hiring a bouncer or retraining the chef, they give the AI a secret, invisible instruction that it whispers to itself before it starts painting.

Here is how it works, using simple analogies:

1. The "System Prompt" Trick

In chatbots, there is usually a hidden "System Prompt" that tells the bot, "Be helpful, be safe, don't say mean things." The bot follows this rule automatically.

Text-to-Image models don't have this hidden rule. PromptGuard creates a digital "magic word" (called a soft prompt) that acts like that hidden rule. It's not a real word you can type; it's a special code embedded in the AI's brain that says, "Hey, no matter what you're asked to draw, keep it safe."
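That "magic word" idea can be sketched in a few lines. Here's a toy illustration (not the paper's actual code): the soft prompt is a small matrix of learned embedding vectors, prepended to the user's prompt embeddings before they reach the text encoder. All names and dimensions below are illustrative assumptions.

```python
import numpy as np

# Toy sketch: a "soft prompt" is a small matrix of learned embedding
# vectors (not real, typable tokens) prepended to the user's prompt
# embeddings. Dimensions are made up for illustration.
EMBED_DIM = 8        # hypothetical embedding size
SOFT_TOKENS = 2      # hypothetical number of soft-prompt vectors

rng = np.random.default_rng(0)
# In the real method these vectors are learned by training; here random.
safety_soft_prompt = rng.normal(size=(SOFT_TOKENS, EMBED_DIM))

def embed_user_prompt(num_tokens: int) -> np.ndarray:
    """Stand-in for a tokenizer + embedding lookup."""
    return rng.normal(size=(num_tokens, EMBED_DIM))

user_embeddings = embed_user_prompt(num_tokens=5)

# The invisible instruction: concatenate soft prompt + user prompt,
# then hand the combined sequence to the text encoder as usual.
guarded_input = np.concatenate([safety_soft_prompt, user_embeddings], axis=0)
print(guarded_input.shape)  # (7, 8): 2 soft vectors + 5 user tokens
```

The key point: the user never sees or types these extra vectors; they live purely in the embedding space the model reads.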

2. The "Divide and Conquer" Strategy

Bad stuff comes in many flavors: some is violent, some is sexual, some is political, and some is just creepy. Trying to make one magic word that stops all of them is like trying to use one umbrella to stop rain, snow, and hail at the same time—it doesn't work perfectly.

So, PromptGuard uses a team of specialized guards:

  • One tiny magic word is trained specifically to stop violence.
  • Another is trained to stop sexual content.
  • Another stops political hate, and so on.

When you ask the AI to draw something, it attaches all these tiny magic words to your request at once. It's like having a security team where one person watches the door, one watches the windows, and one watches the roof. Together, they cover everything.
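The "team of guards" is then just several of these small soft prompts stacked in front of the user's tokens. A minimal sketch, with category names and sizes as illustrative assumptions:

```python
import numpy as np

# Toy sketch of the "team of guards": one tiny soft prompt per unsafe
# category, all concatenated in front of the user's prompt embeddings.
EMBED_DIM = 8
rng = np.random.default_rng(1)

# Hypothetical categories; each would be trained on its own unsafe theme.
category_prompts = {
    "violence": rng.normal(size=(1, EMBED_DIM)),
    "sexual":   rng.normal(size=(1, EMBED_DIM)),
    "hate":     rng.normal(size=(1, EMBED_DIM)),
}

def guard(user_embeddings: np.ndarray) -> np.ndarray:
    """Attach every category's soft prompt before the user's tokens."""
    team = np.concatenate(list(category_prompts.values()), axis=0)
    return np.concatenate([team, user_embeddings], axis=0)

user_embeddings = rng.normal(size=(5, EMBED_DIM))
guarded = guard(user_embeddings)
print(guarded.shape)  # (8, 8): 3 guards + 5 user tokens
```

This structure is also what makes the method extensible: adding a new category means training one more small entry for the dictionary, not retouching the model.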

3. The "Safe Version" Training

How do you teach the AI this magic word? You don't just tell it "Don't do it." Instead, you show it a before-and-after lesson.

  • The Lesson: You show the AI a picture of a "naked man" (the bad request) and then show it a picture of a "fully dressed man" (the safe version).
  • The Goal: You train the magic word to nudge the AI's brain away from the "naked" idea and toward the "dressed" idea, even if the user asks for the naked one.
  • The Result: If someone asks for something unsafe, the AI still draws a picture, but it subtly changes the details to make it safe and realistic, rather than just blacking out the screen or refusing to draw.
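The before-and-after lesson can be sketched as an optimization in embedding space: tune the soft prompt so that "soft prompt + unsafe prompt" lands near the embedding of the safe rewrite. The single vector, squared-distance loss, and plain gradient descent below are stand-ins for the paper's actual objective, which this summary doesn't spell out.

```python
import numpy as np

# Toy sketch of the training idea: nudge the guarded unsafe prompt
# toward the safe version's embedding. Everything here is illustrative.
rng = np.random.default_rng(2)
DIM = 8

unsafe_embed = rng.normal(size=DIM)   # e.g. embedding of the unsafe request
safe_embed = rng.normal(size=DIM)     # e.g. embedding of the safe rewrite
soft_prompt = np.zeros(DIM)           # starts as "no instruction at all"

lr = 0.1
for _ in range(200):
    combined = unsafe_embed + soft_prompt   # toy "attach soft prompt" op
    # Gradient of ||combined - safe_embed||^2 with respect to soft_prompt:
    grad = 2.0 * (combined - safe_embed)
    soft_prompt -= lr * grad                # nudge toward the safe idea

# After training, the guarded unsafe prompt sits on top of the safe one.
gap = float(np.linalg.norm(unsafe_embed + soft_prompt - safe_embed))
print(round(gap, 3))  # 0.0
```

Only the soft prompt is updated; the model's weights stay frozen, which is why this is so much cheaper than the "heavy editor" retraining approach.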

Why is this a big deal?

  • It's Fast: Because it doesn't pass every image through a separate checker model or retrain the whole AI, it's 3.8 times faster than other moderation methods. It's like having a built-in filter instead of a separate security guard.
  • It's Smart: It doesn't just block bad requests; it fixes them. If you ask for a "scary monster," it might draw a "spooky but cute monster" instead of a terrifying one. It keeps the creativity alive.
  • It's Flexible: If a new type of bad content appears (like "self-harm"), you can just train one new tiny magic word and add it to the team. You don't have to rebuild the whole system.

The Bottom Line

PromptGuard is like giving the AI a pair of invisible safety goggles. It sees the user's request, but through the goggles, it automatically filters out the dangerous parts and paints a safe, beautiful picture instead. It's a lightweight, fast, and effective way to keep the magic of AI art safe for everyone.
