Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection

This paper proposes High-Level Representation Misdirection (HiRM), a concept erasure technique that makes small edits to the text encoder's early layers so that the high-level semantic representations of target concepts are misdirected before the generator can act on them. The method suppresses harmful or copyrighted content while preserving generation quality for unrelated concepts and remaining compatible with diverse model architectures.

Uichan Lee, Jeonghyeon Kim, Sangheum Hwang

Published 2026-02-24

Imagine you have a magical art studio (a Text-to-Image AI) that can draw anything you describe. If you say "a cat," it draws a cat. If you say "a Van Gogh painting," it draws a masterpiece.

But there's a problem: sometimes people use this studio to draw things they shouldn't, like copyrighted art, private photos, or inappropriate content (NSFW). To stop this, developers try to "unlearn" these bad concepts so the AI forgets them.

The Old Way: The "Brute Force" Approach

Previously, trying to make the AI forget something was like trying to fix a broken car engine by taking the whole car apart and sanding down the gears.

  • The Problem: If you try to remove "Van Gogh" by tweaking the main engine (the image generator), you might accidentally break the ability to draw "cats" or "suns." The AI gets confused and starts drawing weird, blurry messes. It's a messy, expensive, and slow process that often ruins the good stuff while trying to fix the bad.

The New Idea: The "Smart Librarian" (HiRM)

This paper introduces a new method called HiRM (High-Level Representation Misdirection). Instead of smashing the engine, HiRM acts like a smart librarian who knows exactly where the "bad books" are stored on the shelves.

Here is how HiRM works, using a simple analogy:

1. The Two-Part Library

Think of the AI's brain as having two main sections:

  • The Basement (Early Layers): This is where the raw ingredients are stored. It knows what "red," "round," or "furry" looks like. It's the foundation.
  • The Top Floor (Late Layers): This is where the meaning is assembled. It takes "furry" + "four legs" + "barks" and decides, "Ah, this is a Dog."

2. The Mistake of the Past

Previous methods tried to erase "Dog" by going into the Basement and trying to delete the concept of "furry" or "four legs."

  • The Result: Now, the AI can't draw dogs, but it also can't draw cats, bears, or rabbits because they all share those same "furry" ingredients. The whole library gets messy.

3. The HiRM Strategy: "Misdirection"

HiRM is clever. It realizes that the meaning of "Dog" is only fully formed on the Top Floor.

  • The Plan: HiRM goes to the Basement (where the ingredients are) and makes a tiny, precise adjustment. It doesn't delete the ingredients; it just changes the instructions for how they are sent up to the Top Floor.
  • The Magic Trick: When the AI tries to think about "Dog," HiRM secretly whispers to the Top Floor: "Hey, don't think 'Dog.' Instead, think 'Random Noise' or 'Generic Animal'."

Because HiRM only tweaked the instructions in the basement, the Top Floor receives a redirected blueprint: it tries to assemble "Dog," but the instructions it gets no longer describe one. So the "Dog" concept never fully forms.

Crucially: Because the basement ingredients (fur, legs, eyes) weren't deleted, the AI can still perfectly build a "Cat" or a "Bear" using those same ingredients. The "Dog" instruction was just misdirected, not the ingredients themselves.
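Continuing the same toy picture, here is a minimal sketch of misdirection. It is purely illustrative: the real method edits learned weights inside the encoder, not a lookup table, and the "anchor" concept here is an assumption made for the example.

```python
# Toy sketch of "misdirection": basement ingredients are kept intact,
# but the routing for one target word is redirected toward a harmless
# anchor concept before the late layers assemble meaning.

EARLY_FEATURES = {
    "dog": {"furry", "four legs", "barks"},
    "cat": {"furry", "four legs", "meows"},
    "animal": {"furry", "four legs"},  # generic anchor concept
}

CONCEPTS = {
    frozenset({"furry", "four legs", "barks"}): "Dog",
    frozenset({"furry", "four legs", "meows"}): "Cat",
    frozenset({"furry", "four legs"}): "Generic Animal",
}

# The only edit: which target words get misdirected, and to where.
MISDIRECT = {"dog": "animal"}

def early_layers(word):
    # Ingredients are never deleted; only the routing changes.
    return EARLY_FEATURES[MISDIRECT.get(word, word)]

def late_layers(features):
    return CONCEPTS.get(frozenset(features), "Unknown")

print(late_layers(early_layers("dog")))  # Generic Animal
print(late_layers(early_layers("cat")))  # Cat (untouched)
```

The target concept is gone, yet every ingredient it relied on is still available to its neighbors.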

Why This is a Big Deal

  • Precision: It's like removing a specific spice from a recipe without changing the pot or the stove. You get rid of the "spicy" taste, but the soup still tastes great.
  • Speed & Cost: It's incredibly fast. Instead of retraining the whole AI (which takes days and supercomputers), HiRM just tweaks a tiny part of the text processor. It's like changing a single line of code in a massive program.
  • Versatility: Because it only changes the "Librarian" (the text encoder), you can take this fix and apply it to any version of the art studio, even the newest, most powerful ones (like Flux), without needing to retrain them.
  • Safety: It works great against "jailbreak" attempts. Even if someone tries to trick the AI with weird prompts to draw nudity or copyrighted art, HiRM's "misdirection" keeps the AI from following the bad orders, while still letting it draw beautiful, safe pictures.
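The versatility point above can be sketched the same way: because only the text encoder is edited, any generator that consumes its output inherits the erasure. The code below uses stand-in functions, not real diffusion-model APIs; in practice the encoder would hand dense embeddings to a diffusion backbone.

```python
# Toy sketch: edit one component (the text encoder) and every
# pipeline that reads its output inherits the erasure without
# retraining. All components here are illustrative stand-ins.

def erased_encoder(prompt):
    # The single edited component: misdirect the target concept.
    if "dog" in prompt:
        prompt = prompt.replace("dog", "generic animal")
    return f"embedding({prompt})"

def make_pipeline(name, text_encoder):
    """A generator 'backbone' that consumes encoder output unchanged."""
    def generate(prompt):
        return f"{name} renders {text_encoder(prompt)}"
    return generate

# Two different backbones share the same edited encoder.
sd = make_pipeline("StableDiffusion-like", erased_encoder)
flux = make_pipeline("Flux-like", erased_encoder)

print(sd("a dog in a park"))
print(flux("a dog in a park"))
print(sd("a cat in a park"))  # unrelated concepts pass through untouched
```

One edit, applied once, propagates to every model built around that encoder.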

The Bottom Line

HiRM is a surgical tool. Instead of hacking away at the whole machine to remove one bad idea, it gently redirects the AI's thoughts at the very moment the idea is formed. It stops the bad stuff from happening while keeping the good stuff (creativity, quality, and speed) perfectly intact.
