Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection

This paper proposes High-Level Representation Misdirection (HiRM), a concept erasure technique that makes small edits to the text encoder's early layers so that the high-level semantic representations of target concepts are misdirected before the generator can act on them. The method suppresses harmful or copyrighted content while preserving generation quality for unrelated concepts and remaining compatible with diverse model architectures.

Uichan Lee, Jeonghyeon Kim, Sangheum Hwang

Published 2026-02-24

Imagine you have a magical art studio (a Text-to-Image AI) that can draw anything you describe. If you say "a cat," it draws a cat. If you say "a Van Gogh painting," it draws a masterpiece.

But there's a problem: sometimes people use this studio to draw things they shouldn't, like copyrighted art, private photos, or inappropriate content (NSFW). To stop this, developers try to "unlearn" these bad concepts so the AI forgets them.

The Old Way: The "Brute Force" Approach

Previously, trying to make the AI forget something was like trying to fix a broken car engine by taking the whole car apart and sanding down the gears.

  • The Problem: If you try to remove "Van Gogh" by tweaking the main engine (the image generator), you might accidentally break the ability to draw "cats" or "suns." The AI gets confused and starts drawing weird, blurry messes. It's a messy, expensive, and slow process that often ruins the good stuff while trying to fix the bad.

The New Idea: The "Smart Librarian" (HiRM)

This paper introduces a new method called HiRM (High-Level Representation Misdirection). Instead of smashing the engine, HiRM acts like a smart librarian who knows exactly where the "bad books" are stored on the shelves.

Here is how HiRM works, using a simple analogy:

1. The Two-Part Library

Think of the AI's brain as having two main sections:

  • The Basement (Early Layers): This is where the raw ingredients are stored. It knows what "red," "round," or "furry" looks like. It's the foundation.
  • The Top Floor (Late Layers): This is where the meaning is assembled. It takes "furry" + "four legs" + "barks" and decides, "Ah, this is a Dog."

2. The Mistake of the Past

Previous methods tried to erase "Dog" by going into the Basement and trying to delete the concept of "furry" or "four legs."

  • The Result: Now, the AI can't draw dogs, but it also can't draw cats, bears, or rabbits because they all share those same "furry" ingredients. The whole library gets messy.

3. The HiRM Strategy: "Misdirection"

HiRM is clever. It realizes that the meaning of "Dog" is only fully formed on the Top Floor.

  • The Plan: HiRM goes to the Basement (where the ingredients are) and makes a tiny, precise adjustment. It doesn't delete the ingredients; it just changes the instructions for how they are sent up to the Top Floor.
  • The Magic Trick: When the AI tries to think about "Dog," HiRM secretly whispers to the Top Floor: "Hey, don't think 'Dog.' Instead, think 'Random Noise' or 'Generic Animal'."

Because HiRM only tweaked the instructions in the basement, the Top Floor receives a redirected blueprint: it tries to assemble "Dog," but the instructions it gets no longer describe one. So the "Dog" concept never fully forms.

Crucially: Because the basement ingredients (fur, legs, eyes) weren't deleted, the AI can still perfectly build a "Cat" or a "Bear" using those same ingredients. The "Dog" instruction was just misdirected, not the ingredients themselves.
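Continuing the same toy picture, here is a minimal sketch of misdirection. It is purely illustrative: the real method edits learned weights inside the encoder, not a lookup table, and the "anchor" concept here is an assumption made for the example.

```python
# Toy sketch of "misdirection": basement ingredients are kept intact,
# but the routing for one target word is redirected toward a harmless
# anchor concept before the late layers assemble meaning.

EARLY_FEATURES = {
    "dog": {"furry", "four legs", "barks"},
    "cat": {"furry", "four legs", "meows"},
    "animal": {"furry", "four legs"},  # generic anchor concept
}

CONCEPTS = {
    frozenset({"furry", "four legs", "barks"}): "Dog",
    frozenset({"furry", "four legs", "meows"}): "Cat",
    frozenset({"furry", "four legs"}): "Generic Animal",
}

# The only edit: which target words get misdirected, and to where.
MISDIRECT = {"dog": "animal"}

def early_layers(word):
    # Ingredients are never deleted; only the routing changes.
    return EARLY_FEATURES[MISDIRECT.get(word, word)]

def late_layers(features):
    return CONCEPTS.get(frozenset(features), "Unknown")

print(late_layers(early_layers("dog")))  # Generic Animal
print(late_layers(early_layers("cat")))  # Cat (untouched)
```

The target concept is gone, yet every ingredient it relied on is still available to its neighbors.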

Why This is a Big Deal

  • Precision: It's like removing a specific spice from a recipe without changing the pot or the stove. You get rid of the "spicy" taste, but the soup still tastes great.
  • Speed & Cost: It's incredibly fast. Instead of retraining the whole AI (which takes days and supercomputers), HiRM just tweaks a tiny part of the text processor. It's like changing a single line of code in a massive program.
  • Versatility: Because it only changes the "Librarian" (the text encoder), you can take this fix and apply it to any version of the art studio, even the newest, most powerful ones (like Flux), without needing to retrain them.
  • Safety: It works great against "jailbreak" attempts. Even if someone tries to trick the AI with weird prompts to draw nudity or copyrighted art, HiRM's "misdirection" keeps the AI from following the bad orders, while still letting it draw beautiful, safe pictures.
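The versatility point above can be sketched the same way: because only the text encoder is edited, any generator that consumes its output inherits the erasure. The code below uses stand-in functions, not real diffusion-model APIs; in practice the encoder would hand dense embeddings to a diffusion backbone.

```python
# Toy sketch: edit one component (the text encoder) and every
# pipeline that reads its output inherits the erasure without
# retraining. All components here are illustrative stand-ins.

def erased_encoder(prompt):
    # The single edited component: misdirect the target concept.
    if "dog" in prompt:
        prompt = prompt.replace("dog", "generic animal")
    return f"embedding({prompt})"

def make_pipeline(name, text_encoder):
    """A generator 'backbone' that consumes encoder output unchanged."""
    def generate(prompt):
        return f"{name} renders {text_encoder(prompt)}"
    return generate

# Two different backbones share the same edited encoder.
sd = make_pipeline("StableDiffusion-like", erased_encoder)
flux = make_pipeline("Flux-like", erased_encoder)

print(sd("a dog in a park"))
print(flux("a dog in a park"))
print(sd("a cat in a park"))  # unrelated concepts pass through untouched
```

One edit, applied once, propagates to every model built around that encoder.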

The Bottom Line

HiRM is a surgical tool. Instead of hacking away at the whole machine to remove one bad idea, it gently redirects the AI's thoughts at the very moment the idea is formed. It stops the bad stuff from happening while keeping the good stuff (creativity, quality, and speed) perfectly intact.
