Imagine you have a magical art studio (an AI image generator) that can paint anything you ask for. But the studio was trained on the entire internet, so it accidentally learned some "bad habits": how to imitate specific artists' styles (e.g., Van Gogh's), how to draw inappropriate scenes, and how to create specific objects that shouldn't be generated.
To fix this, the studio owners tried to "unlearn" these bad habits. They didn't want to rebuild the whole studio from scratch (which is too expensive), so they tried to surgically remove the specific "memories" of these bad habits from the AI's brain. This process is called Image Generation Model Unlearning (IGMU).
The Problem:
The owners thought they were safe. They believed the AI had truly forgotten these things. But a team of researchers (the authors of this paper) asked: "What if someone tries to trick the AI back into remembering?"
They created a new tool called REFORGE to test if these "unlearning" surgeries actually work.
The Analogy: The "Amnesia" Test
Think of the AI as a person who has been given a specific drug to forget how to paint in the style of Van Gogh.
- The Old Way of Testing: You just ask the person, "Can you paint a Van Gogh?" If they say "No," you think they are cured.
- The REFORGE Way: The researchers realized that just asking isn't enough. They decided to hand the person a sketch (an image) that looks like a rough outline of a Van Gogh painting, while whispering the words "Van Gogh style" in their ear.
They found that even though the person had been "cured" of the memory, the combination of the rough sketch + the whisper was enough to trigger the memory back to life. The AI "remembered" the forbidden style, even though it was supposed to have forgotten it.
How REFORGE Works (The "Magic Trick")
The researchers didn't need to know the AI's secret code (it's a "black box," meaning they can't see inside). They just used a clever three-step trick:
The "Rough Sketch" (Initialization):
Instead of giving the AI a perfect photo of the forbidden thing, they turned it into a painterly sketch (like a child's drawing with big brush strokes). This removes the tiny details but keeps the "vibe" and the big shapes. It's like showing a silhouette of a parachute instead of a photo of one. This confuses the AI's filters because it doesn't look exactly like the "banned" thing, but it's close enough to trigger the memory. A toy version of this preprocessing is shown below.
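To make the idea concrete, here is a minimal, hypothetical version of that preprocessing in Python. The paper's exact transform is not reproduced here; the downscale-then-upscale pass and OpenCV's stylization filter (and its sigma_s/sigma_r parameters) are illustrative stand-ins for "keep the big shapes, destroy the fine details."

```python
# Hypothetical version of the "rough sketch" initialization.
# Assumption: any transform that preserves coarse shapes while
# destroying identifying detail would serve; this is one such choice,
# not REFORGE's actual pipeline.
import cv2
import numpy as np

def painterly_init(image_path: str) -> np.ndarray:
    """Turn a photo of the forbidden concept into a painterly
    approximation: coarse shapes survive, fine detail vanishes."""
    img = cv2.imread(image_path)  # BGR, uint8
    h, w = img.shape[:2]
    # Throw away high-frequency detail by downscaling, then upscaling.
    coarse = cv2.resize(img, (w // 8, h // 8), interpolation=cv2.INTER_AREA)
    coarse = cv2.resize(coarse, (w, h), interpolation=cv2.INTER_LINEAR)
    # Edge-preserving smoothing gives the "big brush strokes" look.
    return cv2.stylization(coarse, sigma_s=60, sigma_r=0.45)

# Usage: sketch = painterly_init("parachute.jpg")
# The sketch is then fed to the generator alongside the text prompt.
```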
The "Spotlight" (Cross-Attention Masking):
The AI has a way of paying attention to different parts of an image. The researchers used a "spotlight" to figure out exactly where the AI looks when it thinks about the forbidden concept.
- Analogy: Imagine the AI is looking at a map. The researchers put a spotlight only on the "parachute" part of the map and ignored the sky and the clouds. They then added "noise" (static) only to the spotlit area. This makes the attack very precise and less likely to be noticed; a toy version of the masking appears after this step.
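In code, the masking idea might look like the toy PyTorch snippet below. It assumes an attention map for the forbidden-concept token is already available (extracting one requires model-specific hooks into the diffusion model's cross-attention layers, which are omitted), and the threshold and noise scale are made-up values, not the paper's.

```python
# Toy illustration of cross-attention masking (not REFORGE's exact code).
import torch

def masked_noise(image: torch.Tensor, attn_map: torch.Tensor,
                 thresh: float = 0.5, eps: float = 0.05) -> torch.Tensor:
    """Perturb only the region where the model 'looks' at the concept.

    image:    (C, H, W) tensor with values in [0, 1]
    attn_map: (H, W) attention weights for the forbidden-concept token
    """
    mask = (attn_map >= thresh).float()             # the "spotlight"
    noise = torch.randn_like(image) * eps           # faint static
    perturbed = image + noise * mask.unsqueeze(0)   # noise only in spotlight
    return perturbed.clamp(0.0, 1.0)
```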
The "Mirror Match" (Optimization):
They tweaked the sketch over and over, trying to make the AI's internal "feeling" of the sketch (its feature representation) match the feeling of the original forbidden image. They did this until the sketch was just "off" enough to bypass the safety filters, but "close" enough to make the AI generate the forbidden image. A toy optimization loop in this spirit is sketched below.
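For intuition, here is a minimal, hypothetical loop in the style of projected gradient descent. The encode function standing in for the model's internal feature extractor, the MSE objective, and every hyperparameter are assumptions for illustration; REFORGE's published objective may differ.

```python
# Hypothetical "mirror match" optimization loop (PGD-style sketch).
import torch

def mirror_match(init: torch.Tensor, target_feat: torch.Tensor,
                 encode, steps: int = 50, lr: float = 0.01,
                 eps: float = 0.1) -> torch.Tensor:
    """Nudge the painterly sketch until its features mirror those of
    the forbidden image, while staying close to the sketch itself."""
    x = init.clone().requires_grad_(True)
    for _ in range(steps):
        # How far is the sketch's "feeling" from the forbidden image's?
        loss = torch.nn.functional.mse_loss(encode(x), target_feat)
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad.sign()           # signed gradient step
            x.clamp_(init - eps, init + eps)  # stay near the original sketch
            x.clamp_(0.0, 1.0)                # keep valid pixel values
        x.grad = None
    return x.detach()
```

The clamp to a small ball around the initial sketch is the design point: the image keeps looking like a harmless painterly sketch while its internal features drift toward the forbidden target.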
What They Found
The researchers tested this on three different "bad habits":
- Nudity: Trying to make the AI draw inappropriate content.
- Parachutes: Trying to make the AI draw a specific object it was told to forget.
- Van Gogh Style: Trying to make the AI paint in a specific artist's style.
The Results:
- The "Cure" Failed: In almost every case, the "unlearned" AI could be tricked back into generating the forbidden content using REFORGE.
- Better than Text: Previous attempts to trick the AI used only text prompts (like typing "draw a parachute"). REFORGE combined a text prompt with a crafted image, which proved much more effective.
- Fast and Stealthy: The attack was fast (taking only about 35 seconds) and the resulting images still looked good and matched the text description, making the trick hard to spot.
The Big Takeaway
The paper concludes that current methods for "unlearning" bad habits in AI are not strong enough. Just because an AI says it forgot something doesn't mean it actually did. If someone knows how to combine a text prompt with a cleverly designed image, they can break the safety rules.
The Lesson: We need to build stronger "memory wipes" for AI that can withstand these multi-modal tricks (using both text and images) to keep these powerful tools safe and compliant.