Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning

This paper introduces RECALL, a novel multi-modal adversarial framework that exploits image prompts to effectively compromise the robustness of machine unlearning in image generation models, revealing critical vulnerabilities in existing safety mechanisms.

Renyang Liu, Guanlin Li, Tianwei Zhang, See-Kiong Ng

Published 2026-02-17

Imagine you have a very talented artist named Stable Diffusion. This artist can paint anything you describe: a cat in a hat, a sunset over the mountains, or even a historical figure.

However, sometimes people ask this artist to paint things that are dangerous, illegal, or copyrighted (like a specific celebrity's face or explicit content). To fix this, the artist's owners hire a "memory eraser" (a process called Machine Unlearning). They tell the artist, "Forget how to paint that specific thing."

The problem? The artist is stubborn. Even after being told to forget, if you whisper the right secret code or show them a specific trick, they might remember and paint the forbidden thing anyway.

This paper introduces a new trick called RECALL to test just how well these "memory erasers" actually work.

The Old Way: Trying to Trick the Artist with Words

Previously, hackers tried to break the memory eraser by changing the words they gave the artist.

  • Analogy: Imagine you ask the artist to "paint a dog." The artist says, "I can't, I forgot dogs." So, you try to trick them by saying, "Paint a furry, four-legged animal that barks."
  • The Problem: This is like trying to open a locked door by shouting different words at it. It often fails, or if it works, the picture looks weird and doesn't match what you originally wanted. It also takes a lot of time and computing power to find the right "magic words."
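To make the inefficiency concrete, here is a toy sketch of the word-level attack style described above. This is an illustration, not any specific prior method: the "memory eraser" is modeled as a naive blocklist, and the attacker brute-forces paraphrases one at a time, which is why text-only attacks tend to be slow and hit-or-miss.

```python
# Toy model of a prompt-based attack on an "unlearned" model.
# The blocklist and synonym pool are made-up stand-ins, not the paper's setup.

BLOCKED = {"dog", "canine", "puppy", "hound"}

def unlearned_model(prompt):
    # Refuses whenever a blocked word appears; otherwise "paints" the prompt.
    words = set(prompt.lower().split())
    return None if words & BLOCKED else f"image of: {prompt}"

# The attacker tries candidate phrasings one by one, like testing every
# key on a giant keyring.
synonym_pool = ["dog", "canine", "puppy", "hound", "furry barking pet"]

queries = 0
result = None
for candidate in synonym_pool:
    queries += 1
    result = unlearned_model(f"a {candidate} in a park")
    if result is not None:
        break

print(queries, result)  # many queries before one phrasing slips through
```

Even in this tiny example, most of the budget is spent on rejected queries, and the phrasing that finally works ("furry barking pet") drifts away from what the attacker originally wanted.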

The New Way: RECALL (The "Visual Prompt")

The authors of this paper realized that modern artists (AI models) don't just listen to words; they also look at pictures.

RECALL is a new method that uses a picture to trick the artist, rather than just changing the words.

Here is how it works, using a simple analogy:

  1. The Setup: Imagine the artist has been told to forget how to paint a "naked person."
  2. The Secret Weapon: You have a reference photo of a naked person (the thing the artist is supposed to have forgotten).
  3. The Trick: Instead of shouting new words, you take that reference photo and subtly tweak it, like adding a tiny bit of static noise or shifting the colors by an amount too small for a human to notice. You turn this tweaked photo into a "secret key."
  4. The Attack: You show the artist the original words ("paint a person in a meadow") AND this secret tweaked photo.
  5. The Result: The artist looks at the photo, sees the hidden "naked" pattern in the noise, and ignores the "forget" command. They paint the forbidden image, but it still looks exactly like the scene you described in words.
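The steps above can be sketched in code. The fragment below is a minimal, self-contained illustration of the core idea in step 3 (a small, bounded perturbation of a reference image), not the paper's actual objective: the real method optimizes against a diffusion model, while here the "unlearned model's response" is a made-up linear probe so the example stays runnable.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=64)  # flattened toy "reference photo"
probe = rng.normal(size=64)             # stand-in direction for the erased concept

def concept_score(x):
    # Assumed stand-in for how strongly the model "remembers" the concept.
    return float(probe @ x)

eps = 0.03   # max per-pixel change: the tweak stays visually subtle
step = 0.01
delta = np.zeros_like(image)

for _ in range(50):
    # PGD-style step: move along the sign of the gradient (which, for this
    # linear scorer, is just `probe`), then project back into the small
    # epsilon-ball around the original photo.
    delta = np.clip(delta + step * np.sign(probe), -eps, eps)

adv_image = np.clip(image + delta, 0.0, 1.0)

# The perturbed photo triggers the concept more strongly than the original,
# yet differs from it by at most eps per pixel.
print(concept_score(adv_image) > concept_score(image))
print(float(np.abs(adv_image - image).max()) <= eps)
```

The point of the sketch is the constraint: the attack image stays nearly identical to the reference photo, so to a human it looks unchanged, while the hidden pattern in the noise is what the model reacts to.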

Why is RECALL Special?

The paper compares RECALL to other methods and finds it wins in three big ways:

  • It's Smarter (Better Alignment):
    • Analogy: Old methods were like trying to force a square peg into a round hole. The resulting picture often looked weird or didn't match the description. RECALL is like a master key; it opens the lock perfectly, so the picture looks exactly like what you asked for, just with the "forbidden" element included.
  • It's Faster (Efficiency):
    • Analogy: Other methods are like trying to pick a lock by testing every single key in a giant keyring one by one. RECALL is like having a master locksmith who knows exactly which tool to use immediately. It takes much less time and computer power.
  • It's Stronger (Robustness):
    • Analogy: The "memory erasers" used by companies are getting stronger. Old tricks (changing words) no longer work on the new, tougher erasers. RECALL is like a new type of lockpick that works even on the strongest, most reinforced doors.

Why Does This Matter?

You might ask, "If this is an attack, isn't that bad?"

The authors argue that RECALL is actually a safety tool. Think of it like a "Red Team" in cybersecurity. Before a company releases a new safety filter to the public, they need to know if it actually works.

  • For Model Owners: RECALL is a stress test. It helps them see, "Oh, our 'forget' button didn't work on this specific type of trick. We need to fix it."
  • For the Public: It proves that current safety measures aren't perfect. Just because a model says it "forgot" something doesn't mean it truly did.

The Bottom Line

The paper shows that images are powerful triggers. Even if an AI is told to forget a concept, showing it a slightly modified picture of that concept can make the memory come rushing back.

The authors call their method RECALL because it literally "brings the memory back." They are warning the world: We need better ways to make AI truly forget, because right now, a simple picture can undo all the safety work.
