The Problem: The "Parrot" Artist
Imagine you hire a brilliant artist to paint pictures based on your descriptions. You say, "Draw a red sky over a shiny city," and they do a great job.
But there's a catch: This artist has a bad habit. If you ask for a picture of something specific they've seen before (like a famous photo of the Eiffel Tower), they don't actually create a new painting. Instead, they just copy the exact photo they have memorized from their training. They act like a parrot repeating a phrase rather than a creative thinker.
This is the problem with current AI image generators (Diffusion Models). They sometimes "memorize" training data and spit it back out, which is bad for copyright and privacy.
The Old Solutions: The "Blunt Force" Approach
Scientists tried to fix this before, but their methods were like using a sledgehammer to crack a nut:
- The "Blur" Method: They tried to make the AI forget, but this often made the pictures look blurry or weird.
- The "Mute" Method: They tried to silence the AI when it got too close to a memorized image, but this often resulted in the AI ignoring your instructions (e.g., you asked for a "red sky," but the AI forgot to paint the sky at all).
The result? You had to choose between good quality OR no memorization. You couldn't have both.
The New Solution: RADS (The "GPS for Creativity")
The authors of this paper, Sathwik Karnik and team, came up with a clever new system called RADS (Reachability-Aware Diffusion Steering).
Think of the AI's creative process not as a magic trick, but as driving a car down a winding mountain road.
- The Destination: The final image you want.
- The Road: The step-by-step process the AI takes to turn random noise into a picture.
- The Danger Zone: A deep, sticky "basin" on the side of the road. If the car falls into this basin, it gets stuck and can only produce the memorized, copied image.
How RADS Works: The "Safety GPS"
RADS acts like a super-smart GPS that knows exactly where the "Danger Zones" (memorized basins) are located before the car even gets there.
Mapping the Danger (Reachability Analysis):
Using math from control theory (usually used for self-driving cars), RADS calculates a "Backward Reachable Tube." Imagine this as a glowing red fence around the Danger Zone. It tells the system: "If you are at this point on the road, no matter how you steer, you are going to fall into the memorization trap."

The Reinforcement Learning Driver:
RADS trains a tiny "driver" (an AI policy) using Reinforcement Learning. This driver's job is simple:
- Goal: Drive to the destination (create a beautiful image that matches your text).
- Constraint: Do not cross the red fence (do not enter the memorization trap).
- Method: The driver makes tiny, almost invisible adjustments to the "steering wheel" (the text description) to nudge the car away from the red fence.
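The two ideas above can be sketched as a toy grid world: the road drifts forward one column per step, steering moves one row up or down, and the basin is a block of unsafe cells. Everything here (the grid, `successors`, `doomed`, the greedy driver) is invented for illustration only; the actual method works on continuous diffusion latents with learned policies, not a grid.

```python
# Toy sketch: (1) compute a backward reachable tube by backward induction,
# (2) steer a "car" around it. Invented setup, not the paper's formulation.

WIDTH, HEIGHT = 10, 7          # the "road": x drifts forward, y is steering
UNSAFE = {(x, y) for x in range(3, 6) for y in range(1, 4)}  # the basin

def successors(x, y):
    """One forward step with steering u in {-1, 0, +1}."""
    return [(x + 1, y + u) for u in (-1, 0, 1) if 0 <= y + u < HEIGHT]

# Backward reachable tube ("red fence"): states from which EVERY steering
# choice eventually lands in the basin. Since x only moves forward, one
# right-to-left pass suffices.
doomed = set(UNSAFE)
for x in range(WIDTH - 2, -1, -1):
    for y in range(HEIGHT):
        if (x, y) not in doomed and all(s in doomed for s in successors(x, y)):
            doomed.add((x, y))

# A simple "driver": among successors outside the fence, pick the one
# closest to the goal row (the destination the prompt asks for).
GOAL_ROW = 2
x, y = 0, 2
path = [(x, y)]
while x < WIDTH - 1:
    safe = [s for s in successors(x, y) if s not in doomed]
    x, y = min(safe, key=lambda s: abs(s[1] - GOAL_ROW))
    path.append((x, y))

print((2, 2) in doomed and (2, 2) not in UNSAFE)  # the fence is wider than the basin
print(all(p not in UNSAFE for p in path))          # the car never falls in
print(path[-1][1] == GOAL_ROW)                     # and still reaches the goal row
```

Note how the fence extends one cell in front of the basin: cell (2, 2) is not itself unsafe, but every steering choice from it leads into the basin, so the driver must swerve before reaching it. That is the key difference from methods that only react once memorization has already started.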
The Result:
The car stays on the safe path. It creates a brand new, unique image that looks great and follows your instructions, but it never falls into the trap of copying the old photo.
Why This is a Big Deal
- No Quality Loss: Unlike previous methods that made pictures look bad, RADS keeps the image sharp and beautiful.
- No Retraining: You don't have to re-teach the whole AI model. RADS is a "plug-and-play" add-on that works while the AI is generating the image.
- Robustness: It works even if you start with different random "noise" (different starting points). It always finds a safe path.
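The "plug-and-play" point can be made concrete with a toy sampler: the steering policy is just a per-step hook that edits the prompt embedding at generation time, while the (frozen) model step is never retrained. All names here (`denoise`, `steer`, `sample`) are invented stand-ins, not the paper's API.

```python
# Toy sketch of plug-and-play steering: the hook only touches the prompt
# embedding during sampling; the model step itself is untouched.

def denoise(latent, prompt_embedding):
    """Stand-in for one frozen diffusion-model step (weights never change)."""
    return latent * 0.9 + sum(prompt_embedding) * 0.01

def steer(latent, prompt_embedding, step):
    """Stand-in for the learned policy: a tiny per-step edit to the
    prompt embedding, applied only at sampling time."""
    return [e + 0.001 * step for e in prompt_embedding]

def sample(prompt_embedding, steps=30, steering=None):
    latent = 1.0                          # stand-in for the initial random noise
    for t in range(steps):
        if steering is not None:          # the optional add-on hook
            prompt_embedding = steering(latent, prompt_embedding, t)
        latent = denoise(latent, prompt_embedding)
    return latent

plain = sample([0.5, 0.5])                    # the original sampler, unchanged
steered = sample([0.5, 0.5], steering=steer)  # same sampler + steering add-on
```

The design point is that `sample` is the same code either way: steering drops in as an optional argument, so the generator needs no retraining and works unmodified when the hook is absent.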
The Analogy Summary
- Old Way: Trying to stop the artist from copying by blinding them or tying their hands. (Result: Bad art).
- RADS Way: Giving the artist a map that highlights the "copying traps" and teaching them a new way to walk around those traps while still painting a masterpiece.
In short, RADS teaches the AI to be creative instead of repetitive, without sacrificing the quality of the art. It's like teaching a student to solve a math problem on their own, rather than just letting them copy the answer key.