EraseAnything++: Enabling Concept Erasure in Rectified Flow Transformers Leveraging Multi-Object Optimization

Imagine you have an incredibly talented, super-smart artist named Flux (or OpenSora for videos). This artist can paint anything you describe: a cat in a spacesuit, a sunset over Paris, or a historical figure. They are so good because they've learned from billions of images on the internet.

But there's a problem. Because they learned from everything, they also learned some things we don't want them to draw, like nudity, violence, or copyrighted characters. If you ask them to "draw a girl," they might accidentally draw her naked. If you ask for a "red rose," they might accidentally include a specific copyrighted logo.

We want to teach this artist to forget those specific bad things without making them forget how to draw anything else. This is called Concept Erasure.

The Old Way: The "Brute Force" Sledgehammer

Previous methods tried to fix this by taking a sledgehammer to the artist's brain. They would say, "Stop thinking about 'naked'!" and physically delete the neurons associated with that word.

The Problem: It's like trying to remove a specific ingredient from a complex soup by smashing the whole pot. You might stop the "naked" part, but you also ruin the "girl" part, the "hair" part, and the "lighting." The artist becomes confused, produces blurry or weird images, or forgets how to draw things that are totally safe.
The Video Problem: When you ask the artist to make a video, the problem gets worse. Even if you fix the first frame, the "bad idea" leaks into the next frame, and the next. The video starts with a safe image, but by the end, the character has morphed back into the unsafe version.

The New Solution: EraseAnything++

The authors of this paper created a new, sophisticated method called EraseAnything++. Think of it not as a sledgehammer, but as a high-precision laser scalpel guided by a very smart GPS.

Here is how it works, using simple analogies:

1. The "Tightrope Walker" (Multi-Objective Optimization)

Imagine the artist is walking a tightrope.

On the left side is a pit of Bad Concepts (what we want to erase).
On the right side is a pit of Good Concepts (what we want to keep).
The goal is to lean hard enough to push the "Bad" side away without tipping over and damaging anything on the "Good" side.

Old methods just pushed hard in one direction and fell off. EraseAnything++ uses a mathematical "balancing act." It constantly checks: "If I push the 'naked' concept away, am I accidentally pushing the 'girl' concept away too?" If the answer is yes, it gently adjusts its step to stay safe. This ensures the artist forgets the bad stuff but remembers everything else perfectly.

2. The "Semantic Spy" (Handling the New Artist Style)

The new artists (Flux/OpenSora) don't speak the same language as the old ones. They use a complex dictionary (T5) where words are connected by meaning, not just spelling.

The Trap: If you tell the old artist to forget "naked," they might still draw it if you say "unclothed" or "nude."
The Fix: The new method uses an AI "Spy" (a Large Language Model) to find all the tricky synonyms and related words. It then teaches the artist: "No matter how you say it, if it means 'naked,' don't draw it." It forces the artist to disconnect the idea of nudity from the image, not just the specific word.

3. The "Anchor and Propagate" (For Videos)

This is the magic trick for videos. Imagine you are editing a 10-second movie.

The Anchor: You start by perfectly cleaning the very first frame. You make sure the "bad concept" is completely gone right at the start.
The Propagate: Now, you have to make sure the bad concept doesn't sneak back in during seconds 2 through 10. The method acts like a "guardian" that locks the first frame's safety and forces every subsequent frame to follow that same rule. It stops the "bad idea" from leaking through time, ensuring the video stays clean from start to finish.

Why is this a Big Deal?

It's Precise: It removes the specific bad thing (like nudity) without ruining the rest of the picture (like the background or the person's face).
It's Robust: It is much harder to trick by rephrasing a request or using unusual synonyms, because it targets the meaning rather than a specific word.
It Works for Movies: It solves the "temporal drift" problem where bad things reappear in later frames of a video.

The Bottom Line

EraseAnything++ is like giving a super-intelligent artist a set of strict, smart rules. Instead of smashing their brain to make them forget, it gently guides them to ignore specific dangerous topics while keeping their incredible talent for everything else intact. It makes AI safer for everyone without making it less creative.

1. Problem Statement

The rapid evolution of generative AI has shifted from traditional U-Net-based diffusion models (e.g., Stable Diffusion v1/2) to Flow-Matching Transformers (e.g., Flux, Stable Diffusion 3, OpenSora). While these next-generation models offer superior prompt adherence, image fidelity, and video consistency, they introduce new challenges for Concept Erasure (CE)—the targeted removal of unsafe or undesired concepts (e.g., nudity, copyrighted styles) while preserving general generative capabilities.

Existing CE methods fail in this new paradigm due to three primary barriers:

Architectural Mismatch: Classical methods (e.g., ESD, CA) rely on manipulating explicit cross-attention layers in U-Nets. Modern Flow-based Transformers (like Flux) use unified attention mechanisms where text and image tokens are concatenated, making direct weight manipulation ineffective and leading to "concept residue."
Encoder Limitations: Modern models use T5 text encoders (sentence-level) rather than CLIP (word-level). T5 embeddings are less reliable for semantic similarity checks, making it difficult to identify and preserve irrelevant concepts or generate effective adversarial prompts.
Temporal Drift in Video: In Text-to-Video (T2V) models, erasing a concept in the first frame does not guarantee its removal in subsequent frames due to temporal attention mechanisms, leading to "concept drift" and error accumulation.
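
To make the first barrier concrete, the sketch below shows where text-conditioning lives when text and image tokens share one attention operation. It is a minimal NumPy illustration, not Flux's actual implementation: it assumes text tokens come first in the concatenated sequence, uses a single head with no learned projections, and all function names are hypothetical. The point is that the image-query-to-text-key block of the joint attention map plays the role that a separate cross-attention map played in U-Nets.

```python
import numpy as np

def joint_attention(q, k):
    """Softmax attention weights over one concatenated [text; image] sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def image_to_text_block(attn, n_text):
    """Rows are image-token queries, columns are text-token keys.

    Assumes the first n_text positions of the sequence are text tokens.
    """
    return attn[n_text:, :n_text]

def target_token_mass(attn, n_text, target_idx):
    """Average attention that image tokens pay to the chosen text tokens."""
    block = image_to_text_block(attn, n_text)
    return float(block[:, target_idx].sum(axis=-1).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_text, n_img, dim = 4, 6, 8
    tokens = rng.normal(size=(n_text + n_img, dim))
    attn = joint_attention(tokens, tokens)
    print(target_token_mass(attn, n_text, [1, 2]))
```

A quantity like `target_token_mass` is the kind of term an attention regularizer could penalize for the tokens of the concept being erased.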

2. Methodology

EraseAnything++ proposes a unified framework that treats concept erasure as a Constrained Multi-Objective Optimization (MOO) problem. The goal is to maximize removal of a target concept, measured by an erasure objective ($L_e$), while strictly bounding the degradation of unrelated concepts, measured by a preservation objective ($L_p$).

A. Unified Optimization Framework (Implicit Gradient Surgery)

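Instead of standard loss weighting, the authors formulate each update as a constrained problem:
$\max_{d_t} \nabla L_e \cdot d_t - \frac{1}{2}\|d_t\|^2 \quad \text{s.t.} \quad \nabla L_p \cdot d_t \geq -\epsilon_t$
where $d_t$ is the update direction at step $t$ and $\epsilon_t$ is a tolerance threshold that bounds how negative the inner product between the update and the preservation gradient may become.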
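
The problem as stated is a small quadratic program with a single linear constraint, so its KKT conditions can be worked out by hand: stationarity gives $d_t = \nabla L_e + \lambda \nabla L_p$, and the constraint is either inactive ($\lambda = 0$) or tight, which gives $\lambda = \max\big(0, -(\epsilon_t + \nabla L_e \cdot \nabla L_p) / \|\nabla L_p\|^2\big)$. The NumPy sketch below implements that algebra on flattened gradient vectors. It is derived from the formulation as written here, not taken from the authors' code, and the function name is illustrative.

```python
import numpy as np

def explicit_direction(g_e, g_p, eps):
    """Maximize g_e.d - 0.5*||d||^2 subject to g_p.d >= -eps.

    g_e, g_p: flattened gradients of the erasure and preservation objectives.
    Returns the direction d and the Lagrange multiplier lam >= 0.
    """
    denom = float(g_p @ g_p)
    if denom == 0.0:
        # No preservation gradient: the constraint cannot be influenced by d.
        return g_e.copy(), 0.0
    # lam is zero when d = g_e already satisfies the constraint.
    lam = max(0.0, -(eps + float(g_e @ g_p)) / denom)
    return g_e + lam * g_p, lam

if __name__ == "__main__":
    g_e = np.array([1.0, 0.0])
    g_p = np.array([-1.0, 0.0])   # directly conflicts with g_e
    d, lam = explicit_direction(g_e, g_p, eps=0.1)
    print(d, lam, g_p @ d)        # the constraint ends up tight: g_p.d == -eps
```

When the two gradients do not conflict, the multiplier is zero and the erasure gradient is used unchanged; the correction is applied only to the erasure side, which is what makes the scheme asymmetric.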

Explicit Solution: They derive a closed-form solution using Lagrange multipliers to project the erasure gradient onto the preservation constraint.
Implicit Efficient Surgery: To avoid the computational cost of calculating two separate gradients, they introduce an Implicit Gradient Surgery method. This approximates the optimal update direction using a single backpropagation pass by dynamically updating a dual variable ($\lambda_t$) based on the observed drift in preservation loss. This ensures scalability for large models.
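
One way to picture the implicit variant is as a primal-dual loop: take a step on the combined scalar $L_e + \lambda_t L_p$ (one backward pass in an autodiff framework), then raise or lower $\lambda_t$ depending on how much the preservation loss drifted. The sketch below is an assumption-laden toy, not the paper's algorithm: the dual update rule, the step sizes, the descent convention `theta - lr * d`, and every name in it are illustrative choices, and the two objectives are simple quadratics with analytic gradients.

```python
import numpy as np

def implicit_surgery(theta, grad_combined, loss_p, steps=50,
                     lr=0.1, eps=0.01, rho=0.5):
    """Toy primal-dual loop with a single combined gradient per step.

    grad_combined(theta, lam) stands in for one backward pass on
    L_e + lam * L_p. loss_p(theta) is a forward-only evaluation.
    """
    lam = 0.0
    prev_lp = loss_p(theta)
    history = []
    for _ in range(steps):
        d = grad_combined(theta, lam)
        theta = theta - lr * d
        # Observed drift of the preservation loss, per unit step size.
        lp = loss_p(theta)
        drift = (lp - prev_lp) / lr
        # Dual ascent: grow lam when drift exceeds the tolerance, else shrink.
        lam = max(0.0, lam + rho * (drift - eps))
        prev_lp = lp
        history.append((lp, lam))
    return theta, lam, history

if __name__ == "__main__":
    a = np.array([1.0, 0.0])   # minimizer of the toy erasure objective
    b = np.array([0.0, 0.0])   # minimizer of the toy preservation objective
    loss_e = lambda th: 0.5 * float((th - a) @ (th - a))
    loss_p = lambda th: 0.5 * float((th - b) @ (th - b))
    grad = lambda th, lam: (th - a) + lam * (th - b)
    theta, lam, hist = implicit_surgery(b.copy(), grad, loss_p)
    print(loss_e(b), loss_e(theta), lam)
```

In this toy the multiplier rises while the preservation loss is climbing quickly and relaxes once the per-step drift falls under the tolerance, so only one gradient evaluation is needed per iteration.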

B. Image Generation (Flux)

Erasure Objective ($L_e$): Combines a modified Flow-Matching ESD loss (pushing velocity predictions toward unconditional trajectories) with Attention Map Regularization. The latter forces sparsity on attention weights corresponding to target tokens, even without explicit cross-attention layers.
Preservation Objective ($L_p$):
- LoRA-based Retention: Uses Low-Rank Adaptation to ensure the model retains the ability to generate unrelated concepts.
- Reverse Self-Contrastive (RSC) Loss: A novel loss function that leverages Large Language Models (LLMs) to generate irrelevant concepts. It forces the model to align the target concept's features with those of the irrelevant concepts while repelling them from the features of the target's own synonyms. This prevents "concept recovery" via synonym attacks.
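
The two less familiar terms above can be sketched in a few lines. Both functions are assumptions about the general shape of the losses rather than the paper's exact definitions: the erasure target follows the original ESD negative-guidance form with noise predictions swapped for velocity predictions, and the RSC term is written as an InfoNCE-style contrast in which irrelevant concepts act as positives and synonyms as negatives. The names, the `eta` and `tau` parameters, and the use of cosine similarity are all illustrative.

```python
import numpy as np

def esd_velocity_target(v_uncond, v_cond, eta=1.0):
    """ESD-style target: move away from the concept-conditioned velocity.

    v_uncond, v_cond: velocities from a frozen copy of the model, without
    and with the target concept in the prompt (assumed inputs).
    """
    return v_uncond - eta * (v_cond - v_uncond)

def erasure_loss(v_pred, v_uncond, v_cond, eta=1.0):
    """MSE between the trainable model's prediction and the ESD-style target."""
    target = esd_velocity_target(v_uncond, v_cond, eta)
    return float(np.mean((v_pred - target) ** 2))

def _cosine(a, B):
    """Cosine similarity between vector a and each row of B."""
    a = a / np.linalg.norm(a)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return B @ a

def rsc_loss(f_target, f_irrelevant, f_synonyms, tau=0.1):
    """Contrastive loss with reversed roles: irrelevant = positives,
    synonyms = negatives. Lower is better for erasure."""
    pos = np.exp(_cosine(f_target, f_irrelevant) / tau).sum()
    neg = np.exp(_cosine(f_target, f_synonyms) / tau).sum()
    return float(-np.log(pos / (pos + neg)))

if __name__ == "__main__":
    f = np.array([1.0, 0.0])
    near, far = np.array([[1.0, 0.0]]), np.array([[-1.0, 0.0]])
    print(rsc_loss(f, near, far), rsc_loss(f, far, near))
```

The loss is small when the target's features sit near the irrelevant concepts and large when they still sit near its synonyms, which is the behavior the description calls for.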

C. Video Generation (OpenSora)

To address temporal consistency, the authors introduce an Anchor-and-Propagate strategy:

Anchor: The erasure and preservation objectives are fully optimized on the initial reference frame (Anchor) to establish a "clean" seed.
Propagate: The learned suppression is propagated through the 3D spatio-temporal attention layers of the Video DiT.
Volumetric Losses: The ESD and Attention regularization losses are extended to operate on the full video volume ($T \times H \times W \times C$), ensuring the concept does not re-emerge in later frames due to temporal leakage.
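
A small numerical illustration of why the loss needs to cover the whole volume: if only the anchor frame is scored, residue that appears in a later frame is invisible to the objective. The sketch assumes video tensors laid out as `(T, H, W, C)` and uses plain mean-squared error as a stand-in for the actual erasure terms; the function names are hypothetical.

```python
import numpy as np

def anchor_loss(pred, target):
    """MSE on the first (anchor) frame only."""
    return float(np.mean((pred[0] - target[0]) ** 2))

def volumetric_loss(pred, target):
    """MSE over the full T x H x W x C volume."""
    return float(np.mean((pred - target) ** 2))

def per_frame_error(pred, target):
    """Per-frame MSE, useful for spotting where a concept re-emerges."""
    return np.mean((pred - target) ** 2, axis=(1, 2, 3))

if __name__ == "__main__":
    T, H, W, C = 4, 2, 2, 3
    target = np.zeros((T, H, W, C))
    pred = np.zeros((T, H, W, C))
    pred[3] += 1.0  # simulated residue that only shows up in the last frame
    print(anchor_loss(pred, target))      # blind to the late-frame residue
    print(volumetric_loss(pred, target))  # penalizes it
    print(per_frame_error(pred, target))
```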

3. Key Contributions

Unified MOO Framework: The first work to formally define concept erasure in Flow-Matching Transformers as a constrained multi-objective optimization problem, providing a theoretical guarantee for balancing erasure and utility.
Implicit Gradient Surgery: An efficient algorithm that resolves gradient conflicts between erasure and preservation with the computational cost of a single backpropagation, making it feasible for 12B+ parameter models.
RSC Loss & LLM Integration: A novel mechanism to handle the T5 encoder's limitations by using LLMs to dynamically select preservation concepts and a contrastive loss to break the semantic link between a word and its visual representation.
Spatio-Temporal Consistency: The "Anchor-and-Propagate" mechanism specifically designed for Video DiTs to prevent temporal drift and concept hallucination in long-horizon video generation.

4. Experimental Results

The method was evaluated on Flux.1 [dev] (Image) and OpenSora-v2 (Video) across multiple benchmarks:

Image Erasure (Nudity & Styles):
- On the I2P dataset (nudity), EraseAnything++ achieved the lowest number of detections (182 explicit bodies detected vs. 605 in the baseline) while maintaining superior FID (21.67) and CLIP scores compared to UCE and ESD.
- On the 200-artist dataset, it achieved the highest trade-off score ($H_a = 6.61$), effectively removing artistic styles without degrading other styles.
Robustness: It demonstrated superior resistance to adversarial attacks (e.g., Ring-A-Bell, UnlearnDiffAtk), significantly outperforming baselines by generalizing erasure to the semantic level rather than specific tokens.
Video Erasure:
- Achieved a state-of-the-art 17.29% nudity rate on the Gen dataset, lower than T2VUnlearning (19.73%) and VideoEraser (21.45%).
- Maintained high Subject Consistency (95.21%) and Object Class accuracy, indicating that it does not cause temporal collapse or severe artifacts.
Ablation Studies: Confirmed that the combination of all loss components (ESD, Attention, LoRA, RSC) and the MOO solver is necessary. Linear scalarization and standard MOO solvers (PCGrad, MGDA) failed to balance the asymmetric nature of erasure vs. preservation.
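
For readers unfamiliar with the baselines named in the ablation, the sketch below shows linear scalarization and the two-task case of PCGrad's projection rule in NumPy. These are the standard textbook forms of those baselines, not the configurations used in the paper's experiments, and the function names are illustrative. Note that PCGrad projects both gradients symmetrically, whereas the constrained formulation in Section 2.A only corrects the erasure direction.

```python
import numpy as np

def linear_scalarization(g_e, g_p, w=0.5):
    """Fixed weighted sum of the two gradients."""
    return w * g_e + (1.0 - w) * g_p

def pcgrad_two_task(g_e, g_p):
    """PCGrad for two tasks: when the gradients conflict (negative inner
    product), project each onto the normal plane of the other, then sum."""
    dot = float(g_e @ g_p)
    if dot >= 0.0:
        return g_e + g_p
    g_e_proj = g_e - dot / float(g_p @ g_p) * g_p
    g_p_proj = g_p - dot / float(g_e @ g_e) * g_e
    return g_e_proj + g_p_proj

if __name__ == "__main__":
    g_e = np.array([1.0, 0.0])
    g_p = np.array([-1.0, 1.0])
    print(linear_scalarization(g_e, g_p))
    print(pcgrad_two_task(g_e, g_p))
```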

5. Significance

EraseAnything++ establishes a new state-of-the-art for safe generative AI in the era of Flow-Matching Transformers.

Scalability: It shows that concept erasure is feasible for massive, transformer-based models (12B+ parameters) without catastrophic forgetting.
Generalization: By moving from token-level manipulation to semantic-level optimization (via RSC and MOO), it solves the "concept residue" problem that plagued previous methods.
Video Safety: It addresses a critical gap in video generation safety, providing a robust mechanism to prevent the re-emergence of unsafe content over time, which is essential for the deployment of long-form video generation systems.

The code is publicly available, facilitating further research into safe and controllable generative AI.