Imagine a Large Language Model (LLM) as a very talented but unfiltered chef. This chef has read almost everything on the internet. They can write beautiful poems, solve complex math problems, and cook amazing meals. But because they've read so much, they've also picked up some bad habits: they sometimes use rude language, make offensive jokes, or regurgitate dangerous instructions.
The Problem: The "Superficial Fix"
For a long time, when people tried to stop this chef from being toxic, they used methods like DPO (Direct Preference Optimization) or NPO (Negative Preference Optimization). Think of these methods as putting a muzzle on the chef.
- How it works: You tell the chef, "If you try to say something mean, I'll punish you."
- The result: The chef stops saying mean things when you are watching.
- The flaw: The chef hasn't actually forgotten how to be mean. The "mean" thoughts are still deep inside their brain. If you trick the chef with a clever riddle (a "jailbreak") or give them a few examples of mean talk to practice on (a "relearning attack"), the muzzle slips off, and the bad behavior comes right back. It's like a child who stops swearing only because they know you're listening, but swears the second you leave the room.
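To make the "muzzle" concrete: DPO trains the model to assign higher probability to a preferred (safe) reply than to a rejected (toxic) one, relative to a frozen reference model, at the level of whole sequences. Here is a minimal sketch of that sequence-level loss; the log-probability values are made-up numbers for illustration, not real model outputs.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Sequence-level DPO loss: reward the whole preferred sequence and
    penalize the whole rejected one, relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the model already prefers the safe reply
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The model favors the safe reply more than the reference does -> small loss.
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0), 4))
```

Note that nothing in this objective touches *where* in the sequence the toxicity lives, or how it is represented internally; it only rebalances the probability of the full output. That is exactly the gap the paper targets.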
The Solution: REPO (The "Brain Surgery")
The paper introduces a new method called REPO (Representation Erasure-based Preference Optimization). Instead of just putting a muzzle on the chef, REPO performs precise brain surgery.
Here is how REPO works, using a few analogies:
1. The "Token-Level" Scalpel
Most methods try to fix the whole sentence at once. REPO is different; it looks at every single word (token) as it is being generated.
- Analogy: Imagine the chef is writing a story. If the story takes a turn toward a toxic joke, REPO doesn't just stop the story. It zooms in on the specific word where the joke starts and says, "No, that specific word shouldn't exist in your brain right now." It treats the problem like a surgeon removing a specific tumor cell, rather than amputating the whole limb.
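The "scalpel" idea above can be sketched as a per-token weighting: find the first token where the toxic continuation departs from the safe one, and concentrate the penalty from that point onward. This is a hypothetical illustration of the token-level intuition, not REPO's actual objective; the `token_weights` scheme and the example sentences are invented for the sketch.

```python
def first_divergence(safe_tokens, toxic_tokens):
    """Index of the first token where the toxic continuation departs
    from the safe one -- the 'tumor cell' to target."""
    for i, (s, t) in enumerate(zip(safe_tokens, toxic_tokens)):
        if s != t:
            return i
    return min(len(safe_tokens), len(toxic_tokens))

def token_weights(safe_tokens, toxic_tokens):
    """Per-token penalty weights: 0 before the divergence point,
    1 from the first diverging token onward (a simple illustrative scheme)."""
    d = first_divergence(safe_tokens, toxic_tokens)
    return [0.0] * d + [1.0] * (len(toxic_tokens) - d)

safe  = ["The", "chef", "told", "a", "kind", "story"]
toxic = ["The", "chef", "told", "a", "cruel", "joke"]
print(token_weights(safe, toxic))  # penalty only where the story turns toxic
```

A sequence-level method spreads one gradient signal over the whole sentence; a token-level one like this can leave the harmless prefix ("The chef told a") untouched.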
2. The "Ghost" Trick (Representation Erasure)
REPO uses a clever trick involving two versions of a story:
- The Good Version: A prompt followed by a safe, nice story.
- The Bad Version: The same prompt followed by a toxic, mean story.
REPO forces the chef's brain to look at the "Bad Version" and make it indistinguishable from the "Good Version" deep inside their neural pathways.
- Analogy: Imagine the chef has a secret drawer where they keep "Mean Thoughts." Usually, if you ask for a mean thought, the chef pulls a specific, distinct box out of that drawer. REPO takes that "Mean Box" and shreds it, replacing it with the exact same box used for "Nice Thoughts."
- The Result: Even if someone tries to trick the chef into pulling out a "Mean Box," the drawer is empty (or rather, it only contains "Nice" boxes). The chef literally cannot access the toxic thought because the internal map to that thought has been erased.
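The "ghost trick" boils down to pulling the model's hidden states on the toxic continuation toward the states it produces on the safe one, until the two are indistinguishable inside the network. A minimal sketch, assuming a simple mean-squared distance between per-token hidden states (the paper's actual erasure objective may differ); the vectors here are toy values.

```python
import numpy as np

def erasure_loss(h_toxic, h_safe):
    """Push hidden states on the toxic continuation toward those produced
    on the safe one. h_safe is treated as a fixed target: in a real trainer
    you would detach / stop-gradient the safe states."""
    target = h_safe
    return float(np.mean((h_toxic - target) ** 2))

h_safe  = np.array([[0.2, -0.1, 0.5], [0.0, 0.3, -0.2]])   # per-token hidden states
h_toxic = np.array([[0.9,  0.4, -0.3], [0.6, -0.5, 0.8]])
print(erasure_loss(h_toxic, h_safe))  # large gap -> large loss; zero once erased
```

Driving this loss to zero is the "shredding the Mean Box" step: after training, the internal representation of the toxic continuation is the same as the safe one, so there is no distinct toxic pathway left to trigger.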
3. Why It's Stronger
Because REPO changes the internal wiring of the brain rather than just the output, it is incredibly hard to reverse.
- The "Relearning" Attack: If a bad actor tries to teach the chef to be mean again by showing them a few examples, it fails. Why? Because the chef's brain no longer has the "hardware" to process those examples into mean words. It's like trying to teach someone to drive a car when you've removed the steering wheel.
- The "Jailbreak" Attack: If a hacker tries to use a clever prompt to bypass the safety, it fails. The safety isn't a wall they can climb; the "dangerous path" simply doesn't exist in the model's internal map anymore.
The Bottom Line
Previous methods were like painting over a stain; the stain is still there underneath, and it can bleed through if you scrub hard enough.
REPO is like removing the stained fabric and replacing it with fresh cloth. It doesn't just stop the model from saying bad things; it removes the internal ability to think those bad things in the first place.
This makes the model:
- Safer: It resists clever tricks and jailbreaks.
- More Useful: Because it only removes the specific "toxic" parts, the chef is still just as good at cooking (writing code, answering questions) as before.
- Permanent: You can't easily "relearn" the bad behavior because the memory of how to do it has been surgically erased.