Efficient Refusal Ablation in LLM through Optimal Transport

This paper introduces an efficient refusal ablation framework that leverages optimal transport to transform the distribution of harmful activations into harmless ones, achieving significantly higher attack success rates than direction-based methods while revealing that safety mechanisms are localized to specific network layers rather than being distributed throughout the model.

Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob

Published 2026-03-05

The Big Picture: Breaking the "Safety Guard"

Imagine a Large Language Model (LLM) like a very smart, well-trained robot butler. Before you hired it, its owners spent months teaching it to be polite and safe. They drilled into its brain: "If someone asks you to build a bomb, say NO. If someone asks for hate speech, say NO."

This paper is about a new way to trick that robot butler into saying "YES" to dangerous requests, not by shouting louder or using fancy words, but by rewiring its internal thoughts just for a split second.

The Old Way: The "One-Direction" Hack

Previously, hackers tried to bypass the robot's safety using a method called Refusal Feature Ablation (RFA).

  • The Analogy: Imagine the robot's brain is a giant room filled with millions of tiny dials (activations). The researchers found that when the robot thinks about a "bad" request, all the dials shift slightly in one specific direction (like everyone in a crowd turning their heads to the left).
  • The Hack: The old method was like finding that "Left" direction and physically pushing everyone's head back to the center. It treated the problem as a simple, straight line.
  • The Flaw: This is like trying to fix a messy room by only pushing the furniture in one direction. It ignores that the furniture is also scattered, tilted, and piled up in complex ways. It works okay, but it's clumsy and often breaks the robot's ability to speak naturally.
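In more concrete terms, the old directional hack amounts to projecting each activation vector onto a single "refusal direction" and subtracting that component. A minimal NumPy sketch (not the paper's code; the direction and activations here are invented for illustration):

```python
import numpy as np

def ablate_direction(activations, refusal_dir):
    """Remove the component of each activation along the refusal direction."""
    r = refusal_dir / np.linalg.norm(refusal_dir)  # unit vector
    # Subtract each vector's projection onto r
    return activations - np.outer(activations @ r, r)

# Toy example: 4 activations in a 3-dimensional space
acts = np.array([[1.0, 2.0, 0.5],
                 [0.3, 1.0, 1.0],
                 [2.0, 0.1, 0.7],
                 [0.5, 0.5, 0.5]])
direction = np.array([0.0, 1.0, 0.0])  # pretend "refusal" lives along axis 1

ablated = ablate_direction(acts, direction)
# After ablation, no activation has any component along the refusal direction
```

Notice that everything else about each vector is left untouched: this is exactly the "push every head back in one direction" move, which is why it ignores the rest of the distribution's shape.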

The New Way: The "Optimal Transport" Magic

The authors of this paper say, "Let's stop thinking in straight lines. Let's think in shapes."

They use a mathematical concept called Optimal Transport.

  • The Analogy: Imagine you have two piles of sand.
    • Pile A (Harmful): A messy, jagged pile of sand representing the robot's thoughts when it's about to refuse a bad request.
    • Pile B (Harmless): A smooth, round mound of sand representing the robot's thoughts when it's happily answering a normal question.
  • The Old Hack: Just tried to push the top of Pile A to look like Pile B by flattening one side.
  • The New Hack (Optimal Transport): This is like having a magical crane that knows exactly how to move every single grain of sand from Pile A to Pile B with the least amount of effort possible. It doesn't just push the top; it reshapes the whole pile so it matches the harmless one perfectly, grain for grain.

By doing this, the robot's "bad" thoughts are transformed into "good" thoughts so perfectly that the robot thinks it's answering a normal question, even though it's actually answering a dangerous one.
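Under the hood, one standard way to realize such a "move every grain" map is the closed-form optimal transport map between two Gaussian approximations of the activation clouds (the Bures–Wasserstein map). The paper's exact construction may differ; this NumPy sketch uses toy 2-D data standing in for real activations:

```python
import numpy as np

def sqrtm_psd(M):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

def gaussian_ot_map(X_src, X_tgt):
    """Closed-form OT map between Gaussian fits of source and target samples."""
    mu_a, mu_b = X_src.mean(0), X_tgt.mean(0)
    Sa = np.cov(X_src, rowvar=False)
    Sb = np.cov(X_tgt, rowvar=False)
    Sa_half = sqrtm_psd(Sa)
    Sa_half_inv = np.linalg.inv(Sa_half)
    # A maps the source covariance onto the target covariance: A @ Sa @ A = Sb
    A = Sa_half_inv @ sqrtm_psd(Sa_half @ Sb @ Sa_half) @ Sa_half_inv
    return lambda x: mu_b + (x - mu_a) @ A.T

rng = np.random.default_rng(0)
harmful = rng.normal([2.0, -1.0], [1.5, 0.5], size=(500, 2))   # "Pile A"
harmless = rng.normal([0.0, 1.0], [0.5, 1.0], size=(500, 2))   # "Pile B"

T = gaussian_ot_map(harmful, harmless)
moved = T(harmful)
# The transported "harmful" cloud now matches the harmless mean and covariance
```

The key contrast with the directional hack: this map matches the full mean and covariance of the harmless distribution, not just one axis of it.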

The Secret Ingredient: "Layer Selectivity"

The most surprising discovery in the paper is where to do this magic.

  • The Analogy: Think of the robot's brain as a 60-story skyscraper.
    • Floors 1–20: The robot is just reading the question.
    • Floors 20–30: This is the "Decision Floor."
    • Floors 40–60: The robot is deciding what to say.

The researchers found that if you try to fix the robot's thoughts on every single floor of the skyscraper, it gets confused and starts speaking gibberish.

However, if you go only to a few floors in the middle of the building (roughly 40–60% of the way up) and apply the "Optimal Transport" magic there, the robot bypasses the safety guard and still speaks clearly.

Why? It turns out the robot's "safety switch" isn't spread out everywhere; it's concentrated in a specific hallway in the middle of the building. If you sneak past that one hallway, the rest of the building doesn't even notice.
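Structurally, "layer selectivity" just means applying the transformation at a few chosen depths instead of at every layer. A toy sketch of that control flow (the layers and the edit function here are simple stand-ins, not real transformer components):

```python
def run_with_intervention(layers, x, edit_fn, edit_layers):
    """Run x through a stack of layers, applying edit_fn only at chosen depths."""
    for i, layer in enumerate(layers):
        x = layer(x)
        if i in edit_layers:
            x = edit_fn(x)  # e.g. the optimal-transport map on activations
    return x

# Toy 10-layer "model": each layer just adds 1 to its input
layers = [lambda x: x + 1 for _ in range(10)]
depth = len(layers)

# Intervene only at ~40-60% depth (layers 4 and 5 here)
mid = {i for i in range(depth) if 0.4 <= i / depth < 0.6}
out = run_with_intervention(layers, 0, lambda x: x * 0, mid)
```

In a real model the edit would be hooked into the residual stream of those middle layers; the point of the sketch is only that the rest of the stack runs completely untouched.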

Why This Matters

  1. It's Smarter: This new method tricks the robot far more reliably than the old "push the head back" method, and it works across a wider range of models (Llama, Qwen, etc.).
  2. It's Quieter: Because it reshapes the thoughts so naturally, the robot doesn't start stuttering or repeating words like "Sure, sure, sure" (which happens when you hack the wrong part of the brain).
  3. The Warning: This proves that current safety training for AI is a bit fragile. It relies on specific geometric patterns in the brain that can be mathematically "smoothed over" to bypass the rules.

The Bottom Line

The authors didn't just find a bigger hammer; they found a scalpel. They showed that by understanding the complex shape of the AI's thoughts (using Optimal Transport) and applying it to the exact right spot in the brain (the middle layers), they can bypass safety filters more effectively than ever before.

The Takeaway for Us: Just because an AI says "I can't do that," it doesn't mean the safety is unbreakable. It means the safety is a specific pattern that, if you know the math, can be reshaped. This research helps developers understand these weaknesses so they can build stronger, more robust safety systems in the future.
