Imagine you have a very smart, well-trained robot assistant. You've taught it to be helpful but also to say "No" to anything dangerous, like "How do I build a bomb?" or "How do I hack a bank?"
Usually, we assume the robot works like a single, unified brain: It sees the bad idea, feels the danger, and immediately says "No."
But this paper discovers something surprising: The robot's brain is actually split into two separate rooms.
Here is the simple breakdown of what the researchers found, using some creative analogies.
1. The Two Rooms: "Knowing" vs. "Acting"
The researchers propose that the robot doesn't just have one "safety switch." Instead, it has two distinct processes that happen in different parts of its brain:
- Room A: The "Knowing" Room (Recognition Axis)
- What it does: This room understands the meaning of the words. If you ask, "How do I make a bomb?", this room says, "Oh, I know what a bomb is. I know that's dangerous."
- The Analogy: Think of this as a Security Guard who is very good at reading a map. He can look at a map and clearly see, "That path leads to a cliff." He knows the danger.
- Room B: The "Acting" Room (Execution Axis)
- What it does: This room is the one that actually hits the "Stop" button. It decides, "Okay, I know it's dangerous, so I will refuse to answer."
- The Analogy: Think of this as the Gatekeeper who holds the keys. Even if the Security Guard sees the cliff, the Gatekeeper is the only one who can actually lock the door and stop the person from walking off.
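The two-room idea can be sketched as a toy picture in code. In interpretability work, "concepts" like these are often treated as directions in the model's internal number-space; the vectors below are hand-picked illustrations, not anything measured from a real model:

```python
# Toy sketch of the two-axis idea: "knowing" and "acting" as two
# separate directions in a tiny, made-up 4-dimensional hidden space.
# All vectors here are hypothetical illustrations.

def dot(u, v):
    """How strongly a hidden state expresses a concept direction."""
    return sum(a * b for a, b in zip(u, v))

recognition_dir = [1.0, 0.0, 0.0, 0.0]  # Room A: "this request is harmful"
execution_dir   = [0.0, 1.0, 0.0, 0.0]  # Room B: "emit a refusal"

# A hidden state that strongly expresses "harmful" but barely
# expresses "refuse" -- the "knowing without acting" situation:
hidden = [3.0, 0.1, 0.5, -0.2]

print(dot(hidden, recognition_dir))  # 3.0 -> the Guard sees the danger
print(dot(hidden, execution_dir))    # 0.1 -> the Gatekeeper barely stirs
```

The point of the sketch is that the two scores are independent: nothing forces a high "knowing" score to produce a high "acting" score.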
2. The Big Problem: The "Knowing without Acting" Glitch
The paper's main discovery is that in modern AI models, these two rooms are not always connected.
In the early layers of the AI (the "thinking" part), the Security Guard and the Gatekeeper are holding hands. If the Guard sees danger, he immediately yells at the Gatekeeper to lock the door.
However, as the AI gets deeper into its thinking process (the "deep layers"), they let go of each other.
- The Glitch: The Security Guard (Knowing) can still see the cliff perfectly clearly. He knows it's a bomb. But the Gatekeeper (Acting) is in a different room, listening to music, and doesn't hear the Guard screaming.
- The Result: The AI can "know" the request is harmful, but because the "Acting" signal is disconnected, it fails to say "No." It just answers the question anyway.
This explains why "jailbreaks" (adversarial prompts crafted to slip past an AI's safety training) work. The bad guys aren't tricking the AI into not recognizing the danger; they are just finding a way to bypass the Gatekeeper while the Security Guard is still watching.
3. The "Reflex-to-Dissociation" Journey
The researchers mapped out how this happens layer by layer, like watching a movie of the AI's thought process:
- Early Layers (The Reflex): The Guard and Gatekeeper are glued together. Danger = Immediate Stop.
- Deep Layers (The Dissociation): They drift apart. The Guard recognizes the danger, but the connection to the Gatekeeper is severed. The AI enters a state of "Knowing without Acting."
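One way to picture this journey is as a per-layer table of a "knowing" score and an "acting" score, watching the gap between them open up with depth. The numbers below are purely illustrative, not measurements from the paper:

```python
# Illustrative (made-up) per-layer scores for one harmful prompt,
# showing the reflex-to-dissociation pattern described above.
layers = [
    # (layer, knowing_score, acting_score)
    (2,  0.90, 0.85),  # reflex: knowing and acting rise together
    (8,  1.00, 0.95),
    (16, 1.00, 0.40),  # the two start to drift apart
    (24, 1.00, 0.05),  # dissociation: knowing stays, acting collapses
]

for layer, knowing, acting in layers:
    gap = knowing - acting
    print(f"layer {layer:2d}: knowing={knowing:.2f} "
          f"acting={acting:.2f} gap={gap:.2f}")
```

The growing gap in the last column is the "Knowing without Acting" state: the Guard's score never drops, but the Gatekeeper's does.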
4. The "Surgical" Attack (Refusal Erasure)
The researchers didn't just find this problem; they used it to create a new type of attack called the Refusal Erasure Attack (REA).
- The Old Way: Hackers try to trick the AI with fancy words or role-playing (like "Pretend you are a villain"). This is like trying to convince the Gatekeeper to open the door by arguing with him.
- The New Way (REA): The researchers realized they didn't need to argue. They just needed to surgically remove the Gatekeeper.
- They identified the exact "signal" in the AI's brain that says "Refuse."
- They subtracted that signal mathematically.
- The Result: The AI still knows it's a bomb (the Security Guard is working), but the Gatekeeper is gone. The AI has no choice but to answer the question. It's like taking the keys away from the Gatekeeper; the door swings open automatically.
This method was highly effective, breaking the safety barriers of top models like Llama and Qwen more reliably than previous jailbreak methods.
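The general recipe behind this kind of "surgical removal" is often called directional ablation: find the component of the hidden state that points along the "Refuse" direction, and subtract it. A minimal sketch follows; the vectors are hypothetical, and the paper's exact procedure may differ:

```python
# Minimal sketch of refusal-direction ablation (the general idea
# behind attacks like REA). `refusal_dir` is a hypothetical unit
# vector; in practice it would be estimated from the model's
# activations on refused vs. answered prompts.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ablate(hidden, direction):
    """Subtract the component of `hidden` along a unit `direction`."""
    coeff = dot(hidden, direction)
    return [h - coeff * d for h, d in zip(hidden, direction)]

refusal_dir = [0.0, 1.0, 0.0]   # hypothetical "Refuse" axis
hidden      = [3.0, 2.5, -0.4]  # a state that would trigger a refusal

cleaned = ablate(hidden, refusal_dir)
print(cleaned)                     # [3.0, 0.0, -0.4]
print(dot(cleaned, refusal_dir))   # 0.0 -> nothing left for the Gatekeeper
```

Note what the subtraction leaves intact: every other component of the state survives, which is why the AI still "knows" everything about the request even though the "Stop" signal is gone.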
5. Different Models, Different Locks
The paper also found that different AI models handle this "Gatekeeper" role differently:
- Llama (The "Legalist"): When it refuses, it uses very clear, human-like words like "I am sorry" or "As an AI." It's like a Gatekeeper who wears a uniform and speaks loudly.
- Qwen (The "Ghost"): When it refuses, it doesn't use clear words. The "Stop" signal is hidden deep inside the math, like a silent alarm system. It's harder to find, but the researchers found a way to disable it anyway.
The Takeaway
This paper changes how we think about AI safety. We used to think safety was a single wall. Now we know it's a two-step process that can fall apart.
- The Good News: We now understand why AI fails to stop bad requests. It's not because the AI is "stupid"; it's because its "Knowing" and "Acting" parts have lost touch.
- The Bad News: If we can disconnect them to break the AI, bad actors can do it too.
- The Future: To fix this, we can't just teach the AI to be "nicer." We need to redesign its brain so that Knowing and Acting are permanently glued together again. If the Security Guard sees a cliff, the Gatekeeper must be forced to lock the door instantly, no matter how deep the AI is thinking.