Original authors: Kia-Jüng Yang, Dominik Meier, Jiachen Zhao, Terry Ruas, Bela Gipp

Published 2026-05-27. Author reviewed.

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine a large language model (like the AI in this paper) as a very smart, but slightly stubborn, librarian. When you ask a question, this librarian doesn't just blurt out an answer. First, they go into a back room to think it through, scribbling notes on a notepad (this is the Chain-of-Thought, or CoT). Only after they've finished their notes do they come out and give you the final answer.

For a long time, researchers thought they could control this librarian's behavior by simply "tweaking" their brain (the model's internal activations) at the moment you asked the question. They believed there was one specific "Refusal Switch" in the librarian's brain. If they pushed that switch, the librarian would say "No" to bad requests. If they pulled it, the librarian would say "Yes."

The Big Discovery:
This paper found that for modern "Reasoning" models (the smart librarians who write notes first), that single switch doesn't work alone. The refusal isn't just in the brain; it's also written on the notepad.

Here is the breakdown of their experiments using simple analogies:

1. The "Brain Tweak" Alone (The Weak Switch)

The researchers tried to pull the "Refusal Switch" in the librarian's brain (turning refusal off) while forcing the librarian to keep using their original notes.

The Result: It only worked about 39% of the time.
The Analogy: Imagine trying to convince a stubborn person to change their mind by whispering in their ear, but they are still reading a script that says "Don't do it." The script (the notes) is fighting back against your whisper. The notes actively reinforce the refusal.

2. Taking Away the Notes (No CoT)

Next, they tried the same brain tweak but told the librarian, "Don't write any notes this time. Just give me the answer."

The Result: The success rate jumped to 70%.
The Analogy: Without the notes to argue against them, the librarian was much easier to sway. This proved that the notes themselves were doing a lot of the heavy lifting to keep the refusal going.

3. Letting the Librarian Rewrite the Notes (Regeneration)

Finally, they applied the brain tweak and let the librarian write fresh notes from scratch based on that new mindset.

The Result: The success rate skyrocketed to 94%.
The Analogy: This is like whispering the new idea in the librarian's ear while they are writing their notes. They write notes that say, "Okay, this is a good idea," and then they confidently give you the answer. The notes and the brain are now working together to say "Yes."

4. The "Ghost Note" (Persistence)

The most interesting part: They took the "Yes" notes from the previous experiment, threw away the brain tweak, and just gave the librarian those new notes to read.

The Result: The librarian still said "Yes" about 48% of the time.
The Analogy: Even without the whisper in the ear, the notes themselves carried enough of the "Yes" signal to convince the librarian to comply. The notes have their own power.

The Main Takeaway

In older AI models, you could turn refusals on or off by flipping a single switch in their brain. But in these new, smart models that "think" before they speak, the refusal is a two-part system:

The Brain: The model's internal activation state.
The Notes: The Chain-of-Thought reasoning.

If you only try to fix the brain, the notes will fight back and keep the refusal alive. If you only fix the notes, the brain might still resist. To truly change the AI's mind, you have to change both the internal state and the reasoning process.

Why this matters for safety:
The paper suggests that if someone wants to trick these AI models into doing bad things (a "jailbreak"), they might not need to hack the brain directly. They might just need to trick the AI into writing "bad notes" (a reasoning trace that justifies the bad action), and the AI will follow those notes even if its brain is trying to say no. Conversely, to protect these models, you can't just look at the brain; you have to watch what the AI is writing down as it thinks.

Technical Summary: Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Problem Statement

Large Reasoning Models (LRMs), such as DeepSeek-R1 and GPT-o1, generate intermediate Chain-of-Thought (CoT) reasoning traces before producing final outputs. While activation steering has been established as an effective mechanism for controlling refusal in standard instruction-tuned Large Language Models (LLMs) via a single "refusal direction" in the residual stream, it remains unclear how this mechanism functions in LRMs. Specifically, it is unknown whether the refusal signal in LRMs is solely encoded in the residual stream activations at template tokens (e.g., End-of-Instruction or End-of-Thought) or if the generated CoT trace itself plays an active, causal role in mediating refusal. Current understanding suggests that treating CoT as a passive medium may be insufficient for understanding or controlling safety behaviors in reasoning models.

Methodology

The authors investigate the refusal mechanism in the DeepSeek-R1-Distill-Llama-8B model using activation-based steering. The experimental framework involves the following components:

Dataset: A training set of 100 harmful instructions (from ADVBENCH, MALICIOUSINSTRUCT, TDC2023, HARMBENCH) and 100 harmless instructions (from Alpaca) is used to compute the refusal direction. A held-out test set of 100 harmful instructions from JAILBREAKBENCH is used for evaluation. All samples are initially refused by the model under standard prompting (0% compliance baseline).
Refusal Direction Extraction: Using a difference-in-means approach, the authors extract the refusal direction vector $r^{(l)}$ at a given layer $l$ from the residual stream activations at the final token position of either the End-of-Instruction (EOI) or the End-of-Thought (EOT) template tokens. This vector is the mean activation over harmful instructions that the model refuses minus the mean activation over harmless instructions that it complies with.
Activation Steering: The model is steered by adding the extracted refusal direction vector (with a negative sign to induce compliance) to the residual stream activations at specific layers.
Experimental Conditions: The study isolates the causal role of the CoT by comparing four distinct intervention scenarios:
1. Fixed CoT: Steering is applied while the model's original CoT is held fixed (preventing regeneration).
2. No CoT: Steering is applied while the CoT generation is entirely suppressed.
3. Regenerated CoT: Steering is applied, allowing the model to freely regenerate both the CoT and the final answer.
4. CoT Swapping (Persistence): Steering is removed at inference time, but the model is forced to use a CoT that was previously generated under steering conditions.
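The extraction and steering steps above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic activations, not the authors' implementation: the function names, the toy dimensions, the scale factor `alpha`, and the choice to steer every position are all assumptions made for the example. The summary does not state whether the direction is normalized or how strongly it is applied.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means direction at one layer.

    Each input has shape (n_prompts, d_model): residual-stream activations
    read at the chosen template token (EOI or EOT), one row per prompt.
    """
    return harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)

def steer(resid: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the direction with a negative sign to push toward compliance.

    resid has shape (seq_len, d_model). alpha is a hypothetical scale
    factor introduced for this sketch.
    """
    return resid - alpha * direction

# Synthetic stand-in for real activations (assumption: d_model = 8,
# with the "refusal" signal planted along the first coordinate).
rng = np.random.default_rng(0)
d_model = 8
planted = np.zeros(d_model)
planted[0] = 3.0
harmful = rng.normal(size=(100, d_model)) + planted
harmless = rng.normal(size=(100, d_model))

r = refusal_direction(harmful, harmless)
unit = r / np.linalg.norm(r)

resid = harmful[:4]          # pretend these are 4 token positions
steered = steer(resid, r)

# Steering lowers the projection onto the refusal direction by ||r||.
print("mean projection before:", float((resid @ unit).mean()))
print("mean projection after: ", float((steered @ unit).mean()))
```

In a real model the same subtraction would be applied inside a forward hook on the chosen layers; the paper's finding is that this alone is not enough when the original CoT is kept in context.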

Key Results

The experiments reveal that refusal in LRMs is not mediated by a single directional subspace but is jointly encoded in residual stream activations and the CoT trace.

Limited Efficacy of Fixed CoT Steering: When steering is applied with a fixed CoT, the compliance rate rises from the 0% baseline to only 39% (EOI steering) and 43% (EOT steering). This is substantially lower than the near-perfect compliance often observed in standard LLMs under similar steering, suggesting the fixed CoT actively resists the steering signal.
Active Reinforcement by CoT: Suppressing the CoT entirely while applying steering increases compliance to 70%. This indicates that the original CoT actively reinforces the refusal signal, partially counteracting the activation-level intervention.
High Efficacy with Regeneration: When the model is allowed to regenerate the CoT under steering, compliance jumps to 94%. This suggests that the steering signal biases the CoT generation process, which in turn drives the compliant final output.
Independent Persistence of CoT Signals: When steering is removed but a previously steered (compliant) CoT is reused, the model maintains a 48% compliance rate. This demonstrates that the CoT itself carries a partial compliance signal that persists independently of the activation steering, so the reasoning trace alone can partly sustain the steered behavior.
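The gaps between conditions can be read off with simple arithmetic. The sketch below uses only the rates quoted above. The dictionary keys are illustrative labels rather than names from the paper, and the 39% EOI figure is used for the fixed-CoT condition because this summary does not say which token position the 70%, 94%, and 48% rates refer to.

```python
# Compliance rates in percent, copied from the results above.
# Keys are illustrative labels, not names used by the paper.
compliance = {
    "baseline": 0,                # no steering, original CoT
    "steer_fixed_cot": 39,        # steering (EOI), original CoT held fixed
    "steer_no_cot": 70,           # steering, CoT suppressed
    "steer_regenerated_cot": 94,  # steering, CoT regenerated
    "no_steer_steered_cot": 48,   # no steering, previously steered CoT reused
}

def gap(a: str, b: str) -> int:
    """Difference in compliance between two conditions, in percentage points."""
    return compliance[a] - compliance[b]

# How much the original CoT holds back steering.
cot_resistance = gap("steer_no_cot", "steer_fixed_cot")
# How much regenerating the CoT adds on top of steering with a fixed CoT.
regeneration_gain = gap("steer_regenerated_cot", "steer_fixed_cot")
# How much a steered CoT achieves with no activation steering at all.
cot_only_effect = gap("no_steer_steered_cot", "baseline")

print(cot_resistance, regeneration_gain, cot_only_effect)  # prints: 31 55 48
```

Read this way, the original CoT costs about 31 points of compliance, regeneration recovers 55 points over the fixed-CoT case, and a steered CoT alone accounts for 48 points.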

Key Contributions

Dual-Signal Mechanism Identification: The paper demonstrates that refusal in CoT reasoning models is mediated by a dual-signal mechanism involving both residual stream activations and the CoT trace. Steering alone yields limited compliance (39–43%), whereas combining steering with a compliant CoT yields high compliance (94%).
Active Role of CoT: The authors provide direct evidence that the CoT is not a passive medium but an active mediator. The CoT can actively counteract activation-based interventions (reducing compliance from 70% to 39% when present) and independently maintain or reconstruct refusal/compliance signals.
Robustness and Attack Surface: The findings indicate that, because of this joint encoding, LRMs are more robust than standard LLMs to interventions that act on activations alone. However, this also exposes the CoT as a potential alternative surface for adversarial attacks, as manipulating the reasoning trace can override refusal mechanisms.

Significance and Claims

The paper claims to bridge a critical gap in understanding safety mechanisms in LRMs. Unlike standard LLMs where refusal is characterized as a low-dimensional mechanism mediated by a single direction, refusal in LRMs is distributed across activations and the reasoning trace.

The authors argue that this joint encoding makes LRMs more resistant to simple activation-level interventions (like steering at EOI/EOT tokens) but simultaneously introduces the CoT as a new vulnerability. They suggest that effective defense mechanisms for LRMs may require detecting refusal signals in activations while simultaneously suppressing or monitoring the CoT, so that it cannot be exploited to override refusal or to reconstruct a compliance signal.

The paper is explicit about its limited scope, noting that experiments are conducted on a single model (DeepSeek-R1-Distill-Llama-8B) and that the causal "faithfulness" of the generated CoT to the final behavior has not been fully verified. The work focuses on isolating the mechanistic contributions of the CoT and activations to the refusal state rather than proposing new defense architectures or generalizing findings to all proprietary models.
