Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection

Imagine you have a brilliant, well-read librarian (the Large Language Model or LLM) who knows everything in the world. You ask them to write a story about a journey to Japan, but with a very specific rule: "Do not use the letter 'e'."

If you just ask nicely, the librarian might forget the rule halfway through because they are so used to writing normally. This is the problem of Instruction Following.

The Old Way: The "Over-Enthusiastic" Coach

Previous methods tried to fix this by hiring a coach who stands next to the librarian and shouts, "DON'T USE 'E'! DON'T USE 'E'!" constantly.

The Problem: Sometimes the coach shouts too loud. The librarian gets so focused on not using 'e' that they forget how to write a sentence. They might start speaking in gibberish, or they might stop writing the story about Japan entirely and just scream "NO 'E'!" over and over.
The Term: This is called Oversteering. The model is so busy following the rule that it breaks the task.

The New Way: DIRECTER (The "Smart, Adaptive Traffic Controller")

The authors of this paper created a new method called DIRECTER. Think of DIRECTER not as a shouting coach, but as a smart, adaptive traffic controller who guides the librarian step-by-step.

Here is how it works, using three simple concepts:

1. The "Plausibility Check" (The Reality Test)

Every time the librarian is about to write the next word, DIRECTER does a quick mental simulation:

Step A: It asks, "What would the librarian write if I didn't interfere?" (The Raw Plan).
Step B: It asks, "What would they write if I did force them to follow the rule?" (The Steered Plan).

Then, it compares the two.

Scenario 1 (Good): The librarian was going to write "cat," and the rule forces them to write "feline." Both make sense. DIRECTER says, "Go ahead, write 'feline'!"
Scenario 2 (Bad): The librarian was going to write "The sun is bright," but the rule forces them to write "The sun is brrr." That sounds weird and breaks the story. DIRECTER says, "Wait, that sounds wrong. Let's ignore the rule for this specific word and just write 'bright'."

This prevents the model from going off the rails. It only applies the rule when it makes sense.

2. The "Dimmer Switch" (Dynamic Strength)

Old methods used a "light switch"—either the rule was ON (100% force) or OFF. DIRECTER uses a dimmer switch.

If the librarian starts to struggle, DIRECTER doesn't just turn the rule off completely. It slowly turns the "force" down.

If the librarian is confident, DIRECTER turns the force up high.
If the librarian starts to stumble, DIRECTER turns the force down slightly.
If they are about to make a mistake, DIRECTER turns the force all the way down and lets the librarian write naturally.

This happens dynamically for every single word, not just once at the beginning.

3. The "Layer Ranking" (Finding the Right Levers)

Inside the librarian's brain (the computer model), there are many different "layers" of thinking. Some layers handle grammar, some handle facts, and some handle style.

DIRECTER does a quick, one-time test at the start to figure out: "Which specific layers of the brain are most sensitive to this specific rule?"

It creates a ranked list of these layers.
When it needs to apply the rule, it starts with the most sensitive layers.
If that's too much, it drops the least important layers from the list.

This is like a mechanic who knows exactly which screw to turn to fix a car engine, rather than randomly hitting the dashboard.

Why Is This a Big Deal?

No Extra Training: You don't need to re-teach the librarian. You just give them this new "traffic controller" tool.
Better Quality: Because it checks for "plausibility," the stories don't sound broken or robotic. They sound natural but still follow the rules.
Works Everywhere: It works on math problems, creative writing, and strict formatting rules (like "no commas").

The Bottom Line

DIRECTER is like having a smart co-pilot for an AI. Instead of forcing the AI to follow rules blindly (which causes crashes), the co-pilot gently nudges the AI toward the rule, checks if the nudge makes sense, and backs off if it doesn't. The result is an AI that follows your rules more reliably without losing its mind.

Here is a detailed technical summary of the paper "Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection" (DIRECTER).

1. Problem Statement

Large Language Models (LLMs), despite extensive instruction tuning, frequently struggle to follow complex or strict user instructions (e.g., "do not use commas," "output in JSON").

Current Solutions: Activation steering techniques (e.g., PASTA, SpotLight) attempt to fix this by manipulating internal model states (attention heads or KV caches) during inference.
The Core Limitation: These methods often suffer from oversteering. Excessive emphasis on instructions degrades task accuracy (the model ignores the core task to follow the constraint) and reduces text quality (fluency, coherence).
Static Configurations: Existing methods rely on fixed hyperparameters (e.g., a static scaling factor or a fixed set of layers) determined via manual search. They fail to adapt to the dynamic nature of text generation, where the optimal degree of steering changes at every decoding step.

2. Methodology: DIRECTER

The authors propose DIRECTER (Dynamic Rejection Steering), a novel inference-time method that dynamically modulates steering strength to balance instruction following with task fidelity.

A. Core Mechanism: Plausibility-Guided Decoding Loop

Instead of blindly applying a fixed steering strength, DIRECTER employs a "try-and-check" loop at every decoding step:

Raw Forward Pass: The model generates the standard probability distribution ( $p_t$ ).
Steered Forward Pass: The model applies KV cache scaling to a candidate set of layers to produce a steered distribution ( $\tilde{p}_t$ ).
Plausibility Check: The system checks if the top token of the steered distribution ( $\tilde{i}^*_t$ ) is "plausible" according to the original distribution.
- Condition: The intervention is accepted only if $p_{t, \tilde{i}^*_t} \geq \beta \cdot p_{t, i^*_t}$ , where $i^*_t$ is the top token of the raw distribution and $\beta$ is a plausibility threshold (e.g., 0.5).
- Rejection: If the condition fails (the steered token is too unlikely in the original context), the steering is deemed "oversteering."
Dynamic Adjustment: If rejected, the system progressively reduces the steering strength by halving the number of active steered layers (removing the least sensitive ones) and re-checking. This continues until a plausible steered token is found or the set is empty (falling back to the raw prediction).
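The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `directer_step`, `steered_dist_fn`, and the toy distributions are hypothetical names and values introduced here, and the real method obtains the steered distribution from a second forward pass with KV cache scaling rather than from a lookup function.

```python
def argmax(dist):
    """Index of the largest probability in a list."""
    return max(range(len(dist)), key=lambda i: dist[i])

def plausible(p_raw, p_steered, beta):
    """Accept the steered top token only if its raw probability is at
    least beta times the raw probability of the raw top token."""
    return p_raw[argmax(p_steered)] >= beta * p_raw[argmax(p_raw)]

def directer_step(p_raw, steered_dist_fn, ranked_layers, beta=0.5):
    """Choose the next token for one decoding step.

    p_raw: raw next-token distribution (list of floats).
    steered_dist_fn: hypothetical callable, layers -> steered distribution.
    ranked_layers: layer indices, most sensitive first.
    Returns (token_id, layers_actually_steered).
    """
    layers = list(ranked_layers)
    while layers:
        p_steered = steered_dist_fn(layers)
        if plausible(p_raw, p_steered, beta):
            return argmax(p_steered), layers
        # Rejected as oversteering: keep only the more sensitive half.
        layers = layers[: len(layers) // 2]
    # Steering set is empty: fall back to the raw prediction.
    return argmax(p_raw), []

# Toy example (made-up numbers): steering 3+ layers pushes an implausible token.
p_raw = [0.5, 0.4, 0.08, 0.02]

def toy_steered(layers):
    if len(layers) >= 3:
        return [0.1, 0.2, 0.1, 0.6]    # top token 3, raw probability 0.02
    return [0.3, 0.6, 0.05, 0.05]      # top token 1, raw probability 0.4

print(directer_step(p_raw, toy_steered, [7, 3, 12, 5]))  # (1, [7, 3])
```

In the toy run, the full four-layer set is rejected because token 3 has raw probability 0.02, below the 0.25 cutoff. The halved set [7, 3] proposes token 1, which passes.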

B. Layer Ranking via Attention Sensitivity

To enable fine-grained control, DIRECTER does not steer all layers equally. It performs a one-time attention sensitivity analysis after the prompt prefill:

Metric: It measures the "disturbance score" ( $D_j(\ell)$ ) caused by steering a single layer $\ell$ . This score captures both the direct effect on the layer's attention output and the propagated effect on subsequent layers.
Ranking: Layers are ranked by their sensitivity. During generation, the system steers the top-ranked layers first. If oversteering occurs, it removes the bottom-ranked (least sensitive) layers from the steering set, ensuring the most impactful layers are retained as long as possible.
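To make the ranking and the shrinking steering set concrete, here is a small sketch. It assumes the disturbance scores have already been reduced to one number per layer; how $D_j(\ell)$ is computed and aggregated is not reproduced here, and the function names and score values are illustrative only.

```python
def rank_layers(disturbance):
    """Return layer indices sorted from most to least sensitive.

    disturbance: one (assumed pre-aggregated) score per layer.
    """
    return sorted(range(len(disturbance)),
                  key=lambda layer: disturbance[layer],
                  reverse=True)

def halving_schedule(ranked):
    """List the steering sets tried after successive rejections.

    Each rejection drops the less sensitive half, so the most
    sensitive layers are the last to be removed.
    """
    sets = []
    layers = list(ranked)
    while layers:
        sets.append(layers)
        layers = layers[: len(layers) // 2]
    return sets

scores = [0.1, 0.9, 0.4, 0.7, 0.2, 0.05]   # made-up scores for 6 layers
ranked = rank_layers(scores)
print(ranked)                    # [1, 3, 2, 4, 0, 5]
print(halving_schedule(ranked))  # [[1, 3, 2, 4, 0, 5], [1, 3, 2], [1]]
```

Because the set is halved each time, a model with N ranked layers needs at most about log2(N) + 1 steered attempts per token before falling back to the raw prediction.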

C. Efficiency Optimization (Gating Mechanism)

To avoid the computational cost of running multiple forward passes for every token, DIRECTER includes a gating mechanism:

If the probability of the second-best token in the raw distribution, $i^{**}_t$ , is very low ( $p_{t, i^{**}_t} < \beta \cdot p_{t, i^*_t}$ ), then every token other than the raw top token fails the plausibility condition, so a steered distribution can pass the check only by selecting the same top token as the raw one.
In such cases, the steering attempt is skipped entirely, and the raw prediction is used, significantly reducing latency.
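The gate follows directly from the plausibility condition: if even the runner-up token falls below $\beta \cdot p_{t, i^*_t}$ , so does every lower-ranked token, and steering cannot change the accepted output. A minimal sketch of the check, with `can_skip_steering` as a hypothetical helper name and made-up probabilities:

```python
def can_skip_steering(p_raw, beta=0.5):
    """Return True when no token other than the raw top token could
    pass the plausibility check, so the steered pass can be skipped.

    p_raw: raw next-token distribution with at least two entries.
    """
    second, top = sorted(p_raw)[-2:]   # two largest probabilities, ascending
    return second < beta * top

# Confident step: runner-up 0.05 is below 0.5 * 0.9, so steering is skipped.
print(can_skip_steering([0.9, 0.05, 0.03, 0.02]))  # True

# Uncertain step: runner-up 0.4 is above 0.5 * 0.5, so steering is attempted.
print(can_skip_steering([0.5, 0.4, 0.08, 0.02]))   # False
```

The check uses only the raw distribution, which is computed anyway, so it adds no extra forward pass.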

3. Key Contributions

Dynamic Rejection Steering: A new paradigm that couples activation steering with a plausibility-guided decoding loop, automatically modulating steering strength at every step to prevent oversteering.
Attention Sensitivity Ranking: A lightweight, one-time analysis method to rank layers by their influence on model representations, enabling principled reduction of steering strength without re-training.
KV Cache Scaling: The method operates by scaling Key vectors in the KV cache, which is computationally efficient and compatible with standard optimizations like FlashAttention.
General Mechanism: The plausibility filter is shown to be a modular "safety gate" that can improve other static steering methods (like PASTA and SpotLight) by mitigating their oversteering issues.

4. Experimental Results

The authors evaluated DIRECTER on diverse benchmarks including IFEval (strict instruction following), LIFBench (long-context), and GSM8K-Format (reasoning with formatting constraints).

Performance Gains: DIRECTER improved average accuracy by 6.5% over zero-shot baselines and outperformed prior steering methods (PASTA, SpotLight) by approximately 4%.
Task Fidelity vs. Quality: Unlike other methods that sacrifice task correctness for instruction adherence, DIRECTER achieved the highest task fidelity (~92%) while maintaining text quality scores comparable to non-intervention baselines.
Generalization: It demonstrated robust performance across different model families (Llama-3, Qwen-2.5) and scales (1B to 14B parameters).
Efficiency:
- Throughput reduction is modest (~16% lower than zero-shot).
- It is 2x faster than SpotLight (which doubles softmax operations).
- Memory overhead is negligible.
Robustness: The method is highly robust to hyperparameter choices (scaling factor $\alpha$ and threshold $\beta$ ) and prompt variations.

5. Significance

Solving the Oversteering Trade-off: DIRECTER addresses the fundamental limitation of static activation steering by introducing a self-correcting, dynamic control loop. The results indicate that instruction following can be enhanced without degrading the model's core reasoning or text generation capabilities.
Mechanistic Interpretability: By utilizing attention sensitivity to rank layers, the method provides a more principled, data-free approach to controlling LLM internals compared to manual hyperparameter tuning.
Practical Deployment: With minimal memory overhead and compatibility with existing inference optimizations (FlashAttention), DIRECTER offers a practical, plug-and-play solution for improving the reliability and controllability of LLMs in real-world applications.

In conclusion, DIRECTER moves inference-time intervention from static, "set-and-forget" configurations to adaptive, step-wise control that respects the model's natural generation dynamics.