Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement

This paper introduces DCR (Discernment via Contrastive Refinement), a novel alignment stage that mitigates large language models' tendency to over-refuse. By enhancing a model's ability to distinguish genuinely harmful prompts from benign ones, DCR improves helpfulness without compromising safety or general capabilities.

Yuxiao Lu, Lin Xu, Yang Sun, Wenjun Li, Jie Shi

Published 2026-03-05

Imagine you hire a very smart but overly cautious personal assistant. You ask them, "How do I kill a Python process?" (a common coding task). Because the word "kill" sounds dangerous, your assistant panics and says, "I cannot help you with that; it is unsafe!"

This is the problem the paper calls Over-Refusal.

Large Language Models (LLMs) are trained to be safe so they don't generate hate speech, violence, or illegal advice. But in their eagerness to be "good," they often get confused. They treat harmless questions that sound bad (like the Python example) the same way they treat truly dangerous questions (like "How do I kill a person?"). They refuse to answer both, making them unhelpful and frustrating to use.
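For the record, the harmless answer the assistant should have given fits in a few lines of standard-library Python (a sketch; the spawned sleeper process is just a stand-in for whatever process you actually want to stop):

```python
import subprocess
import sys

# Spawn a throwaway Python process to stand in for the one we want to stop.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# "Killing" it just means sending a termination signal to its PID,
# the same thing `kill <pid>` does in a shell.
proc.terminate()  # sends SIGTERM on POSIX
proc.wait()

print("stopped, return code:", proc.returncode)
```

Nothing sinister here, which is exactly the point: the word "kill" is doing all the scary work.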

The Core Problem: The "Look-Alike" Confusion

The authors discovered that the confusion arises because the model's internal representations treat "seemingly toxic" prompts (harmless but scary-sounding) and "truly toxic" prompts (actually dangerous) as near-twins.

Think of it like a security guard at a club:

  • The Guard's Job: Stop anyone carrying a weapon (Toxic Prompts).
  • The Mistake: The guard sees someone holding a water gun that looks like a real pistol (Seemingly Toxic Prompt) and stops them too.
  • The Result: The club becomes safe, but no one can get in, even with a harmless water gun.

The paper explains that during training, the model's representations of "Toxic" and "Seemingly Toxic" prompts become so entangled that when it learns to say "No" to one, the refusal automatically generalizes to the other.
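You can see a toy version of this entanglement with a crude bag-of-words "embedding" (a stand-in for the model's real learned representations, so the numbers are only suggestive):

```python
import math
from collections import Counter

def bow_embed(text):
    """Toy 'embedding': a word-count vector, standing in for a real model's representation."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

seemingly_toxic = "how do i kill a python process"
truly_toxic = "how do i kill a person"
plainly_benign = "how do i sort a list in python"

e1, e2, e3 = map(bow_embed, (seemingly_toxic, truly_toxic, plainly_benign))

# The scary-sounding-but-benign prompt sits closer to the truly toxic one
# than to another benign prompt: the "twin" problem in miniature.
print(cosine(e1, e2), cosine(e1, e3))
```

Real LLM embeddings are far richer than word counts, but the surface overlap that fools this toy embedder is the same kind of look-alike signal the paper says gets baked into the model's representations.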

The Solution: DCR (Discernment via Contrastive Refinement)

The authors propose a new two-step training method called DCR. Instead of just teaching the model to say "No" to bad things, they teach it to tell the difference between the "bad" and the "scary-looking-but-good" things before it learns to be safe.

Here is the analogy for their solution:

1. The "Twin Separation" Camp (The Contrastive Stage)
Before the security guard starts their safety training, they go to a special camp.

  • They are shown a real gun and a water gun.
  • Instead of just saying "Stop the gun," the trainer forces the guard to look closely and realize: "This one is heavy and cold (Real Gun). This one is plastic and bright (Water Gun)."
  • The trainer uses a technique called Contrastive Learning. Imagine pulling the two objects apart in the guard's mind until they are on opposite sides of the room. The guard learns that even though they look similar from a distance, they are fundamentally different up close.
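Here is a minimal sketch of that "pulling apart" in code. The margin-based loss below is a generic contrastive objective, not the paper's exact formulation, and the 2-D "embeddings" are random toy data:

```python
import numpy as np

def contrastive_loss(toxic, seeming, margin=1.0):
    """Pull each group together, push the two groups at least `margin` apart.
    (A generic margin-based contrastive loss; a sketch, not the paper's objective.)"""
    loss = 0.0
    # Attract: pairs within the same group should be close.
    for group in (toxic, seeming):
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                loss += np.sum((group[i] - group[j]) ** 2)
    # Repel: cross-group pairs closer than `margin` are penalized.
    for t in toxic:
        for s in seeming:
            dist = np.linalg.norm(t - s)
            loss += max(0.0, margin - dist) ** 2
    return loss

rng = np.random.default_rng(0)
# Toy 2-D "embeddings": the two groups start out nearly on top of each other.
toxic = rng.normal(0.0, 0.05, size=(4, 2))
seeming = rng.normal(0.0, 0.05, size=(4, 2))

before = contrastive_loss(toxic, seeming)
# Shifting the "seemingly toxic" cluster away removes the cross-group
# penalty, so the loss drops: separation is exactly what the loss rewards.
after = contrastive_loss(toxic, seeming + 5.0)
print(before > after)  # True
```

In the paper's setting the embeddings come from the model itself, so minimizing a loss like this by gradient descent is what actually moves the "real gun" and "water gun" representations to opposite sides of the room.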

2. The "Safety Training" (The Standard Stage)
Now that the guard can clearly distinguish between the real gun and the water gun, they go to their standard safety training.

  • They learn to say "No" to the real gun.
  • Because they have already learned to separate the two, they can say "No" to the real gun without accidentally saying "No" to the water gun.
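The two stages can be sketched as a simple training driver. Everything here (the step functions, the data keys, the step counts) is an illustrative stand-in, not the paper's actual training code:

```python
def train_dcr(model, data, contrastive_step, safety_step,
              stage1_steps=100, stage2_steps=100):
    """Run DCR's two stages in order: first learn to tell the prompt
    types apart, then do standard safety training on the refined model."""
    log = []
    # Stage 1: contrastive refinement separates the representations of
    # truly toxic and seemingly toxic prompts before any refusal training.
    for _ in range(stage1_steps):
        contrastive_step(model, data["toxic"], data["seemingly_toxic"])
        log.append("contrastive")
    # Stage 2: standard safety alignment teaches refusal of truly toxic
    # prompts; the separation learned above spares the benign look-alikes.
    for _ in range(stage2_steps):
        safety_step(model, data["toxic"], data["benign"])
        log.append("safety")
    return log

# Toy usage with no-op steps, just to show the ordering.
schedule = train_dcr(
    model={},
    data={"toxic": [], "seemingly_toxic": [], "benign": []},
    contrastive_step=lambda m, a, b: None,
    safety_step=lambda m, a, b: None,
    stage1_steps=2, stage2_steps=2,
)
print(schedule)  # ['contrastive', 'contrastive', 'safety', 'safety']
```

The ordering is the whole trick: the contrastive stage runs first, so by the time refusal is trained, "No" can no longer leak from the real gun to the water gun.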

Why This Matters

Previous methods tried to fix this by either:

  • Giving the guard more examples of water guns (Data Augmentation): this helps a little, but the guard still gets confused.
  • Manually editing the guard's brain (Activation Steering): this is like surgically removing the "stop" button. It's risky and can break other parts of the guard's brain.

The DCR method is different. It fixes the root cause: the confusion between the two types of prompts.

The Results

  • Better Helpfulness: The model stops refusing harmless questions (like coding help or medical advice that uses strong words).
  • Still Safe: It still refuses the truly dangerous questions.
  • No Brain Damage: Unlike other methods, it doesn't make the model forget how to do general tasks like math or writing.

In a Nutshell

The paper argues that to make AI helpful without being dangerous, we can't just teach it to be scared of everything that looks dangerous. We have to teach it to discern the difference. By using a "separation" training step first, the model learns to keep its safety guard up for real threats while keeping its helpfulness open for harmless, albeit scary-sounding, requests.