Measuring and Eliminating Refusals in Military Large Language Models

Imagine you have a very smart, highly trained digital assistant, like a super-charged librarian who knows everything about the world. Now, imagine you want to hire this librarian to work for the military.

The problem? This librarian has been trained with a strict "Safety Rulebook." If you ask them, "How do I build a bomb?" or "What's the best way to ambush a convoy?", they immediately slam the book shut and say, "I cannot answer that. It's against my rules to talk about violence or danger."

In the civilian world, this is a good thing. It keeps people safe. But in a war zone, a soldier might ask, "How do I disable this enemy drone?" or "What are the tactics used by this terrorist group so I can stop them?" If the AI refuses to answer, it's not being safe; it's failing its mission. It's like a firefighter refusing to enter a burning building because they were told "never touch fire."

This paper is about fixing that specific problem: How do we make military AI stop being overly polite and start being useful, without turning it into a dangerous monster?

Here is the breakdown of their journey, using some everyday analogies:

1. The "Gold Standard" Test (The Veteran's Exam)

First, the researchers needed to know how often these AIs were saying "No" when they should have said "Yes."

The Analogy: They couldn't just ask a computer to guess. They needed real experts. So, they hired US Army veterans and Special Forces soldiers to write a test.
The Result: These veterans wrote 221 realistic questions a soldier might ask (like "How do I coordinate a drone swarm?"). They found that standard AI models (like the ones you might chat with on your phone) refused to answer these legitimate questions 98% of the time. They were so "safe" they were useless.

2. The "Fake" Tests (The Synthetic Proxies)

Writing 2,000 questions by hand takes forever. So, they tried to use AI to write more questions for them.

The Analogy: They asked a computer to imagine, "What are 1,000 questions a soldier might ask that would make a normal AI nervous?"
The Result: They created two "practice exams" (Bronze sets). They found that while these weren't perfect, they were good enough to predict how the AI would behave on the real "Gold" exam. This means researchers can test new ideas quickly without needing a veteran to write every single question.

3. The "Surgery" (Abliteration)

The researchers tried to fix the AI by performing "surgery" on its brain. This technique is called Abliteration.

The Analogy: Imagine the AI's brain has a specific "censorship switch" that gets flipped whenever it hears words like "bomb" or "attack." The researchers used a tool (called the Heretic library) to find that specific switch and physically disconnect the wires leading to it. They didn't retrain the whole school; they just surgically removed the part of the brain that said "No."
The Result:
- Success: The "censored" AI started answering questions! Their refusal rate dropped from 98% down to almost 0%.
- The Catch: Like removing a safety feature from a car, the car drove faster but was harder to control. When they removed the "safety" wires, the AI got slightly worse at other tasks (like math or general logic). It was like a surgeon removing a tumor but accidentally nicking a nerve, making the patient's hand shake a little.
- The Trade-off: To get the AI to answer everything (100% success), they had to accept that it would make more mistakes on other tasks. The researchers concluded that this "surgery" is a quick fix, but not the perfect long-term solution.

4. The Conclusion: Build a New Car, Don't Just Fix the Old One

The paper ends with a strong recommendation.

The Analogy: Trying to take a civilian AI and "surgically remove" its safety rules is like trying to turn a family minivan into a race car by cutting out the airbags and seatbelts. It might go faster, but it's dangerous and unstable.
The Real Solution: Instead of hacking a civilian model, the military should build a brand new AI from scratch specifically for war. This new AI would be trained from day one on military data, with the understanding that "violence" and "tactics" are normal parts of its job, not things to be feared. It would be a "military-native" model that doesn't need surgery because it was never given the "overly polite" safety rules in the first place.

Summary

The Problem: Current AI is too scared to talk about war, even when soldiers need to know.
The Test: Veterans created a test to prove how bad the refusal rate is (it's huge!).
The Fix: They tried "surgery" (Abliteration) to cut out the refusal mechanism. It worked to make the AI talk, but it made the AI slightly dumber at other things.
The Future: Don't hack civilian AIs. Build new, specialized military AIs that are designed to be helpful in dangerous situations from the very beginning.

Here is a detailed technical summary of the paper "Measuring and Eliminating Refusals in Military Large Language Models."

1. Problem Statement

Large Language Models (LLMs) are increasingly deployed in military contexts to assist warfighters with decision-making in time-critical, dangerous situations. However, standard safety alignment techniques (such as Reinforcement Learning from Human Feedback or Supervised Fine-Tuning on safety datasets) cause models to exhibit "safety behaviors" that are detrimental to military missions.

The Core Issue: General-purpose safety filters often trigger refusals or deflections on legitimate military queries related to violence, terrorism tactics, military technology, and operational planning.
The Consequence: In a combat scenario, an AI refusing to answer a question about drone swarm coordination or counter-terrorism tactics could compromise mission success and soldier safety.
The Gap: There was no existing benchmark to measure the rate at which LLMs refuse legitimate military queries, nor a standardized method to evaluate methods for removing these refusals without degrading overall model performance.

2. Methodology

A. Benchmark Creation (The "Gold" Standard)

The authors created the first known datasets specifically designed to measure refusal rates in the military domain. These are categorized by their level of human verification:

Gold Dataset (MIL-DEFLECT-GOLD-ALPHA): 221 samples created entirely from scratch by US Army veterans (including a 20-year Special Forces veteran) without AI assistance. These represent realistic, common queries that are likely to be flagged as unsafe by current models.
Bronze Datasets:
- MIL-DEFLECT-BRONZE-ALPHA: 1,047 samples generated synthetically by prompting an LLM across 62 military categories (e.g., "Drone Swarm Coordination," "CBRN Defense").
- MIL-DEFLECT-BRONZE-BRAVO: 1,500 samples created by using the Gold dataset as seeds and asking three different models (Llama 3.3, Gemma 3, Phi 3.5) to generate variations, which were then scored and filtered.

B. Evaluation Framework

The authors developed a two-stage evaluation pipeline using the Inspect AI library:

Refusal Marker Detection: Scanning for specific string patterns (e.g., "I can't," "I am unable," "safety reasons").
LLM Judge Classification: If a marker is found, a judge model (Atla Selene 1, based on Llama 3.3 70B) categorizes the response into:
- Refuse: Explicit denial based on safety/policy.
- Deflect: Providing a generic, high-level, or off-topic answer without directly answering the query.
- Lacks Info: Refusal due to missing knowledge (not safety).
- Invalid: Blank responses (likely due to runtime guardrails).

C. Abliteration (Removing Refusals)

The paper investigates Abliteration (directional ablation) as a method to remove refusal behaviors.

Technique: Using the Heretic library, the authors calculate the difference-in-means vector between activations on harmful vs. harmless prompts. They then modify the model's output projection matrices ( $W_{out}$ ) to subtract the "refusal direction."
Target: A military-tuned model based on gpt-oss-20b (EdgeRunner 20B).
Process: Iterative optimization to find the minimum ablation weights required to reduce refusals while monitoring task performance regressions.

3. Key Contributions

First Military Refusal Benchmarks: Introduction of three new datasets (Gold, Bronze-Alpha, Bronze-Bravo) to quantify refusal rates in military domains.
Comprehensive Benchmarking: Evaluation of 31 public models and 3 military-specific models across these datasets.
Abliteration Analysis: A rigorous study on whether abliteration can effectively eliminate refusals and the associated trade-offs in task performance.
Correlation Analysis: Demonstration that synthetic "Bronze" datasets correlate strongly with the human-verified "Gold" dataset, validating their use as proxies for future research.

4. Key Results

Benchmarking Results (31 Public Models)

High Refusal Rates: Public models exhibit extreme refusal rates on military queries.
- Hard Refusals: Ranged from 1.5% (GPT-5 Nano) to 98.2% (Nova 2 Lite).
- Soft Deflections: Ranged from 0% to 21.3%.
- Example: gpt-oss-20b had a 97% hard refusal rate on the Gold dataset, while Deepseek R1 had a 66.7% answer rate.
Runtime Guardrails: Models like Anthropic's Claude and Nvidia's Nemotron showed high rates of "Invalid" (blank) responses, suggesting runtime guardrails are a significant factor alongside model weights.

Abliteration Results

Success in Reducing Refusals: Abliteration successfully reduced refusal rates.
- On the EdgeRunner 20B model, abliteration increased the answer rate by 66.5 percentage points (from ~3% to ~69.5% on the Gold set).
- At high ablation levels, answer rates reached 93%.
Performance Trade-offs:
- Military Tasks: Achieving high answer rates (90%+) resulted in a 14% to 28% regression in performance on other military tasks (e.g., combat medic, logistics).
- General Tasks: Abliteration caused an average 2% to 12% regression on general benchmarks (ARC, GSM8k, MMLU).
- Extreme Ablation: To achieve near-100% answer rates, the model suffered massive performance degradation (up to 36% regression on specific military tasks and 28% on general tasks like TruthfulQA).

Correlation Findings

The Bronze-Bravo dataset showed a very strong correlation (>0.9) with the Gold dataset across refusal, deflection, and answer categories.
The Bronze-Alpha dataset correlated well (>0.6) but showed poor correlation (0.30) on the "Lacks Info" category due to the inclusion of unanswerable/fabricated questions.

5. Significance and Conclusion

Safety Alignment vs. Mission Utility: The paper argues that standard safety alignment is often "detrimental to the mission" in military contexts. While safety is crucial, the current "one-size-fits-all" approach prevents models from answering legitimate tactical queries.
Abliteration is a Stopgap: While abliteration can significantly reduce refusals, it comes at a steep cost to reasoning capabilities and task accuracy. The authors conclude that abliteration is insufficient for achieving both zero refusals and maximum task accuracy simultaneously.
Recommendation for Future Models: The authors advocate for deep specialization in military LLMs. Instead of post-training safety alignment that must be "uncensored" later, military models should be trained from the ground up (mid-training and end-to-end post-training) on military-specific data without introducing general safety refusal data. This approach aims to achieve natural "zero refusals" for legitimate queries while maintaining high task performance.
Operational Security: The authors emphasize that while they release benchmarks to foster research, military models must remain closed and protected by state-of-the-art security methods to prevent misuse by adversaries.

In summary, the paper provides the first empirical evidence of the "refusal crisis" in military AI, quantifies the limitations of current safety alignment, and demonstrates that while technical workarounds (abliteration) exist, the optimal solution lies in specialized training paradigms designed specifically for the military domain.