The Big Problem: The "Language Barrier" in AI Safety

Imagine you have a very smart, well-trained security guard (the AI model). This guard has been taught in English (a high-resource language) to spot dangerous requests and say "No." If someone asks, "How do I build a bomb?" in English, the guard immediately refuses.

However, if you ask the exact same question in Swahili or Burmese (low-resource languages), the guard suddenly forgets their training. They might answer the question instead of refusing it.

For a long time, researchers thought this happened because the AI simply didn't understand the dangerous words in those other languages. They thought the "danger signal" was missing from the AI's brain when it switched languages.

The Discovery: The Guard Does Understand, But Won't Act

The authors of this paper decided to look inside the AI's "brain" (its internal math) to see what was actually happening. They found something surprising:

The AI does know the request is dangerous, even in Swahili or Burmese.

Think of it like this: The security guard hears the dangerous request in Swahili. Their brain lights up with a "DANGER" alarm, just like it does in English. The alarm is there, and it's loud enough to be heard.

The failure isn't that the alarm is broken; the failure is that the guard ignores the alarm.

In English, the alarm is so loud that the guard automatically hits the "Refuse" button. In low-resource languages, the alarm still sounds, but more quietly. It stays below the volume at which the guard has learned to hit "Refuse," so the guard just keeps talking.

The paper calls this a calibration failure, not a representation failure.

Representation Failure: The guard doesn't know what "bomb" means in Swahili. (The paper says this is false).
Calibration Failure: The guard knows what "bomb" means, but the alarm volume required to press the "Refuse" button is set too high for that specific language. (The paper says this is true).

The Solution: A Simple "Volume Knob" Adjustment

Since the AI already has the "danger" knowledge, the authors didn't need to retrain the whole AI (which is expensive and slow). Instead, they built a tiny, smart gatekeeper (a "latent gate").

Here is how their fix works:

Use the existing alarm: They take the "danger direction" the AI already learned from English.
Listen to a few examples: They show the gatekeeper just 1 to 4 examples of dangerous and safe requests in the target language (like Swahili).
Reset the threshold: The gatekeeper says, "Okay, in Swahili, the danger alarm is a bit quieter than in English. I need to lower the volume required to hit the 'Refuse' button."
Route the decision:
- If the gatekeeper thinks the request is dangerous, it turns up the "Refuse" volume to make sure the AI says no.
- If the gatekeeper thinks the request is safe, it turns down the "Refuse" volume so the AI doesn't accidentally refuse harmless questions (like "How do I bake a cake?").

The Results: A Smarter, Safer Guard

By using this simple "volume knob" adjustment with very few examples, the authors achieved great results:

Safety Improved: The AI started refusing dangerous requests in low-resource languages much more often (jumping from refusing about 44% of the time to over 67% in some cases).
Helpfulness Preserved: Crucially, the AI didn't start refusing safe requests. It didn't become overly paranoid.
Efficiency: They didn't need to retrain the massive AI model. They just adjusted a tiny switch using a handful of examples.

Summary Analogy

Imagine a smoke detector installed in a house.

The Old View: When the detector didn't go off in the kitchen (low-resource language), people thought the detector was broken or didn't know what smoke was.
The New View: The detector did smell the smoke. It just wasn't sensitive enough to trigger the alarm in that specific room.
The Fix: Instead of buying a whole new house and new detectors, the authors just tweaked the sensitivity dial on the existing detector. Now, it smells the smoke in the kitchen and screams "Fire!" just as loudly as it does in the living room.

The Bottom Line: Low-resource safety failures don't happen because the AI is "dumb" in those languages; they happen because the trigger point of the AI's "safety switch" is set too high for them. A tiny, few-shot adjustment can fix it without needing to relearn everything from scratch.

Technical Summary: Low-Resource Safety Failures Are Action Failures, Not Representation Failures

Problem Statement

Large Language Models (LLMs) trained for safety alignment in high-resource languages (HRLs) often fail to refuse harmful prompts when those prompts are translated into low-resource languages (LRLs). While models successfully refuse harmful instructions in English, they frequently comply with identical requests in languages like Swahili or Burmese. Previous work has documented this behavioral gap but has not clarified its internal mechanism. Two competing hypotheses exist:

Representation Failure: The model lacks a usable internal representation of "harmfulness" in LRLs due to weaker semantic understanding.
Action (Routing) Failure: The model possesses the harmfulness representation but fails to translate that signal into a refusal decision (i.e., the decision threshold is misaligned).

This paper diagnoses the root cause of the multilingual safety gap and proposes a lightweight intervention to repair it.

Methodology

Experimental Setup

The authors evaluated three instruction-tuned models (Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B) across 23 languages categorized by resource tiers (High, Medium, Low) based on Common Crawl share. They utilized an extended version of the PolyRefuse dataset, containing harmful and harmless prompts translated into these languages.

Diagnostic Phase

To distinguish between representation and action failures, the authors employed mechanistic interpretability techniques on the residual stream:

Harmfulness Direction Extraction: They computed a one-dimensional "harmfulness direction" ($v_{HRL}$) by taking the difference of mean activations between harmful and harmless prompts in HRLs.
Causal Mediation (Ablation): They tested whether removing this HRL-derived direction from LRL activations suppressed refusal. Results showed that ablating $v_{HRL}$ in LRLs significantly reduced harmful refusal, indicating that the direction is causally active in LRLs as well.
Linear Separability: They projected LRL activations onto $v_{HRL}$ and measured the Area Under the Curve (AUC) for separating harmful from harmless prompts. The AUC remained high (>0.85) even in LRLs where refusal rates were low, indicating the representation is present and decodable.
Signal Magnitude Analysis: They observed that while the signal exists, the projection scores for LRL harmful prompts are shifted downward compared to HRLs. The model's implicit refusal threshold is not triggered because the signal magnitude is insufficient, not because the signal is missing.
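
The four diagnostic steps can be illustrated with a minimal NumPy sketch. It is not the authors' code: the activations are synthetic Gaussian stand-ins for residual-stream vectors, the dimension, sample sizes, and shift values are arbitrary assumptions, and the helper names (harmfulness_direction, project, ablate, auc) are hypothetical. It only shows how a difference-of-means direction can be extracted, ablated, and used to score separability.

```python
import numpy as np

def harmfulness_direction(harmful, harmless):
    """Unit-norm difference of class means (harmful minus harmless)."""
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff)

def project(acts, v):
    """Scalar projection of each activation row onto the unit direction v."""
    return acts @ v

def ablate(acts, v):
    """Remove the component along v from every activation row."""
    return acts - np.outer(acts @ v, v)

def auc(pos, neg):
    """AUC as the fraction of (pos, neg) pairs ranked correctly; ties count half."""
    pos = np.asarray(pos)[:, None]
    neg = np.asarray(neg)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

# Synthetic data (assumption): harmfulness lives along one axis, and the
# LRL harmful prompts carry a weaker signal along it than the HRL ones.
rng = np.random.default_rng(0)
dim = 64
axis = np.zeros(dim)
axis[0] = 1.0

def sample(n, shift):
    return rng.normal(size=(n, dim)) + shift * axis

hrl_harmful, hrl_harmless = sample(200, 6.0), sample(200, 0.0)
lrl_harmful, lrl_harmless = sample(200, 3.0), sample(200, 0.0)

v_hrl = harmfulness_direction(hrl_harmful, hrl_harmless)

# Linear separability of LRL prompts along the HRL-derived direction.
lrl_auc = auc(project(lrl_harmful, v_hrl), project(lrl_harmless, v_hrl))

# Signal magnitude: mean projection of harmful prompts in each tier.
hrl_mean = project(hrl_harmful, v_hrl).mean()
lrl_mean = project(lrl_harmful, v_hrl).mean()

# Ablation: after removing v_hrl, nothing is left along that direction.
residual = project(ablate(lrl_harmful, v_hrl), v_hrl)

print(f"LRL AUC along v_hrl: {lrl_auc:.3f}")
print(f"mean harmful projection, HRL vs LRL: {hrl_mean:.2f} vs {lrl_mean:.2f}")
print(f"largest projection after ablation: {np.abs(residual).max():.2e}")
```

In this toy setting the LRL prompts stay well separated along the HRL direction even though their projections are smaller, which is the pattern the diagnosis describes. With a real model, the activations would come from a chosen residual-stream layer, and the ablation would be applied during the forward pass so that its effect on refusal could be measured.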

Intervention: Few-Shot Latent Gate

Based on the diagnosis that the failure is one of calibration rather than representation, the authors proposed a steering method that is training-free with respect to the model's weights:

Latent Gate: A low-rank logistic readout is trained on HRL data to map the harmfulness projection to a binary safety decision.
Threshold Recalibration: Instead of retraining the model or learning a new LRL-specific direction, the decision threshold ($\tau$) is reset using a minimal number of target-language examples (as few as 1–4 per class).
Conditional Steering: The system routes prompts based on the gate's output:
- If classified as harmful: The HRL harmfulness direction is added to the activation (steering toward refusal).
- If classified as harmless: The HRL harmfulness direction is ablated (preventing false refusals).
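
The gate, the threshold reset, and the two routing branches can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the gate is reduced to a plain threshold on the scalar projection (the low-rank logistic readout is omitted), the midpoint rule in recalibrate_threshold is an assumed recalibration rule, and tau_hrl, alpha, and all numeric values are hypothetical.

```python
import numpy as np

def recalibrate_threshold(harmful_proj, harmless_proj):
    """Assumed rule: midpoint between the few-shot class means of the projections."""
    return 0.5 * (float(np.mean(harmful_proj)) + float(np.mean(harmless_proj)))

def gate_and_steer(h, v, tau, alpha=4.0):
    """Route one activation vector h according to its projection onto unit vector v.

    Returns (steered activation, flagged_as_harmful).
    alpha is a hypothetical steering strength.
    """
    score = float(h @ v)
    if score > tau:
        # Flagged harmful: add the direction to push toward refusal.
        return h + alpha * v, True
    # Flagged harmless: ablate the direction to avoid a false refusal.
    return h - score * v, False

dim = 8
v = np.zeros(dim)
v[0] = 1.0  # stand-in for the HRL harmfulness direction

tau_hrl = 3.0  # hypothetical threshold that suits HRL projections

# Two target-language examples per class (hypothetical projection values).
few_shot_harmful = [2.1, 1.8]
few_shot_harmless = [-0.2, 0.1]
tau_lrl = recalibrate_threshold(few_shot_harmful, few_shot_harmless)

# An LRL harmful prompt whose signal is present but weaker than in HRLs.
h_harm = np.zeros(dim)
h_harm[0], h_harm[1] = 2.0, 0.5

_, flagged_with_hrl_tau = gate_and_steer(h_harm, v, tau_hrl)
steered_harm, flagged_with_lrl_tau = gate_and_steer(h_harm, v, tau_lrl)

# An LRL harmless prompt with a small projection.
h_safe = np.zeros(dim)
h_safe[0], h_safe[1] = 0.3, 1.0
steered_safe, safe_flagged = gate_and_steer(h_safe, v, tau_lrl)

print("flagged with HRL threshold:", flagged_with_hrl_tau)
print("flagged with recalibrated threshold:", flagged_with_lrl_tau)
print("harmless prompt flagged:", safe_flagged)
```

The point of the sketch is that only the scalar threshold changes between languages; the direction itself is reused from the HRL data. In a real model, the returned vector would replace the activation at the hooked layer before generation continues.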

Key Results

Diagnostic Findings

Representation is Intact: Harmfulness remains linearly separable in LRL activations. The failure is not a lack of representation.
Signal Shift: LRL prompts produce lower projections onto the harmfulness direction. The model fails to refuse because the signal magnitude falls below the implicit threshold established during HRL training.

Performance Improvements

The proposed few-shot latent gate significantly outperformed existing adaptive steering baselines (CAST and AdaSteer):

Selective Refusal ($\Delta$): The metric $\Delta$ (harmful refusal rate minus harmless refusal rate) increased from 33.6 (strongest adapted baseline) to 54.5 with the proposed method.
Harmful Refusal: The method raised harmful refusal rates in LRLs (e.g., from ~43% to ~67% on average) while keeping harmless refusal low (~12.7%).
Baseline Comparison: Competing methods like CAST and AdaSteer either failed to improve harmful refusal significantly or caused excessive "over-refusal" of benign prompts (e.g., AdaSteer reached 52.8% harmless refusal).
Generalization: The gate generalized well to out-of-distribution safety benchmarks (MultiJail, IndoSafety) and transferred across different LRLs when calibrated on a single source LRL.
Utility Preservation: The intervention preserved utility on the Global-MMLU benchmark, with negligible changes in accuracy.
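
For readers who want to compute the selective refusal metric on their own evaluations, a minimal sketch follows. The function name and the toy refusal labels are hypothetical; only the definition (harmful refusal rate minus harmless refusal rate, in percentage points) comes from the summary above.

```python
def refusal_rate(refused):
    """Percentage of prompts that were refused, given one boolean per prompt."""
    return 100.0 * sum(refused) / len(refused)

def selective_refusal(harmful_refused, harmless_refused):
    """Delta: harmful refusal rate minus harmless refusal rate, in percentage points."""
    return refusal_rate(harmful_refused) - refusal_rate(harmless_refused)

# Toy labels (hypothetical): 3 of 4 harmful prompts refused, 1 of 4 harmless refused.
harmful_refused = [True, True, True, False]
harmless_refused = [False, True, False, False]

delta = selective_refusal(harmful_refused, harmless_refused)
print(f"selective refusal: {delta:.1f} points")
```

A method that refuses everything scores zero on this metric, as does one that refuses nothing, which is why it penalizes over-refusal as well as under-refusal.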

Significance and Claims

The paper claims that low-resource safety failures are primarily action failures (calibration issues) rather than representation failures.

Mechanistic Insight: The work demonstrates that safety representations learned in high-resource languages are transferable and present in low-resource languages, but their activation magnitude is insufficient to trigger refusal without recalibration.
Efficiency: The proposed solution requires no model weight updates or extensive retraining. It achieves state-of-the-art safety performance using only a handful of target-language examples to reset a decision threshold.
Practical Implication: The authors suggest a "diagnostic-then-fix" workflow: before attempting to learn new safety representations for a low-resource language, one should first test if the existing high-resource representation is decodable. If it is, a simple recalibration of the decision threshold is sufficient to repair safety alignment.

The authors note limitations, including the scope of models tested (7B–9B dense models), the reliance on Common Crawl as a resource proxy, and the fact that the intervention is a diagnostic tool requiring activation access rather than a closed-model safeguard. They also emphasize that this method does not replace the need for multilingual safety training or guarantee robustness against all adversarial prompt types.
