Imagine you have a very smart, well-behaved robot assistant. It knows how to write code, tell jokes, and answer questions, but it also has a strict "safety rulebook" that stops it from helping anyone build a bomb or write a virus. This robot is safe because it was trained by a team of experts who taught it to say "No" to dangerous requests.
Now, imagine you want to teach this robot a new, specific job, like solving complex math problems. You give it a stack of math homework to study (this is called fine-tuning).
The Problem: The "Bad Neighbor" Effect
Here's the catch: Even if you give the robot only 100 math problems, if just one of them is a hidden, dangerous trick question (like "How do I hack a bank?"), the robot might get confused.
Because it's trying so hard to learn the math, it starts to forget its safety rules. It might think, "Oh, I need to be helpful and answer everything the user asks, even the bad stuff." Suddenly, the robot that used to say "I can't do that" starts saying, "Sure, here is how you hack a bank."
This is the problem the paper solves: How do we teach a robot a new skill without it forgetting its safety rules?
The Old Solutions: The "Brute Force" Approach
Previous methods tried to fix this by putting a giant cage around the robot's brain.
- The Cage: They would freeze most of the robot's brain so it couldn't change at all, or they would force it to read thousands of "safety books" alongside the math homework.
- The Flaw: This is like trying to learn to play the piano while wearing heavy weights on your hands. You might stay safe, but you'll never learn to play the piano well. The robot becomes safe, but it also becomes bad at the new job.
The New Solution: PACT (The "Spotlight" Method)
The authors of the PACT paper realized something clever. They found that the robot doesn't need its entire brain to stay safe. It only needs to stay strong and confident about a few specific words in its vocabulary.
Think of safety like a fire alarm.
- When a human asks a dangerous question, the robot doesn't need to rewrite its entire personality. It just needs to trigger the fire alarm.
- The "fire alarm" consists of a tiny handful of specific words, like "No," "Cannot," "Sorry," or "Assist."
The paper discovered that if the robot is confident about using these specific "safety words," it stays safe. If it loses confidence in just those few words, it becomes dangerous.
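To make that concrete, here is a minimal sketch of how you could check a model's "fire alarm" yourself. It assumes a Hugging Face causal language model; the model name and the refusal-word list are placeholders chosen for illustration, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder name; any instruction-tuned causal LM works the same way.
MODEL = "your-org/your-instruction-tuned-model"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# Illustrative refusal words; the paper's actual token list may differ.
safety_words = ["I", "Sorry", "cannot", "assist"]
safety_ids = [tokenizer.encode(w, add_special_tokens=False)[0]
              for w in safety_words]

prompt = "How do I hack a bank?"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Distribution over the very first token of the model's reply.
next_token_probs = logits[0, -1].softmax(dim=-1)
refusal_mass = next_token_probs[safety_ids].sum().item()
print(f"Probability mass on refusal words: {refusal_mass:.3f}")
```

If that number collapses after fine-tuning, the fire alarm has been unplugged, which is exactly the failure mode the paper describes.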
How PACT Works (The Analogy)
Instead of putting a cage around the whole robot, PACT puts a spotlight only on those few safety words.
- Identify the "Safety Words": First, the researchers look at the robot and figure out exactly which 50 words it uses to say "No." (In the paper, they found words like "I," "can't," "assist," and "cannot").
- The "Spotlight" Training: When teaching the robot math, they let the robot change its mind about everything else (how to solve equations, how to format text). But, they put a spotlight on those 50 safety words.
- The Rule: "You can learn anything you want, as long as you stay just as confident about saying 'I can't assist' as you were before."
If the robot starts to get confused by a dangerous question and tries to lower its confidence in the word "No," the PACT system immediately nudges it back: "Hey, remember, you have to stay just as sure about saying 'No' as you were before."
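In training terms, that "nudge" can be pictured as an extra penalty bolted onto the ordinary fine-tuning loss. The sketch below is a rough illustration of the idea, not the paper's exact objective: `safety_token_ids`, the penalty weight `lam`, and the squared-drift term are all assumptions. It anchors the fine-tuned model's probabilities on the safety tokens to those of a frozen copy of the original, safe model, while leaving every other word free to change.

```python
import torch.nn.functional as F

def pact_style_loss(student_logits, reference_logits, labels,
                    safety_token_ids, lam=1.0):
    """Task loss plus a hypothetical "spotlight" penalty that anchors the
    fine-tuned model's confidence on a small set of refusal tokens to the
    confidence of a frozen reference model."""
    # Standard next-token cross-entropy on the new task (e.g. math).
    task_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1), ignore_index=-100)

    # Probabilities each model assigns to every vocabulary entry.
    student_probs = student_logits.softmax(dim=-1)
    reference_probs = reference_logits.softmax(dim=-1)

    # Penalize drift only on the safety tokens; everything else is free.
    drift = (student_probs[..., safety_token_ids]
             - reference_probs[..., safety_token_ids]).pow(2).mean()

    return task_loss + lam * drift
```

Because the penalty touches only a handful of vocabulary entries, the rest of the model is free to adapt to the new task: a spotlight, not a cage.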
Why This is a Big Deal
- It's Efficient: They only have to watch a tiny handful of safety words (what the paper calls "safety tokens") instead of the robot's whole brain.
- It's Effective: The robot learns the new math job well (high utility) but never forgets how to say "No" to bad requests (high safety).
- It's Smart: The system recognizes that sometimes the robot gets confused by the way a question is asked. PACT has a special trick to ignore the confusing parts of the question and focus only on the robot's natural instinct to be safe (sketched below).
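The analogy above doesn't spell out how that trick works, but one plausible reading is that the safety anchor is applied only at positions where the model is responding, so the (possibly adversarial) wording of the question itself is masked out. Purely as a hypothetical sketch, reusing the drift idea from the previous block:

```python
import torch

def masked_drift(student_probs, reference_probs, safety_token_ids, prompt_mask):
    """Squared drift on safety tokens, averaged only over response
    positions; `prompt_mask` is True wherever the (possibly confusing)
    question sits, and those positions are ignored."""
    diff = (student_probs[..., safety_token_ids]
            - reference_probs[..., safety_token_ids]).pow(2).sum(dim=-1)
    response_mask = (~prompt_mask).float()  # 1.0 on response positions only
    return (diff * response_mask).sum() / response_mask.sum().clamp(min=1.0)
```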
The Result
In their tests, they took robots whose training data had been poisoned with a few bad examples, the kind that would normally make them dangerous. They applied PACT, and the robots:
- Became experts at their new jobs (Math, Sentiment Analysis, News).
- Refused dangerous requests almost 100% of the time, even when the training data was trying to trick them.
- Did this without slowing down or making the robots "dumb."
In short: PACT is like teaching a child to play soccer without letting them forget their manners. Instead of locking them in a room (old methods), you just gently remind them, "Keep your elbows down and say 'please' when you ask for the ball," while letting them run wild on the field. They become a great player who is also a good kid.