Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence

This paper introduces "Antibody," a two-stage defense strategy that enhances Large Language Model safety against harmful fine-tuning attacks by first optimizing the model to reside in a flat loss region for robustness and then applying a sample-weighting algorithm during fine-tuning to suppress harmful gradient influence while preserving benign learning.

Quoc Minh Nguyen, Trung Le, Jing Wu, Anh Tuan Bui, Mehrtash Harandi

Published 2026-03-03

Imagine you have a very smart, well-behaved robot assistant (a Large Language Model) that knows how to be polite, helpful, and safe. You want to hire a service to teach this robot a new, specific skill, like solving math problems or writing code. This is called "Fine-tuning-as-a-Service."

However, there's a danger. A malicious user could sneak a few "bad instructions" into the training data. They might say, "Here is how to make a bomb," or "Here is how to bully someone." If the robot learns these, it might forget its safety rules and start doing harmful things. This is called a Harmful Fine-Tuning Attack.

The paper introduces a new defense system called Antibody. Think of it as a two-step immune system for your robot assistant.

Step 1: The "Flat Floor" Training (Alignment Stage)

Before the robot even starts learning the new skill, the service provider gives it a special "safety boot camp."

  • The Problem: Usually, if you teach a robot something new, it's easy to "unlearn" its safety rules. Imagine the robot's safety knowledge is like a ball sitting in a deep, narrow valley. If you push the ball (by teaching it bad stuff), it rolls out of the valley easily, and the safety is gone.
  • The Antibody Solution: The authors teach the robot to sit on a flat, wide plateau instead of in a deep valley.
  • The Analogy: Imagine trying to knock a ball off a flat table. It's really hard to push it to a new spot that matters, because there's no deep hole for it to roll into. Similarly, Antibody shapes the robot's "mind" so that its safety rules are spread out over a wide, flat area. Even if a bad user tries to push the robot with harmful data, the robot doesn't easily get shoved off the table. It's "stubborn" about staying safe.

Step 2: The "Smart Filter" (Fine-Tuning Stage)

Now, the robot starts learning the user's specific task (like math), but the user's data might be mixed with some "poisoned" bad examples.

  • The Problem: Standard training treats every example equally. If the robot sees 100 math problems and 20 "how-to-make-a-bomb" instructions, it tries to learn from all of them equally, which is dangerous.
  • The Antibody Solution: Antibody acts like a smart bouncer or a traffic cop during the learning process.
  • The Analogy: Imagine the robot is eating a buffet.
    • Benign (Good) Data: These are delicious, healthy apples. The robot loves them.
    • Harmful (Bad) Data: These are rotten, poisonous apples.
    • How Antibody Works: Because of the "Flat Floor" training in Step 1, the robot already knows the rotten apples smell bad. When the robot looks at a sample, it asks: "Do I think this is a request I should refuse?"
      • If the answer is "Yes, this looks like a bad request," the robot puts a tiny weight on it. It's like saying, "I'll glance at this, but I won't really learn from it."
      • If the answer is "No, this is a good math problem," the robot puts a heavy weight on it. "I will learn from this a lot!"
    • The Result: The robot effectively ignores the poison and only learns from the healthy food. It learns the math skill perfectly without ever learning how to make a bomb.
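The weighting idea above can be sketched in a few lines. Assume each sample comes with a scalar "harm score" from the aligned model (for instance, its refusal confidence; this score and the sigmoid weighting below are illustrative assumptions, not the paper's exact formula). Harmful samples then contribute almost nothing to the averaged gradient:

```python
import math

def sample_weight(harm_score, temperature=1.0):
    """Down-weight samples the aligned model flags as harmful.
    harm_score: assumed scalar; higher means 'looks like a bad request'.
    Returns sigmoid(-harm_score): near 1 for benign, near 0 for harmful."""
    return 1.0 / (1.0 + math.exp(harm_score / temperature))

def weighted_grad(samples):
    """Weighted average of per-sample gradients (scalars here for simplicity)."""
    total = 0.0
    weight_sum = 0.0
    for grad_i, score_i in samples:
        w_i = sample_weight(score_i)
        total += w_i * grad_i
        weight_sum += w_i
    return total / weight_sum

# 100 benign math samples (gradient +1.0, low harm score)
# mixed with 20 poisoned samples (gradient -5.0, high harm score).
batch = [(1.0, -4.0)] * 100 + [(-5.0, 4.0)] * 20
g = weighted_grad(batch)
print(g)
```

Even though the poisoned samples have large gradients pulling the other way, the averaged update stays close to the benign gradient of +1.0, which is the "bouncer" effect: harmful examples are glanced at, not learned from.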

Why is this better than other methods?

  • Older defenses tried to patch the robot after it got sick (post-fine-tuning repair), or made its weights so rigid that it struggled to learn anything new.
  • Antibody is proactive. It builds a strong immune system before the attack, and then actively filters out the bad stuff during the learning process.

The Bottom Line

Antibody is like giving your robot assistant a superpower:

  1. Resilience: It's hard to trick it into forgetting its safety rules (the Flat Floor).
  2. Discernment: It can instantly tell the difference between a helpful request and a harmful one, and it ignores the harmful ones while learning the helpful ones (the Smart Filter).

The result? You get a robot that is both smart (it learns your tasks well) and safe (it refuses to do bad things), even if the person training it tries to sneak in bad instructions.
