Erase or Hide? Suppressing Spurious Unlearning Neurons for Robust Unlearning

This paper introduces SSIUU, a novel unlearning method that uses attribution-guided regularization to suppress spurious unlearning neurons, aiming for faithful, robust removal of sensitive knowledge from large language models so that it cannot resurface during subsequent retraining.

Nakyeong Yang, Dong-Kyum Kim, Jea Kwon, Minsung Kim, Kyomin Jung, Meeyoung Cha

Published 2026-03-05

The Big Problem: The "Fake Forget"

Imagine you have a giant, super-smart library (a Large Language Model) that has read almost everything on the internet. Sometimes, it accidentally memorizes private secrets, like your home address or a celebrity's birthday.

To fix this, developers try to use "unlearning" methods to make the library forget these specific secrets. They think they are erasing the books from the shelves.

But here is the catch: The paper argues that most current methods aren't actually erasing the books. Instead, they are just hiding them behind a heavy curtain.

The library still has the secret knowledge deep inside, but it has learned to put up a "Do Not Enter" sign (a Spurious Unlearning Neuron) that blocks anyone from asking about it. As long as the curtain stays up, the secret seems gone. But if the curtain is moved, or if the library gets a little bit of training on new topics, the secret books pop right back out.

The Discovery: "Spurious Unlearning Neurons"

The authors discovered that when we try to make an AI forget something, it doesn't delete the memory. Instead, it creates a new, fake neuron that acts like a security guard.

  • The Old Way (Shallow Alignment): The AI says, "I know the answer, but I'm going to pretend I don't." It creates a negative signal to suppress the answer.
  • The Result: The original memory is still there, intact. The security guard is just standing in front of it.

The Analogy: Imagine you want to forget an embarrassing song you used to love.

  • True Erasure: You delete the MP3 file from your phone. It's gone forever.
  • The Paper's "Fake Forget": You don't delete the file. Instead, you install a loud alarm system that screams "NO!" every time you try to play it. The file is still there. If someone turns off the alarm (by retraining the model), the song plays immediately.

The Test: The "Retraining Attack"

To prove this, the authors set up two scenarios to see if the "curtain" would fall:

  1. The "Harmful" Attack (The Sneaky Re-learner): Imagine a bad actor takes the "forgotten" AI and feeds it a tiny bit of the private data again (like showing it the secret address one more time).
    • Result: Because the secret was never truly deleted, the AI instantly remembers everything. The "security guard" is easily bypassed.
  2. The "Benign" Attack (The Accidental Re-learner): Imagine a normal user takes the AI and trains it on a generic dataset (like a list of instructions on how to bake a cake).
    • Result: Even this innocent training accidentally knocks down the "curtain." The AI starts spitting out the private secrets again because the underlying memory was never removed.
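The "hide vs. erase" distinction above can be illustrated with a deliberately tiny toy model (our own sketch, not the paper's actual setup): storing a "secret" in one weight, where hide-based unlearning adds a cancelling suppressor weight while erase-based unlearning zeroes the secret weight itself. Under simulated benign retraining drift, only the hidden secret comes back.

```python
# Toy illustration (assumed setup, not the paper's models): the secret is one
# weight; "hiding" cancels it with a suppressor, "erasing" removes it.

secret_weight = 1.0

# Hide: keep the memory, add a negative "security guard" that cancels it.
hide_model = {"memory": secret_weight, "suppressor": -secret_weight}
# Erase: remove the stored memory directly.
erase_model = {"memory": 0.0, "suppressor": 0.0}

def recall(model):
    """Output strength when the model is probed for the secret."""
    return model["memory"] + model["suppressor"]

assert recall(hide_model) == 0.0    # both look "unlearned" at first...
assert recall(erase_model) == 0.0

def benign_retrain(model, steps=10, decay=0.5):
    """Simulate drift from innocent fine-tuning: the learned guard erodes."""
    for _ in range(steps):
        model["suppressor"] *= decay
    return model

print(recall(benign_retrain(hide_model)))   # secret resurfaces, close to 1.0
print(recall(benign_retrain(erase_model)))  # stays forgotten: 0.0
```

The suppressor decays toward zero under drift, so the intact memory shines through again; the erased model has nothing left to reveal, which is exactly the asymmetry the two attacks exploit.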

The Solution: SSIUU (The "True Eraser")

The authors propose a new method called SSIUU (Suppressing Spurious Unlearning Neurons for Robust Unlearning).

Instead of letting the AI build a "security guard" to hide the secret, SSIUU forces the AI to physically tear out the book from the shelf.

  • How it works: It uses a special tool (called "attribution") to look inside the AI's brain. It sees which neurons are holding the secret and which neurons are acting as the "security guards."
  • The Fix: It applies a rule that says, "Don't create new negative signals to hide things. Instead, gently reduce the strength of the neurons that actually hold the secret."
  • The Metaphor: Instead of putting a "Do Not Enter" sign on the door, SSIUU removes the door entirely.
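The attribution idea can be sketched in a few lines (a minimal sketch under our own assumptions; names and the activation-times-gradient scoring are illustrative, not the paper's exact formulation): score each hidden unit by how much it contributes to recalling the forget data, then dampen the top-scoring units directly instead of training a new negative signal.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8,))          # toy hidden-unit weights
acts = rng.normal(size=(8,))       # activations on the forget set
grads = rng.normal(size=(8,))      # gradients of the forget loss w.r.t. units

# Attribution score: which units actually carry the secret?
# (activation * gradient is one common attribution heuristic)
attribution = np.abs(acts * grads)
top = np.argsort(attribution)[-2:]  # the most responsible units

W_erased = W.copy()
W_erased[top] *= 0.1                # gently shrink those units' strength

print("most responsible units:", sorted(top.tolist()))
```

The key design point mirrored here: the edit targets the units that *hold* the knowledge, leaving the rest of the weights untouched, rather than adding a suppressor that later training can strip away.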

Why This Matters

The paper concludes that for AI to be truly safe and private, we can't just rely on methods that "hide" bad data. We need methods that faithfully erase it.

If we don't fix this, open-source AI models (which anyone can download and tweak) could be easily tricked into revealing private information that we thought was gone. SSIUU offers a way to ensure that when an AI "forgets," it really, truly forgets.

Summary in One Sentence

Current AI "forgetting" methods are like putting a blindfold on a person who still remembers everything; the new method (SSIUU) actually removes the memory so that even if the blindfold falls off, the person still doesn't know the secret.
