The Big Picture: The "Right to be Forgotten" Problem
Imagine you have a giant, super-smart library (a Pre-trained AI Model) that has read millions of books from the internet. Sometimes, people realize a specific book in that library contains their private diary or a copyrighted story they didn't want shared. They ask the librarian to "forget" it.
In the world of AI, this is called Machine Unlearning. The goal is to make the AI "un-remember" specific information so it can't be used or leaked anymore.
However, this paper argues that most current methods for "forgetting" are like a magician's trick: they make the information look gone, but it's actually still hiding in the back of the closet.
The Core Problem: Hiding vs. Deleting
The authors say there are two ways an AI can "forget":
- True Deletion: You take the book out of the library, burn it, and erase the memory of its existence from the librarian's mind. It is gone forever.
- Suppression (The Trick): You tell the librarian, "If anyone asks about this book, say 'I don't know' or give them a fake answer." But deep down, the librarian still remembers the story perfectly. They are just pretending to forget.
The Problem: Current tests only check what the librarian says (the output). If the librarian says "I don't know," the test says, "Great, you forgot!" But the paper shows that the librarian is actually just suppressing the answer, not deleting the memory.
The Solution: The "Feature Detective" (SAEs)
To catch the librarian lying, the authors built a new tool called a Restoration-Based Framework. Here is how it works, using an analogy:
Imagine the AI model is a complex factory assembly line.
- Early stations build basic parts (like wheels or screws).
- Middle stations assemble the engine (this is where the "meaning" or "concepts" live).
- Final stations paint the car and put it on the showroom floor (this is the final answer).
The authors used a tool called a Sparse Autoencoder (SAE). Think of the SAE as a high-tech X-ray scanner that can look inside the middle stations of the factory. It can identify specific "expert workers" who are responsible for recognizing specific things (like "birds" or "gas pumps").
The Experiment:
- They took an AI that was supposed to have "forgotten" a specific class (e.g., birds).
- They used the X-ray scanner to find the "bird expert" workers in the middle of the factory.
- They forced those workers to wake up and do their job again (this is called "steering").
- They watched what happened at the end of the assembly line.
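The steps above can be sketched as a toy sparse-autoencoder intervention. Everything here is a made-up stand-in rather than the paper's actual setup: the dimensions, the random weights, and the feature index are illustrative. In a real pipeline, the SAE is trained to reconstruct the model's hidden activations, and "steering" clamps one learned feature before decoding and letting the model continue its forward pass.

```python
import random

random.seed(0)
D, F = 8, 16  # hidden width and SAE feature count (toy sizes, not the paper's)

# Random weights stand in for a trained SAE's encoder and decoder.
W_enc = [[random.gauss(0, 0.3) for _ in range(F)] for _ in range(D)]
W_dec = [[random.gauss(0, 0.3) for _ in range(D)] for _ in range(F)]

def encode(h):
    """features = ReLU(h @ W_enc): the sparse 'expert worker' activations."""
    return [max(0.0, sum(h[i] * W_enc[i][j] for i in range(D)))
            for j in range(F)]

def decode(f):
    """reconstruction = f @ W_dec: map features back to hidden space."""
    return [sum(f[j] * W_dec[j][i] for j in range(F)) for i in range(D)]

def steer(h, feature_idx, value):
    """'Wake up' one expert: clamp a single SAE feature to a fixed value,
    decode, and hand the edited hidden state back to the model, which then
    continues its forward pass from this middle layer."""
    f = encode(h)
    f[feature_idx] = value
    return decode(f)

h = [random.gauss(0, 1) for _ in range(D)]      # a middle-layer activation
h_steered = steer(h, feature_idx=3, value=5.0)  # force a hypothetical "bird" feature on
```

The key design point is that nothing downstream of the steered layer is modified: if the final answer flips back to "bird," the knowledge was still sitting in the middle of the factory all along.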
The Shocking Result:
Even though the AI was supposed to have forgotten birds, the moment they "woke up" the bird experts in the middle, the AI immediately started recognizing birds again with 90–100% accuracy.
The Conclusion: The information wasn't deleted. It was just suppressed. The "bird" knowledge was still sitting right there in the middle of the factory, waiting to be turned on.
Key Findings in Plain English
Most Methods are Just "Hiding" the Truth:
The paper tested 12 different ways to make AI forget. Almost all of them were just suppression: they changed the final answer but left the internal memory intact. It's like putting a "Do Not Enter" sign on a room full of furniture; the furniture is still there.
Even "Retraining" Isn't Safe:
You might think, "If I just retrain the AI without the bad data, it will be safe." The authors found that even this doesn't work perfectly. Because the AI learned so much from the internet before it was fine-tuned, the deep "concepts" (like what a bird looks like) are strong enough to survive the retraining. The "bird" memory is too deeply ingrained to be erased just by retraining.
The "Middle" Is Where the Magic Happens:
The information that needs to be deleted usually lives in the middle layers of the AI. If you only check the final answer, you miss the hidden memory. To truly delete something, you have to go into the middle of the factory and remove the "bird experts" themselves.
Only One Method Worked Well:
One method, called EU-K (which involves resetting specific layers and retraining them without the forgotten data), actually managed to delete the information. It was the only one where the "X-ray scanner" couldn't wake up the bird experts. This suggests that truly deleting data requires changing the internal structure of the AI, not just tweaking the final answer.
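A minimal sketch of what that kind of layer reset might look like, assuming a model represented as a plain list of parameter arrays. The names, sizes, and initialization here are all illustrative assumptions, not the paper's implementation; in practice the re-initialized layers would then be retrained on data that excludes the forget set.

```python
import random

random.seed(0)

# Toy "model": six layers, each a flat list of parameters.
model = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]

def reset_last_k(layers, k):
    """EU-K-style reset (sketch): freshly re-initialize the last k layers
    while leaving earlier layers untouched. The reset layers would then be
    retrained only on the retained data, so the old weights encoding the
    forgotten class cannot simply be recovered by steering."""
    out = [layer[:] for layer in layers]     # copy; don't mutate the input
    for layer in out[-k:]:
        for i in range(len(layer)):
            layer[i] = random.gauss(0, 0.1)  # fresh initialization
    return out

reset = reset_last_k(model, k=2)
```

Because the old parameters are overwritten rather than merely counteracted at the output, there is no intact "bird expert" left in those layers for a restoration attack to reactivate.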
Why Should You Care? (The Real-World Risk)
Imagine you are a company that buys a pre-trained AI model to help your customers. You ask the developer to remove a specific customer's private data. They say, "Done! We ran our tests, and the model doesn't know that data anymore."
Based on this paper, you should be skeptical.
- The model might be "lying" (suppressing).
- If a hacker or a clever researcher uses the "X-ray scanner" (the restoration technique), they could unlock that private data instantly.
- This creates serious risk under privacy laws (like the GDPR) and under copyright law.
The Takeaway
The authors are calling for a new rulebook. We can no longer just ask, "Does the AI give the right answer?" We need to ask, "Is the memory actually gone from the inside?"
They propose that future AI safety checks must include these "X-ray" tests to ensure that when we say "Delete," we really mean Delete, not just "Pretend to forget."