The Big Picture: The "Right to be Forgotten" Problem
Imagine you have a giant, super-smart chef (an AI model) who has memorized a massive cookbook (the training data). One day, a customer says, "I want to forget that I ever gave you my secret family recipe. Please remove it from your memory."
In the digital world, this is called Machine Unlearning. The goal is to make the AI forget specific data so it can't accidentally reveal it later.
The Problem:
The easiest way to forget is to throw away the whole cookbook and start cooking from scratch with only the remaining recipes. But that takes forever and costs a fortune. So, instead, chefs try to just "erase" the specific recipe from the book.
The Catch:
The paper argues that when you try to erase a recipe by just scribbling over it or tearing out a page, you leave a ghost. If a nosy neighbor (an attacker) compares the "Before" book and the "After" book, they can see exactly where the scribble is. By looking at the difference, they can reconstruct the secret recipe you tried to hide.
The authors call this the "Ghost in the Machine."
Why Does the "Ghost" Appear?
The paper identifies two main reasons why these erasures fail to hide the secret:
The "Big Stain" (Large Gradient Norms):
Imagine some recipes are very complex and unique. When the chef tries to remove them, they have to make a huge, dramatic change to the book to get rid of them. This huge change leaves a massive, obvious stain. Attackers can easily spot these big stains and reverse-engineer the recipe.
- Analogy: If you try to erase a tiny pencil mark, it's hard to see. If you try to erase a giant, bold marker drawing, the paper gets torn and the hole is obvious.
The "Too Close for Comfort" (Parameter Proximity):
Most current methods try to be gentle. They tweak the book just enough to forget the recipe but keep the rest of the book looking exactly the same. Because the "After" book is so similar to the "Before" book, the difference between them is almost entirely just the secret recipe.
- Analogy: If you take a photo of a room, then move a chair one inch to the left and take another photo, the difference between the two photos is just the chair. It's easy to see exactly where the chair moved.
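Both failure modes can be seen in a toy sketch (not the paper's actual attack): if unlearning is just a small gradient step away from the original weights, the before/after weight difference points exactly along the gradient of the "secret" example, so an attacker who holds both models recovers the forgetting signal directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model with squared-error loss: loss(w) = 0.5 * (w @ x - y)^2
w_before = rng.normal(size=5)
x_secret, y_secret = rng.normal(size=5), 1.0

def grad(w, x, y):
    # Gradient of the squared error with respect to the weights
    return (w @ x - y) * x

# Naive "unlearning": one gradient ASCENT step on the secret example,
# nudging the weights just far enough to raise its loss (proximity!).
lr = 0.1
w_after = w_before + lr * grad(w_before, x_secret, y_secret)

# The attacker diffs the two models. Because the models stay close,
# the diff IS the secret example's gradient, just rescaled.
delta = w_after - w_before
g = grad(w_before, x_secret, y_secret)
cosine = delta @ g / (np.linalg.norm(delta) * np.linalg.norm(g))
print(round(float(cosine), 4))  # 1.0 — the diff points straight at the secret
```

And since the gradient here is proportional to `x_secret` itself, a big gradient doesn't just leave a big stain; it hands the attacker the recipe.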
The Solution: WARP (Weight Teleportation)
The authors propose a new defense called WARP. Instead of just erasing the recipe, they use a magic trick called "Teleportation."
How Teleportation Works:
Deep neural networks (the AI brains) have a weird property called permutation symmetry: you can rearrange the internal wiring in many different ways without changing what the AI actually does or says. It's like a Rubik's Cube. You can twist the colors around, and the cube still solves the same puzzle, but the colors are in different spots.
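This symmetry is easy to demonstrate on a tiny two-layer network: shuffle the hidden units (permute the rows of the first layer and the matching columns of the second), and the network computes exactly the same function even though the weight matrices look completely different.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer MLP: y = W2 @ relu(W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)

def forward(W1, b1, W2, b2, x):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

# "Teleport": permute the 8 hidden units. Rows of W1 and entries of b1
# move together, and the matching columns of W2 move the same way.
P = rng.permutation(8)
W1p, b1p, W2p = W1[P], b1[P], W2[:, P]

x = rng.normal(size=4)
same = np.allclose(forward(W1, b1, W2, b2, x),
                   forward(W1p, b1p, W2p, b2, x))
print(same)  # True: same function, differently arranged weights
```

With 8 hidden units there are already 8! = 40,320 equivalent wirings, and real networks have astronomically more.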
The WARP Strategy:
- The Erase: First, the AI tries to forget the specific data (the recipe).
- The Teleport: Before or during the erasing, the AI performs a "teleport." It shuffles its internal weights (the wiring) using a mathematical symmetry.
- It keeps the AI's performance on the other recipes exactly the same (the chef still cooks great food).
- But it moves the "forgetting" process into a completely different, hidden part of the internal structure.
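A minimal sketch of the idea (not the paper's actual WARP algorithm; the erase step is stood in for by a small random update) shows why the attacker's diff becomes useless: the teleport preserves the outputs exactly, but it swamps the tiny erase signal in the weight difference.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Before" model: tiny two-layer MLP standing in for the trained network.
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)

def forward(W1, b1, W2, b2, x):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

# The Erase: a small weight update standing in for the unlearning step.
dW1 = 0.01 * rng.normal(size=W1.shape)

# The Teleport: shuffle hidden units with a fixed cyclic shift (any
# permutation works); outputs are untouched, weight locations are not.
P = np.roll(np.arange(8), 1)
W1a, b1a, W2a = W1[P] + dW1[P], b1[P], W2[:, P]

x = rng.normal(size=4)
# Sanity check: the teleport alone changes nothing the user can see.
teleport_only = forward(W1[P], b1[P], W2[:, P], b2, x)
assert np.allclose(teleport_only, forward(W1, b1, W2, b2, x))

# What the attacker sees: the before/after diff is dominated by the
# shuffle, not by the tiny erase update they were trying to isolate.
diff_norm = np.linalg.norm(W1a - W1)
erase_norm = np.linalg.norm(dW1)
print(diff_norm > 10 * erase_norm)  # True: the teleport drowns out the erase
```

The torn page is still there, but it is buried under a library's worth of rearranged shelves.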
The Result:
When an attacker compares the "Before" and "After" models, they don't just see the secret recipe. They see the secret recipe PLUS a chaotic, random shuffle of the internal wiring.
- Analogy: Imagine you want to hide a secret note in a library. Instead of just tearing the page out (which leaves a hole), you teleport the entire library to a different dimension, rearrange the books, and then tear the page out. When the attacker looks at the difference, they see a mess of rearranged books and a torn page. They can't tell which part was the secret note and which part was just the random rearrangement.
What Did They Prove?
The team tested this on six different "erasing" methods using famous datasets (like images of cats and dogs). They found:
- The Attackers Got Confused: When attackers tried to guess whether a specific image was in the training set (Membership Inference) or to rebuild the image from the model (Reconstruction), they failed much more often.
- The Numbers:
- In "Black-box" tests (where the attacker only sees the answers), WARP reduced the attacker's success by up to 64%.
- In "White-box" tests (where the attacker sees the internal code), WARP reduced success by up to 92%.
- The Taste Didn't Change: Crucially, the AI didn't get dumber. It still remembered all the other recipes perfectly.
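For context, the simplest form of membership inference is a loss threshold: models tend to assign lower loss to examples they trained on, so an attacker flags low-loss examples as "members." The sketch below uses made-up loss distributions (the numbers are purely illustrative, not from the paper) to show why this works when the gap is clear, and why shrinking that gap defeats the attack.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative losses: trained-on examples score lower than unseen ones.
member_losses = rng.normal(loc=0.2, scale=0.1, size=1000)     # seen in training
nonmember_losses = rng.normal(loc=1.0, scale=0.4, size=1000)  # never seen

# Loss-threshold membership inference: guess "member" if loss is low.
threshold = 0.6
tpr = float(np.mean(member_losses < threshold))     # members correctly flagged
fpr = float(np.mean(nonmember_losses < threshold))  # non-members wrongly flagged
print(tpr > 0.9 and fpr < 0.3)  # True: a clear loss gap makes members obvious
```

A defense like WARP succeeds to the extent that it pushes the two distributions together, leaving the attacker's guesses close to coin flips.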
The Takeaway
WARP is like a "Privacy Shield." It recognizes that simply deleting data isn't enough, because the act of deleting leaves a trace. By adding a layer of "mathematical magic" (symmetry teleportation) that scrambles the internal structure without changing the output, it makes the trace of the deleted data nearly impossible to distinguish from random noise.
It turns a simple "eraser" into a "confusion machine," ensuring that when you ask an AI to forget, it truly forgets, and no one can peek behind the curtain to see what was there.