The Big Picture: Protecting Your Digital Photos
Imagine you have a private photo album. You don't want a giant tech company to use your photos to train their AI. To stop them, you decide to add a tiny, invisible "glitch" to every photo.
This glitch is so subtle that your eyes can't see it, but it tricks the AI into learning the wrong things. Instead of learning "This is a cat," the AI learns "This is a cat because of this tiny glitch." Photos protected this way are called Unlearnable Examples (UEs). It's like putting a "Do Not Read" spell on your book.
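The "invisible glitch" idea can be made concrete with a toy sketch: a perturbation whose size is capped at a tiny per-pixel budget so it stays imperceptible. Real unlearnable examples carefully optimize this noise; here it is just random, and the function name and epsilon value are illustrative assumptions, not from the paper.

```python
import numpy as np

def add_invisible_glitch(image, epsilon=8 / 255, seed=0):
    """Add a tiny, bounded perturbation to an image with pixels in [0, 1].

    The noise here is random purely for illustration; real unlearnable
    examples optimize it so a model latches onto the noise instead of
    the true image content.
    """
    rng = np.random.default_rng(seed)
    # Sample noise inside an imperceptible L-infinity budget.
    noise = rng.uniform(-epsilon, epsilon, size=image.shape)
    # Keep pixel values valid after perturbing.
    return np.clip(image + noise, 0.0, 1.0)
```

The key property is the epsilon cap: no pixel moves by more than about 3% of its range, which is why the change is invisible to the eye.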
The Problem: The AI is Too Smart (It Has "Pretraining")
For years, these "Do Not Read" spells worked great on AI models that were learning from scratch (like a baby learning to talk for the first time).
However, modern AI doesn't start from scratch. It starts with Pretraining.
- The Analogy: Imagine the AI isn't a baby anymore; it's a PhD student who has already read millions of books and studied the world extensively.
- The Failure: When you try to trick this PhD student with your tiny glitch, they ignore it. They say, "I know what a cat looks like because I've seen a million cats. I don't need to rely on your weird glitch." They use their existing knowledge (their "priors") to see past the trick and learn the truth anyway.
The paper's main discovery: Existing protection methods fail when the AI is already smart (pretrained). The AI's "common sense" overrides your "Do Not Read" spell.
The Solution: BAIT (Binding Artificial Perturbations to Incorrect Targets)
The authors created a new method called BAIT to trick the PhD student. Instead of just adding a glitch, they change the rules of the game entirely.
Here is how BAIT works, using a "Wrong Answer Key" analogy:
- The Old Way (Failing): You show the AI a picture of a cat with a glitch and say, "This is a cat." The AI thinks, "I know it's a cat, ignore the glitch."
- The BAIT Way: You show the AI the picture of the cat with the glitch, but you force the AI to believe: "This is actually a toaster."
- The Inner Game: The AI trains as usual, trying to learn the true labels (Cat = Cat).
- The Outer Game (The Trap): BAIT optimizes the glitch against that training, constantly steering the AI toward the wrong target: "If you see this glitch, you must call it a toaster!"
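The inner/outer game described above is a two-level tug-of-war, which can be sketched on a toy linear classifier. Everything below (the loss, learning rates, and function names) is a hypothetical illustration of the general idea, not the paper's actual algorithm.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def onehot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def bait_style_step(W, x, delta, y_true, y_wrong,
                    lr_model=0.1, lr_noise=0.1, epsilon=0.05):
    """One round of the inner/outer game on a toy linear classifier.

    Inner game: the 'AI' updates its weights W to fit the perturbed
    image to the TRUE label (it tries to learn normally).
    Outer game: the defender updates the perturbation delta so the
    model's prediction moves toward the WRONG target label.
    """
    x_adv = x + delta

    # Inner game: cross-entropy gradient step on the model weights.
    p = softmax(W @ x_adv)
    grad_W = np.outer(p - onehot(y_true, len(p)), x_adv)
    W = W - lr_model * grad_W

    # Outer game: gradient step on the glitch toward the wrong label.
    p = softmax(W @ x_adv)
    grad_x = W.T @ (p - onehot(y_wrong, len(p)))
    delta = delta - lr_noise * grad_x
    # Keep the glitch imperceptibly small.
    delta = np.clip(delta, -epsilon, epsilon)
    return W, delta
```

Repeating this step is the "trap": the model keeps trying to learn the truth, while the perturbation keeps rebinding the image to the wrong target.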
The Strategy:
BAIT uses a "Curriculum" (a step-by-step lesson plan) to make the trick harder and harder:
- Step 1 (Easy): Trick the AI into calling a cat a "dog" (similar things).
- Step 2 (Medium): Trick the AI into calling a cat a "car" (random things).
- Step 3 (Hard): Trick the AI into calling a cat a "banana" (completely unrelated things).
By forcing the AI to associate the image with a completely wrong label (a toaster or a banana) every single time, the AI gets confused. It can no longer rely on its "PhD knowledge" to figure out the truth. It is forced to rely on the glitch to get the "right" answer (which is actually the wrong answer).
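The easy-to-hard curriculum above can be sketched as a schedule that picks wrong target labels increasingly dissimilar to the true class. The similarity matrix and the linear schedule here are illustrative assumptions (e.g. similarities could come from label embeddings); the paper's actual curriculum may differ.

```python
import numpy as np

def curriculum_target(similarity, true_class, stage, n_stages=3):
    """Pick a wrong target label that gets 'farther' from the truth
    as the curriculum advances.

    similarity[i, j] scores how alike classes i and j are.
    stage 0 picks a near neighbour (cat -> dog); the final stage
    picks the most unrelated class (cat -> banana).
    """
    # Rank the other classes from most to least similar to the truth.
    order = np.argsort(-similarity[true_class])
    order = order[order != true_class]
    # Map the stage linearly onto that ranking.
    frac = stage / max(n_stages - 1, 1)
    idx = int(round(frac * (len(order) - 1)))
    return int(order[idx])
```

With classes (cat, dog, car, banana) and a similarity row like [1.0, 0.9, 0.3, 0.1] for "cat", stage 0 returns "dog", stage 1 "car", and stage 2 "banana", matching the easy, medium, hard steps above.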
The Results
The paper tested this on many different types of AI models (like ResNet, VGG, and Vision Transformers) using standard datasets (like CIFAR-10 and ImageNet).
- Before BAIT: The AI ignored the protection and learned the images correctly (High accuracy).
- With BAIT: The AI got confused and failed completely. It could no longer tell a cat from a toaster. Its accuracy dropped to the level of random guessing (on a dataset like CIFAR-10, that means blindly picking one of ten classes).
Why This Matters
This is a huge step forward for digital privacy.
- The Old Reality: If you wanted to protect your data, you had to hope the AI wasn't too smart.
- The New Reality: With BAIT, you can protect your data even against the most advanced, pre-trained AI models. It ensures that if someone tries to steal your data to train their AI, they will only learn nonsense.
Summary in One Sentence
The paper found that old tricks to hide data from AI don't work on smart, pre-trained AI, so they invented a new "double-layered" trick (BAIT) that forces even the smartest AI to learn the wrong things, effectively locking your data away.