This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you have a massive, super-smart library assistant (an AI model) that helps doctors predict which antibiotics will work best against a patient's infection. This assistant has read millions of patient records to learn its craft.
However, there's a new rule called the "Right to be Forgotten" (part of the GDPR privacy law). It says: If a patient asks, "Please delete all my data," you must not only delete their file from the filing cabinet, but you must also make sure the library assistant has completely forgotten them. The assistant must act as if that patient never existed.
The Problem: The "Re-Learn Everything" Trap
Currently, if a patient asks to be forgotten, the only way to guarantee the assistant has truly forgotten them is to fire the assistant and hire a new one, training them from scratch on all the other millions of records.
- The Analogy: Imagine you are a chef who has memorized a recipe book with 1 million pages. If one person says, "Please remove my favorite dish from your memory," the current rule says you must throw away the whole book, rewrite the entire 1 million pages without that one dish, and re-memorize it all.
- The Result: This takes forever. If you get 50 deletion requests a month, your kitchen (computer) would be busy re-writing the book all day, every day. It's too slow and expensive.
The Solution: The "SISA" Method
This paper introduces a clever new way to handle this called SISA (Sharded, Isolated, Sliced, and Aggregated).
- The Analogy: Instead of one giant library assistant reading one giant book, imagine you hire 5 different assistants. You split the 1 million-page book into 5 separate, smaller books (shards). Each assistant only reads their own 200,000 pages.
- How they work together: When a doctor asks a question, all 5 assistants read their specific pages and vote on the answer. The final answer is the average of their votes.
- The Magic of Deletion: Now, if a patient asks to be forgotten, you don't fire everyone. You just check which of the 5 assistants has that patient's data in their specific book. You fire only that one assistant, retrain them on their small 200,000-page book (without the patient), and put them back to work. The other 4 assistants keep doing their jobs.
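The shard-vote-forget cycle described above can be sketched in a few lines of code. This is a toy illustration, not the paper's implementation: the "model" for each shard is a trivial majority-label predictor, and all names (`SISAEnsemble`, `train_shard`, `forget`) are invented for this example. The key point it demonstrates is structural: deleting a record retrains exactly one shard while the other four models are untouched.

```python
# Toy sketch of SISA-style unlearning (illustrative names, not the paper's code).
# Each shard trains an independent model; predictions aggregate by majority vote.
# Forgetting a record retrains ONLY the shard that contained it.
from collections import Counter

NUM_SHARDS = 5

def train_shard(records):
    """'Train' a trivial model: predict the shard's most common label."""
    labels = [label for _, label in records]
    return Counter(labels).most_common(1)[0][0] if labels else None

class SISAEnsemble:
    def __init__(self, dataset):
        # Shard assignment: each record id lives in exactly one shard.
        self.shards = [dict() for _ in range(NUM_SHARDS)]
        for record_id, label in dataset.items():
            self.shards[record_id % NUM_SHARDS][record_id] = label
        self.models = [train_shard(list(s.items())) for s in self.shards]

    def predict(self):
        # Aggregation: majority vote across the five shard models.
        votes = [m for m in self.models if m is not None]
        return Counter(votes).most_common(1)[0][0]

    def forget(self, record_id):
        # Unlearning: drop the record, retrain only its own shard.
        idx = record_id % NUM_SHARDS
        self.shards[idx].pop(record_id, None)
        self.models[idx] = train_shard(list(self.shards[idx].items()))

# Hypothetical dataset: 100 records with a binary resistance label.
dataset = {i: ("resistant" if i % 3 == 0 else "susceptible") for i in range(100)}
ensemble = SISAEnsemble(dataset)
print(ensemble.predict())   # vote across 5 shard models
ensemble.forget(42)         # retrains 1 shard, leaves the other 4 alone
print(ensemble.predict())
```

In a real system each shard model would be a proper classifier (the paper also slices shards so retraining can resume from a checkpoint partway through the shard), but the deletion logic is the same: locate the shard, retrain it, leave the rest running.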
What the Study Found
The researchers tested this "5-assistant" method against the old "fire everyone" method using real medical data from two sources:
- Hospital Records (EHR): Over 1.2 million patient records.
- Genomic Data: Over 400,000 bacterial DNA records.
Here is the breakdown of their findings in simple terms:
1. Speed: A Massive Win
- Old Way: Retraining the whole model took about 67 seconds per deletion.
- SISA Way: Retraining just one small piece took only 7.5 seconds.
- The Result: SISA was about 9 times faster (67 s ÷ 7.5 s ≈ 8.9). Over a year, this saves hours of computer time, making it possible to handle deletion requests almost immediately rather than waiting days.
2. Accuracy: Still Smart
- The big fear was: "If we only retrain a small piece, will the assistant get dumber?"
- The Result: The accuracy dropped by a tiny, almost invisible amount (less than 0.05%). This is well within the safe zone for medical decisions. The assistant is still just as smart as before.
3. Privacy: Did they really forget?
- The researchers tested if a hacker could still guess which patient was deleted by looking at the model's answers.
- The Result: The "SISA" method successfully removed the patient's influence, satisfying the privacy law. Interestingly, the study found that these specific medical AI models are already quite good at not "memorizing" patients too deeply, but SISA ensures the legal requirement is met.
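The privacy test described here is known as a membership-inference attack: the attacker probes the model and looks for behavior that differs between records the model trained on and records it never saw. The sketch below is a deliberately exaggerated toy (an exact-lookup "memorizing" model with invented names), not the paper's statistical attack on model outputs; it only illustrates the logic of the test: before unlearning the model leaks membership, after unlearning it does not.

```python
# Toy membership-inference check (illustrative only; real attacks are
# statistical, comparing model confidences, not an exact-lookup test).
train = {1: "resistant", 2: "susceptible", 3: "resistant"}

def model_predict(record_id, memory):
    # A deliberately "memorizing" model: returns the stored label if seen,
    # otherwise falls back to the majority class.
    return memory.get(record_id, "susceptible")

def attacker_guesses_member(record_id, memory):
    # Attack signal: does the model behave differently on this record than
    # on unseen data? Any non-fallback answer leaks membership here.
    return model_predict(record_id, memory) != "susceptible"

print(attacker_guesses_member(1, train))   # True: record 1's influence leaks
del train[1]                               # unlearning removes the record
print(attacker_guesses_member(1, train))   # False: it now looks unseen
```

The paper's finding was that after SISA retraining, this kind of attack could no longer distinguish the deleted patient from a never-seen one, which is what the legal requirement demands.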
4. What Didn't Work
The study tried other "tricks" to make forgetting faster, like:
- Label Flipping: Telling the computer "pretend this patient's data means the opposite." (This was slow and didn't help.)
- Tree Pruning: Cutting off parts of the decision-making process. (This was fast but made the model less accurate on hospital data, which is dangerous for medical use.)
The Bottom Line
This paper proves that you don't need to burn down the whole library to remove one book. By splitting the work into smaller, independent teams (SISA), hospitals can:
- Comply with privacy laws (delete patient data instantly).
- Save massive amounts of computing power.
- Keep their medical predictions accurate.
It's a practical, efficient blueprint for making AI in healthcare both smart and respectful of patient privacy.