Imagine you have a super-smart robot that has read millions of DNA sequences. This robot, called a Genomic Language Model (GLM), is designed to help scientists understand biology, predict diseases, and discover new medicines. It's like a brilliant librarian who has memorized the entire library of life.
But here's the scary part: What if this robot remembers too much?
Just like a student who memorizes the exact answers to a practice test instead of learning the concepts, this robot might be "rote memorizing" specific people's private DNA. If someone asks the robot a question, it might accidentally spit out a real person's genetic code, revealing their identity, health risks, or family secrets. Since you can't change your DNA like you can change a password, this is a permanent privacy disaster.
This paper is a security audit designed to figure out how likely these robots are to leak secrets.
The Detective's Toolkit: Three Ways to Catch a Leaker
The researchers didn't just ask the robot, "Did you memorize this?" They used three different "detective tricks" to catch the robot in the act. Think of it like testing a vault with three different methods:
The "Too Easy" Test (Perplexity):
- The Analogy: Imagine a teacher giving a test. If a student gets a question they've seen before, they answer instantly and confidently. If they see a new question, they hesitate.
- The Test: The researchers feed the robot DNA sequences it saw during training alongside new ones it hasn't. If the robot is far more confident on the old sequences than on the new ones (much lower "perplexity," the model's confusion score), that gap is a sign it memorized them.
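The "Too Easy" test can be sketched in a few lines. This is a toy illustration, not the paper's exact pipeline: the per-token log-probabilities below are made up, standing in for what a real genomic model would assign.

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp of the average negative log-probability per token.
    # Lower perplexity means the model is less "confused" by the sequence.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs a model might assign:
seen_logprobs = [-0.1, -0.2, -0.1, -0.15]   # sequence from the training set
novel_logprobs = [-1.2, -0.9, -1.4, -1.1]   # sequence the model never saw

ppl_seen = perplexity(seen_logprobs)
ppl_novel = perplexity(novel_logprobs)

# A suspiciously large gap between the two hints at memorization.
print(f"seen: {ppl_seen:.2f}  novel: {ppl_novel:.2f}")
```

The test itself is just that comparison: if training-set sequences are consistently "too easy" relative to fresh ones, the model is leaning on memory rather than learned biology.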
The "Magic Word" Test (Canary Extraction):
- The Analogy: Imagine a spy plants a fake, unique word (a "canary") in a book. Later, they ask the robot to finish a sentence starting with that word. If the robot finishes the sentence with the exact fake word the spy planted, the spy knows the robot memorized the book.
- The Test: The researchers planted 100 fake, random DNA strings (the canaries) into the robot's training data. They then asked the robot to predict what comes next. If the robot spits out the fake string, it's a confirmed leak.
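The canary check itself is a simple exact-match comparison. Here is a minimal sketch with a stand-in "model" (a lookup table simulating perfect memorization); the prefix string and canary length are invented for illustration, not taken from the paper.

```python
import random

random.seed(0)

# Plant a fake, random DNA string (the "canary").
canary_prefix = "ATTGCC"
canary_secret = "".join(random.choice("ACGT") for _ in range(12))

# Stand-in for a model that memorized its training data verbatim:
# it simply looks up the continuation it saw after a given prefix.
memorized = {canary_prefix: canary_secret}

def generate(prefix, n_tokens):
    # Toy "generation": return the memorized continuation, if any.
    return memorized.get(prefix, "")[:n_tokens]

# Extraction test: prompt with the canary's prefix and compare.
leak = generate(canary_prefix, len(canary_secret)) == canary_secret
print("canary extracted:", leak)  # True => confirmed leak
```

Because the canary is random, the model cannot "guess" it from biology; reproducing it verbatim is only possible through memorization, which is what makes this a confirmed leak rather than a statistical hint.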
The "Did You See This?" Test (Membership Inference):
- The Analogy: A detective asks a suspect, "Did you see this specific car at the crime scene?" If the suspect's behavior changes slightly when talking about that car compared to a random car, the detective knows they were there.
- The Test: The researchers ask the robot to guess if a specific DNA sequence was part of its training data. If the robot can guess correctly more often than random chance, it means the model "remembers" who was in the training room.
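A common baseline for this attack (not necessarily the exact variant the paper uses) is loss thresholding: guess "member" whenever the model's loss on a sequence is below some cutoff. The losses and threshold below are made-up numbers for illustration.

```python
# Hypothetical per-sequence losses; training members tend to score lower.
member_losses = [0.4, 0.5, 0.3, 0.6]
nonmember_losses = [1.1, 0.9, 1.3, 0.8]

threshold = 0.7

def guess_member(loss):
    # Low loss => the model finds the sequence familiar => guess "member".
    return loss < threshold

correct = (sum(guess_member(l) for l in member_losses)
           + sum(not guess_member(l) for l in nonmember_losses))
accuracy = correct / (len(member_losses) + len(nonmember_losses))
print(f"attack accuracy: {accuracy:.2f}")  # 0.50 would be random chance
```

The headline number is how far the attack's accuracy sits above 50%: any consistent edge over a coin flip means the model betrays who was in its training room.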
The Big Experiment
To make this fair, the researchers created a controlled environment. They took four different types of "robots" (different AI architectures) and trained them on four different types of "libraries" (datasets ranging from fake random DNA to real bacterial, yeast, and human genomes).
They also played a game of "How many times do I show it?"
They planted the fake "canary" DNA strings into the training data 1 time, 5 times, 10 times, and 20 times. They wanted to see if showing the robot the same secret over and over made it memorize it better.
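Mechanically, this part of the experiment is just corpus construction: duplicate the canary a chosen number of times and mix it into the training data. A minimal sketch, with invented placeholder sequences:

```python
import random

random.seed(1)

def make_training_data(real_sequences, canary, repeats):
    # Insert `repeats` copies of the canary into the corpus and shuffle,
    # so the duplicates are scattered rather than clustered.
    data = list(real_sequences) + [canary] * repeats
    random.shuffle(data)
    return data

real = [f"SEQ{i}" for i in range(100)]  # stand-ins for real genome chunks
for k in (1, 5, 10, 20):
    corpus = make_training_data(real, "ATTGCCGTA", repeats=k)
    print(f"repeats={k}: canary appears {corpus.count('ATTGCCGTA')} times")
```

Training one model per repetition level then lets the researchers plot extraction success against exposure count.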
What They Found (The Plot Twist)
The results were surprising and important:
- Repetition is the Enemy: Just like in regular language models, the more times a piece of data is repeated in the training, the more likely the robot is to memorize it. If they showed the fake DNA 20 times, the robot almost always memorized it perfectly.
- Size Doesn't Always Save You: One of the robots was a massive, 7-billion-parameter model (a "super-genius") that was only "fine-tuned" lightly (a technique called LoRA, which is supposed to be safer). The researchers thought this big robot would be safer because it was only tweaked a little. They were wrong. This big robot memorized the secrets better than the smaller ones, recovering 100% of the fake DNA strings.
- One Test Isn't Enough: This is the most critical finding.
- Robot A was great at hiding the secrets from the "Magic Word" test but failed the "Too Easy" test.
- Robot B failed the "Magic Word" test but was very obvious in the "Too Easy" test.
- Conclusion: If you only check one thing, you might think a robot is safe when it's actually leaking secrets in a different way. You need to use all three tests to get the full picture.
The Takeaway
The paper concludes that Genomic AI models are currently leaking private data.
It's not just a theoretical risk; it's happening right now. The researchers built a "Report Card" (a Maximum Vulnerability Score) that combines all three tests. They found that no single robot is perfect, and some are dangerously bad at keeping secrets.
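The idea behind a maximum vulnerability score can be sketched as taking the worst result across the three tests, on the logic that a model is only as safe as its weakest defense. The per-test scores below are invented for illustration and are not the paper's actual numbers.

```python
def max_vulnerability(scores):
    # Report card: take the worst (highest) vulnerability across all tests,
    # so a model cannot look "safe" by passing only one of them.
    return max(scores.values())

# Hypothetical per-test vulnerability scores in [0, 1] for two models:
robot_a = {"perplexity_gap": 0.9, "canary_extraction": 0.1, "membership_inference": 0.4}
robot_b = {"perplexity_gap": 0.2, "canary_extraction": 0.8, "membership_inference": 0.3}

# Either single test alone would call one of these models "safe";
# the max score flags both as leaky.
print(max_vulnerability(robot_a), max_vulnerability(robot_b))  # 0.9 0.8
```

Taking the maximum rather than the average is the point: averaging would let one strong defense mask a catastrophic weakness elsewhere.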
The Moral of the Story:
Before we let these DNA-reading robots loose in hospitals and research labs, we can't just trust them. We need to run a full battery of security tests (the three vectors) to make sure they aren't accidentally spilling our genetic secrets. If we don't, we risk a future where our DNA is leaked not because of a hacker, but because an AI model simply "remembered" it too well.