How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences

This study demonstrates that DNA foundation models (DNABERT-2, Evo 2, and NTv2) are vulnerable to model inversion attacks, where adversaries can reconstruct sensitive genomic sequences from shared embeddings with high accuracy, particularly for shorter sequences and per-token representations, thereby highlighting critical privacy risks in Embeddings-as-a-Service frameworks.

Sofiane Ouaari, Jules Kreuer, Nico Pfeifer

Published Tue, 10 Ma

Imagine you have a secret recipe for a delicious cake. To share the idea of the cake with a friend without giving them the actual recipe, you decide to send them a summary instead of the full list of ingredients. You think, "If I just tell them the cake is 'sweet, fluffy, and chocolatey,' they can't steal my secret recipe, right?"

This paper is about testing exactly that idea, but with DNA instead of cake recipes.

The Setting: The "DNA Cloud"

In modern medicine, scientists use massive AI models (called Foundation Models) to understand DNA. These models are like super-smart librarians who have read every human genome ever.

  • The Problem: People want to use these models to help with research, but they can't share the raw DNA (the actual "recipe") because it's too private. It's like your fingerprint; it identifies you uniquely and never changes.
  • The Proposed Solution: Instead of sharing the raw DNA, institutions share Embeddings. Think of an embedding as a digital fingerprint or a summary vector. It's a long list of numbers that captures the "essence" of the DNA sequence without showing the letters (A, C, G, T) directly.
  • The Service: This is called Embeddings-as-a-Service (EaaS). You send your DNA to the cloud, the cloud turns it into a summary (embedding), and sends that summary back to researchers. The promise is: "This summary is safe. You can't get the original DNA back from it."
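To make the "summary vector" idea concrete, here is a minimal numpy sketch. The per-base lookup table is a made-up stand-in for a real foundation model: DNABERT-2, Evo 2, or NTv2 would produce context-dependent vectors with hundreds of dimensions, not a fixed table of eight numbers per letter.

```python
import numpy as np

# Toy stand-in for a DNA foundation model: each base gets a fixed vector.
# (A real model computes context-dependent embeddings via transformer
# layers; this table is purely illustrative.)
rng = np.random.default_rng(0)
BASE_VECTORS = {b: rng.normal(size=8) for b in "ACGT"}

def embed(sequence: str) -> np.ndarray:
    """Return one 'summary' vector per base: a per-token embedding."""
    return np.stack([BASE_VECTORS[b] for b in sequence])

tokens = embed("ACGTAC")
print(tokens.shape)  # (6, 8): six bases, eight numbers each
```

The output is just a grid of numbers with no A/C/G/T in sight, which is exactly why EaaS feels safe at first glance.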

The Attack: The "Reverse Engineer"

The authors of this paper asked a scary question: "What if someone tries to reverse-engineer the summary to get the original recipe back?"

They set up a scenario where a "hacker" (an adversary) intercepts these summaries and tries to use a different AI to reconstruct the original DNA sequence. It's like giving someone a blurry photo of a face and asking them to draw the person's face perfectly based only on that photo.

The Experiments: Testing Three "Librarians"

They tested three different types of DNA AI models (DNABERT-2, Evo 2, and NTv2) using two different ways of making the summary:

  1. The "Per-Token" Summary (The Detailed List):

    • Analogy: Imagine the AI breaks the DNA sentence into words and gives you a summary for every single word in order.
    • Result: Total privacy failure. The hackers could reconstruct the DNA almost perfectly (99% accuracy).
    • Takeaway: If you share a word-by-word summary, you might as well just share the raw DNA. It offers zero privacy.
  2. The "Mean-Pooled" Summary (The Blurry Average):

    • Analogy: Imagine the AI takes the whole sentence, mixes all the words together in a blender, and gives you one single "flavor profile." You lose the order and the specific words, but you get a general idea.
    • Result: Partial Failure. It was harder to reconstruct, but the hackers still did surprisingly well, especially with short DNA snippets.
    • The "Short vs. Long" Twist:
      • Short sequences (10-20 letters): The "blender" didn't mix enough. The summary was still too clear. Hackers could reconstruct 90%+ of the DNA.
      • Long sequences (100 letters): The "blender" worked better. The summary became more scrambled, making it harder to guess the original. However, it was still much better than random guessing.
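The difference between the two summary styles, and why short sequences leak, can be sketched in a few lines. The same toy per-base vectors stand in for a real model (whose vectors would be context-dependent, so this understates the leakage for real per-token embeddings):

```python
import numpy as np
from itertools import product

# Toy, context-free base vectors; a real foundation model's per-token
# embeddings depend on context and carry even more information.
rng = np.random.default_rng(0)
BASE_VECTORS = {b: rng.normal(size=8) for b in "ACGT"}

def mean_pool(sequence: str) -> np.ndarray:
    """The 'blender': average all per-token vectors into one summary."""
    return np.mean([BASE_VECTORS[b] for b in sequence], axis=0)

# For a short sequence, a brute-force attacker can simply try every
# candidate and compare summaries:
target = mean_pool("AC")
candidates = ["".join(p) for p in product("ACGT", repeat=2)]
matches = [s for s in candidates if np.allclose(mean_pool(s), target)]
print(matches)  # ['AC', 'CA']: order is blended away, but the content leaks
```

With only two letters, the "blender" leaves just two indistinguishable candidates; as sequences grow, more candidates collide in the average, which is why the 100-letter summaries were harder (though still not safe) to invert.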

The Secret Sauce: How the AI "Reads"

The paper found that the way the AI breaks up the DNA matters a lot.

  • Evo 2 & NTv2: These models read DNA like a typewriter, one letter at a time (or in fixed chunks). This makes it easy for hackers to reverse-engineer the summary.
  • DNABERT-2: This model uses a trick called BPE (Byte Pair Encoding). It's like reading a sentence and grouping common words together (e.g., "th" and "e" become "the").
    • Analogy: If the summary contains the token "the," the hacker doesn't know whether the original text was split as "the," as "th" + "e," or as "t" + "he." In DNA terms, the same stretch of letters can be carved into different multi-letter tokens, so token boundaries no longer line up with base positions. This creates confusion.
    • Result: DNABERT-2 was the hardest to hack because the "grouping" made the summary ambiguous. It's the most secure of the three, though still not perfect.
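The segmentation ambiguity is easy to demonstrate. This sketch uses a tiny made-up vocabulary (real BPE vocabularies like DNABERT-2's are learned from data and much larger) and enumerates every way one short sequence can be split into tokens:

```python
# Sketch of why BPE-style tokenization frustrates inversion: the same
# DNA string can be segmented into tokens in several ways, so a
# token-level summary no longer maps one-to-one onto base positions.
# This vocabulary is invented for illustration.
VOCAB = {"A", "C", "G", "T", "AC", "GT", "ACG"}

def segmentations(seq):
    """Enumerate every way to split seq into vocabulary tokens."""
    if not seq:
        return [[]]
    out = []
    for i in range(1, len(seq) + 1):
        if seq[:i] in VOCAB:
            out += [[seq[:i]] + rest for rest in segmentations(seq[i:])]
    return out

for seg in segmentations("ACGT"):
    print(seg)
# Five different segmentations of the same four bases, e.g.
# ['A', 'C', 'G', 'T'] and ['AC', 'GT'] and ['ACG', 'T']
```

An attacker staring at token-level embeddings must first guess which of these segmentations the model actually used, and that uncertainty compounds along the sequence.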

The Big Reveal

The most important discovery was a simple rule: If the summary looks similar to another summary, the original DNA is also similar.

  • Analogy: If two cake summaries both say "very sweet and chocolatey," the cakes are likely very similar.
  • Because the AI preserves this relationship so well, the hacker can just look at the summary and say, "This looks like that DNA I know," and guess the rest. The paper found that the more the summary preserves the "shape" of the DNA, the easier it is to steal the DNA back.
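This "similar summary, similar DNA" rule is exactly what a nearest-neighbour attack exploits. A minimal sketch, again using toy context-free base vectors and a tiny made-up reference panel in place of a real database of known genomes:

```python
import numpy as np

# If embeddings preserve sequence similarity, an attacker holding a
# reference panel of known sequences can invert a summary by simple
# nearest-neighbour lookup. Toy base vectors stand in for a real model.
rng = np.random.default_rng(1)
BASE_VECTORS = {b: rng.normal(size=8) for b in "ACGT"}

def mean_pool(seq):
    return np.mean([BASE_VECTORS[b] for b in seq], axis=0)

reference_panel = ["ACGTACGT", "TTTTACGT", "GGGGCCCC", "ACGTACGA"]
intercepted = mean_pool("ACGTACGT")  # the "anonymous" shared summary

guess = min(reference_panel,
            key=lambda s: np.linalg.norm(mean_pool(s) - intercepted))
print(guess)  # 'ACGTACGT': the closest summary reveals the closest DNA
```

No model weights, no gradients: the attacker only needs the shared summaries and a pile of candidate sequences to compare against.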

The Conclusion: "Don't Trust the Summary"

The paper concludes that DNA embeddings are not safe enough yet.

  • Sharing detailed summaries is like sharing the raw data.
  • Sharing averaged summaries is like sharing a blurry photo; it's better, but a skilled hacker can still make out the face, especially for short snippets.

The Warning: Before we start sharing DNA summaries widely in hospitals and research labs, we need to invent better ways to scramble them (like adding noise or using better encryption). Otherwise, we might be accidentally handing out our most private biological secrets in the form of "safe" numbers.
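The "adding noise" idea above can be sketched in two lines. This is a crude illustration only: a real deployment would calibrate the noise scale formally (for example, via a differential-privacy mechanism), and the paper does not prescribe this specific scheme.

```python
import numpy as np

# Crude sketch of the "add noise" mitigation: perturb the summary
# before release so nearby sequences become harder to tell apart.
# The scale here is arbitrary, not a calibrated privacy guarantee.
rng = np.random.default_rng(2)

def noisy_release(embedding: np.ndarray, scale: float = 1.0) -> np.ndarray:
    return embedding + rng.normal(scale=scale, size=embedding.shape)

clean = np.ones(8)
released = noisy_release(clean)
print(np.linalg.norm(released - clean))  # nonzero: the summary is blurred
```

The open question the paper leaves us with is how much blurring is enough to stop inversion while keeping the summaries useful for research.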

In short: The "privacy shield" of DNA embeddings is currently full of holes. If you share the summary, you might as well be sharing the secret.