PRISM-G: an interpretable privacy scoring method for assessing risk in synthetic human genome data

The paper introduces PRISM-G, an interpretable, model-agnostic framework that assesses privacy risks in synthetic human genome data by aggregating proximity, kinship, and trait-linked exposure metrics into a unified 0–100 score, demonstrating that diverse generative models exhibit distinct vulnerability patterns that single similarity metrics fail to capture.

Correa Rojo, A., Moreau, Y., Ertaylan, G.

Published 2026-03-25

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a massive, incredibly detailed library of people's genetic blueprints (their DNA). This library is a goldmine for scientists trying to cure diseases, but it's also a treasure trove of private information. If someone steals a book from this library, they could potentially figure out who that person is, what diseases they might get, or even find their long-lost relatives.

Because of this, scientists are trying to create "Fake DNA Libraries" (synthetic data). These are made-up genetic profiles that look and act just like real ones for research purposes, but they don't belong to any actual person. The hope is that scientists can use these fake libraries without worrying about privacy.

The Problem:
But how do we know the "fake" library is actually safe?
Currently, people check safety by asking: "Is this fake person's DNA close to a real person's DNA?"
The authors of this paper say, "That's not enough!"

Think of it like a game of "Find the Imposter" in a crowded room:

  1. The Old Way: You just check if the imposter is standing right next to a real person. If they are far away, you think, "Safe!"
  2. The Reality: Even if the imposter is far away, they might still be dangerous if:
    • They are wearing the exact same family heirloom (recreating family structures).
    • They have a very rare birthmark that only one other person in the world has (rare genetic traits).
    • They are wearing a shirt that says "I was in the training class" (membership inference).

If the "fake" library accidentally copies these specific details, a clever hacker could still figure out who the real people are, even if the fake people look different on the surface.


The Solution: PRISM-G (The Privacy Scorecard)

The authors created a new tool called PRISM-G. Think of PRISM-G as a high-tech security scanner or a privacy report card for these fake DNA libraries. Instead of just one simple check, it looks at the data through three different "lenses" to give a single, easy-to-understand score from 0 to 100.

Here is how the three lenses work, using simple analogies:

1. The "Proximity" Lens (PLI): The "Too Close for Comfort" Test

  • What it checks: Does any fake person look suspiciously similar to a real person?
  • The Analogy: Imagine you are looking for a specific person in a crowd. If you find a fake person standing within arm's reach of the real person, that's a red flag. PRISM-G checks if the fake data is "too close" to real data in a mathematical sense.
  • The Score: If they are too close, the risk goes up.
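
The paper's summary doesn't spell out the distance measure behind the proximity lens, so here is a hypothetical sketch of the idea: for each synthetic genome, find its nearest real genome and flag it if the two match at almost every variant site. All names, sizes, and the 5% threshold below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical genotype matrices: rows = individuals, columns = variant
# sites, values in {0, 1, 2} (count of alternate alleles).
real = rng.integers(0, 3, size=(100, 500))
synthetic = rng.integers(0, 3, size=(100, 500))

def min_hamming_fraction(synth_row, real_matrix):
    """Fraction of sites at which the closest real genome still differs."""
    mismatches = (real_matrix != synth_row).mean(axis=1)
    return mismatches.min()

# A synthetic genome whose nearest real neighbour is very close is a red flag.
nearest = np.array([min_hamming_fraction(s, real) for s in synthetic])
too_close = (nearest < 0.05).mean()  # share "within arm's reach" of a real person
print(f"{too_close:.1%} of synthetic genomes are suspiciously close to a real one")
```

With fully random data, as here, no synthetic genome sits near a real one; a generator that memorizes its training set would push `nearest` toward zero.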

2. The "Kinship" Lens (KRI): The "Family Reunion" Test

  • What it checks: Did the fake data accidentally recreate family trees or weirdly close relationships that shouldn't exist?
  • The Analogy: Imagine you are making a fake photo album of a family. If the fake photos accidentally show a "cousin" who is actually a twin, or if the fake family has way too many people who look like they are related, a hacker could use genealogy websites to trace back to the real family. PRISM-G checks if the fake data is "replaying" real family secrets.
  • The Score: If the fake data has too many "fake relatives," the risk goes up.
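
One simple way to hunt for "fake relatives" (a sketch of the general idea, not the paper's exact kinship statistic) is to score allele sharing between every pair of synthetic genomes and flag pairs that share far more than the cohort average, the way close relatives would. The identity-by-state score and the 3-sigma cutoff below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical synthetic genotypes (0/1/2 allele counts); illustrative only.
synthetic = rng.integers(0, 3, size=(60, 400))

def ibs_similarity(a, b):
    """Crude identity-by-state score: average allele sharing per site (0..1)."""
    return 1.0 - np.abs(a - b).mean() / 2.0

n = len(synthetic)
pair_scores = np.array([
    ibs_similarity(synthetic[i], synthetic[j])
    for i in range(n) for j in range(i + 1, n)
])

# Pairs sharing far more alleles than the cohort average behave like
# "fake relatives" that genealogy-style matching could trace back.
threshold = pair_scores.mean() + 3 * pair_scores.std()
fake_relatives = int((pair_scores > threshold).sum())
print(f"{fake_relatives} synthetic pairs look like close relatives")
```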

3. The "Trait" Lens (TLI): The "Rare Eye Color" Test

  • What it checks: Does the fake data contain rare genetic quirks that act like a fingerprint?
  • The Analogy: Imagine a room full of people. Most have brown eyes. But one person has a very rare eye color. If your "fake" library accidentally includes a person with that exact rare eye color, and you know that only one real person in the world has it, you've just identified them! PRISM-G checks for these rare, unique "fingerprint" traits.
  • The Score: If the fake data has too many rare, unique traits, the risk goes up.
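
The fingerprint idea can be sketched in a few lines: find variants carried by exactly one real individual ("singletons"), then count how many synthetic genomes carry one of them. The matrices, carrier frequency, and variable names are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative binary variant matrices: True = carries the variant.
# 0.5% carrier frequency, so most variants are rare.
real = rng.random((200, 300)) < 0.005
synthetic = rng.random((200, 300)) < 0.005

# A variant carried by exactly one real individual acts like a fingerprint.
real_counts = real.sum(axis=0)
singleton_sites = np.where(real_counts == 1)[0]

# Count synthetic genomes that carry at least one of these fingerprints.
exposed = int(synthetic[:, singleton_sites].any(axis=1).sum())
print(f"{exposed} synthetic genomes carry a variant unique to one real person")
```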

How the Score Works

PRISM-G takes the results from these three tests and combines them into a single number: 0 to 100.

  • Below 50 (Green Zone): Safe! The fake data looks good, and the risk of identifying real people is low.
  • 50–90 (Amber Zone): Caution. There are some leaks. You might be able to identify some people or family groups.
  • 90 and above (Red Zone): Dangerous! The fake data is too similar to the real thing. It's basically a leak.
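
As a minimal sketch, the three sub-scores might be folded into one 0–100 number like this. The plain average used here is an illustrative assumption; the paper's actual aggregation may weight the lenses differently.

```python
# Each input is a 0-100 risk sub-score; higher means riskier.
def prism_g_score(proximity, kinship, trait):
    return (proximity + kinship + trait) / 3.0

def zone(score):
    if score < 50:
        return "green"   # low re-identification risk
    if score < 90:
        return "amber"   # partial leakage: some people or families exposed
    return "red"         # synthetic data is effectively a leak

print(zone(prism_g_score(30, 40, 35)))  # → green
```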

What They Found

The researchers tested three different ways of making fake DNA: GANs (generative adversarial networks), RBMs (restricted Boltzmann machines), and a tool called Genomator:

  • The "GAN" method: Was generally the safest. It created fake data that was diverse enough to hide real people.
  • The "RBM" method: Was the riskiest. It accidentally "memorized" too many rare details and family connections, making it easy to identify real people.
  • The "Genomator" method: Was in the middle. Its safety depended on how strictly it was programmed. If you told it to be very strict, it was safer; if you let it be loose, it became riskier.

Why This Matters

This paper is like giving governments and hospitals a standardized safety rating for their data.

  • Before, they might have said, "Our fake data is safe because it's not identical to real data."
  • Now, with PRISM-G, they can say, "Our fake data has a score of 35. It passed the proximity test, the family test, and the rare-trait test. It is safe to share with researchers in other countries."

This helps build trust. It allows scientists to share life-saving genetic data across borders without worrying that they are accidentally exposing the private lives of the people who donated their DNA. It turns a complex, scary math problem into a simple, understandable traffic light system.
