This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are a doctor trying to predict who might get sick in the future based on their DNA. You have a massive toolbox containing 46 different "risk calculators" (called Polygenic Risk Scores, or PRS). Each calculator claims to be the best at predicting diseases like heart issues, depression, or asthma.
However, these calculators are all built differently. Some need a huge amount of data, some run slowly, some crash if the data isn't perfect, and some are great at predicting height but terrible at predicting heart disease. Until now, comparing them was like trying to compare a Ferrari, a bicycle, and a boat by seeing who wins a race on a track—they all have different rules, requirements, and strengths.
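Under the hood, every one of these calculators produces the same kind of output: a single number per person, computed as a weighted sum of how many risk alleles that person carries, with weights taken from a genome-wide association study. A minimal sketch of that core idea (all SNP names, genotypes, and effect sizes below are invented for illustration; real scores use thousands to millions of variants):

```python
# Core of any polygenic risk score: a weighted sum of allele counts.
# The SNP IDs, dosages, and weights here are made up for illustration.

def polygenic_risk_score(dosages, weights):
    """dosages: SNP -> allele count (0, 1, or 2).
    weights: SNP -> GWAS effect size. SNPs without a weight are skipped."""
    return sum(count * weights[snp]
               for snp, count in dosages.items()
               if snp in weights)

person = {"rs001": 2, "rs002": 0, "rs003": 1}    # hypothetical genotypes
gwas_weights = {"rs001": 0.12, "rs003": -0.05}   # hypothetical effect sizes

score = polygenic_risk_score(person, gwas_weights)
print(score)  # 2*0.12 + 1*(-0.05) ≈ 0.19
```

The 46 tools differ not in this final sum but in how they choose which variants to include and how they adjust the weights — which is where all the trade-offs below come from.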
What this paper did:
The authors built a standardized "testing track" to race all 46 calculators fairly. They didn't just look at who won; they looked at how fast they ran, how much fuel (computer memory) they used, and whether they broke down when the road got bumpy.
Here is the breakdown of their findings using simple analogies:
1. The "One-Size-Fits-All" Myth
The Analogy: Imagine you are buying a shoe. You might think, "I'll just buy the most expensive shoe; it must be the best for everything."
The Reality: The study found that there is no single "best" calculator.
- If you want to predict Height, one specific calculator (LDAK-GWAS) was the clear winner.
- If you want to predict Asthma, a different calculator (LDpred-2-Grid) took the crown.
- If you want to predict Depression, LDAK-GWAS (the height winner) again took the top spot.
The Lesson: You can't just pick one tool and use it for every disease. You have to pick the right tool for the specific job, just as you wouldn't use a snow shovel to dig a garden bed.
2. The "Helper" vs. The "Solo Act"
The Analogy: Imagine trying to guess if it will rain.
- The Null Model: You just look at the sky (basic covariates like age and sex).
- The PRS Model: You only check a weather satellite (the DNA score on its own).
- The Full Model: You look at the sky and the satellite together (covariates + DNA score).
The Reality: The study tested if adding the DNA score actually helped.
- For Height, adding the DNA score was like adding a supercomputer to a basic calculator—it made a huge difference.
- For some other conditions, like Gastro-Reflux, the DNA score added little value, because the basic factors (the "sky") already did most of the predictive work.
- Crucial Point: The study showed that sometimes a calculator looks "good" only because it's working with a team of experts (covariates). If you take the team away, the calculator might look much weaker.
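In statistical terms, this is a nested-model comparison: fit the outcome on covariates alone, then on covariates plus the PRS, and measure how much the fit improves. A toy sketch on synthetic data (all coefficients and sample sizes invented; real evaluations use held-out test sets and metrics such as AUC for binary traits):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: age and sex are covariates, prs is a genetic score.
age = rng.normal(50, 10, n)
sex = rng.integers(0, 2, n).astype(float)
prs = rng.normal(0, 1, n)
# The trait depends on all three, plus noise (made-up coefficients).
trait = 0.05 * age + 0.3 * sex + 0.8 * prs + rng.normal(0, 1, n)

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_null = r_squared(np.column_stack([age, sex]), trait)       # covariates only
r2_full = r_squared(np.column_stack([age, sex, prs]), trait)  # covariates + PRS
print(f"null R^2 = {r2_null:.3f}, full R^2 = {r2_full:.3f}")
print(f"incremental R^2 from the PRS = {r2_full - r2_null:.3f}")
```

The "incremental R²" is the honest measure of what the DNA score itself contributes: a tool that only shines when the covariates do the heavy lifting has a small increment.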
3. The "Fragile vs. Tough" Tools
The Analogy: Imagine a race where some cars are high-performance sports cars that need perfect fuel and a smooth track, while others are rugged trucks that can handle mud and rocks but are slower.
The Reality: The study didn't just measure accuracy; they measured operational complexity (how hard the tool is to use).
- The Rugged Trucks: Tools like PRSice-2 and Lassosum were the winners here. They were accurate, didn't crash often, and didn't need massive amounts of computer memory. They are the "workhorses" you can trust in the real world.
- The Sports Cars: Some tools were incredibly accurate but required so much computer power, took hours to run, or crashed if the data had even one tiny error. These are great for research labs with supercomputers but might be too fragile for a busy hospital.
- The Broken Cars: Some tools simply refused to run on certain data types, crashing immediately.
4. The "Tuning Knob" Problem
The Analogy: Imagine a radio. If you don't tune the frequency correctly, you get static.
The Reality: Many of these calculators have "knobs" (hyperparameters) that need to be turned to get the best result.
- The study found that for many tools, the p-value threshold (a setting that decides which genetic clues to include) was the most important knob.
- If you didn't tune this knob correctly, even the best calculator would perform poorly.
- The Takeaway: You can't just use the "default settings." You have to tune the tool for the specific job, or you might be driving a car with the parking brake on.
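The p-value threshold acts as a filter: only variants whose GWAS p-value clears it contribute to the score, so turning the knob changes which "genetic clues" are counted. A toy sketch of this thresholding step (all SNP names, p-values, and weights are invented):

```python
def threshold_score(dosages, gwas, p_threshold):
    """Sum effect_size * allele_count over SNPs passing the p-value filter.
    gwas maps SNP -> (effect_size, p_value); dosages maps SNP -> allele count."""
    return sum(beta * dosages[snp]
               for snp, (beta, p) in gwas.items()
               if p < p_threshold and snp in dosages)

gwas = {  # hypothetical summary statistics
    "rs1": (0.20, 1e-8),
    "rs2": (0.10, 1e-4),
    "rs3": (-0.05, 0.3),
}
person = {"rs1": 1, "rs2": 2, "rs3": 2}

# A strict threshold keeps only the strongest hits; a looser one
# lets weaker (and noisier) signals into the score.
for thr in (5e-8, 1e-3, 1.0):
    print(f"threshold {thr}: score = {threshold_score(person, gwas, thr)}")
```

In a real pipeline the best threshold is not known in advance; it is tuned on a validation set, which is exactly the "knob turning" the study describes.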
5. The "Twin" Effect
The Analogy: Imagine two different brands of coffee makers. They look different and cost different amounts, but they brew coffee that tastes exactly the same.
The Reality: The researchers looked at the "DNA fingerprints" (effect sizes) these tools produced. They found that many tools that use similar math (like the LDpred family) produce almost identical results.
- This means if you are choosing between two tools that are "twins," you should pick the one that is easier to install and runs faster, because the accuracy will be the same.
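Checking whether two tools are "twins" amounts to correlating the per-SNP effect sizes each one outputs. A toy sketch with made-up numbers, where a second tool produces a slightly rescaled, slightly noisy version of the first:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-SNP effect sizes from two related tools:
# tool B is a rescaled copy of tool A plus a little noise.
effects_a = rng.normal(0, 0.1, 500)
effects_b = 0.9 * effects_a + rng.normal(0, 0.005, 500)

r = np.corrcoef(effects_a, effects_b)[0, 1]
print(f"correlation between tools: r = {r:.3f}")
# A near-1 correlation means the two scores will rank people almost
# identically, so the cheaper, faster tool is the sensible choice.
```

Note that a constant rescaling does not hurt the correlation at all — and it also does not change who ranks high or low on the score, which is why such "twins" are interchangeable in practice.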
The Big Picture Conclusion
This paper is like a Consumer Reports guide for genetic risk calculators.
- No Magic Bullet: There is no single tool that wins every time.
- Context Matters: A tool's success depends on the disease you are studying and the data you have.
- Practicality Counts: The "best" tool isn't just the most accurate one; it's the one that doesn't crash, runs fast, and is easy to install.
- Transparency: The authors built a free, open-source "track" (framework) so that other scientists can test new tools fairly in the future, ensuring that the tools we use in hospitals are actually reliable.
In short: If you want to predict genetic risk, don't just grab the first calculator you see. Check the "Consumer Reports" (this study), pick the right tool for your specific disease, make sure your computer can handle it, and tune the settings carefully.