Benchmarking 80 binary phenotypes from the openSNP dataset using deep learning algorithms and polygenic risk score tools

Imagine your DNA as a massive, ancient library whose books (genes) are written in billions of letters. Most of the text is identical for everyone, but some pages have tiny typos called SNPs (Single Nucleotide Polymorphisms). These typos are part of what makes you different from your neighbor: why you have blue eyes, why you might get a headache, or why you love spicy food.

The big question scientists have been asking is: Can we read these typos to predict who will get sick or what traits they have?

This paper is like a massive "Taste-Test Competition" where researchers tried to see which "reader" (computer algorithm) is best at guessing a person's traits based on their genetic library. They used a public dataset called openSNP, which is like a community library where people voluntarily shared their genetic data and personal habits.

Here is the breakdown of the competition:

The Three Contenders

The researchers set up a race between three different types of "readers" to see who could predict 80 different traits (like "Do you have diabetes?" or "Do you enjoy riding a motorbike?") most accurately.

The Traditional Accountant (Polygenic Risk Scores - PRS):
- How it works: This method is like a traditional accountant. It looks at a list of known typos, adds up their "risk points" based on past studies, and gives you a final score. If your score is high, you are likely to have the trait.
- The Vibe: Reliable, old-school, and based on established rules. It's like using a map drawn by previous explorers.
The Sharp Detective (Machine Learning - ML):
- How it works: This is like a detective who looks for patterns. Instead of just adding up points, it looks for complex clues and connections between different typos that a human might miss. It learns from the data to spot hidden relationships.
- The Vibe: Smart, adaptable, and good at finding non-obvious connections.
The Deep Thinker (Deep Learning - DL):
- How it works: This is the detective's super-powered cousin. It uses artificial neural networks (simulating a brain) to dig even deeper. It can handle massive amounts of data and find incredibly complex, multi-layered patterns that the other two might miss.
- The Vibe: High-tech, powerful, but sometimes a bit of a "black box" (hard to explain how it reached the conclusion).

The Race Results

The researchers ran these three methods against each other on 80 different traits. Here is what they found:

The Scoreboard: It was a very close race!
- Machine Learning/Deep Learning won for 44 traits.
- The Traditional Accountant (PRS) won for 36 traits.
The Winners:
- For complex diseases like Type 2 Diabetes or Depression, the "Deep Thinkers" (Deep Learning) and "Sharp Detectives" (Machine Learning) were often better. They could handle the messy, complicated nature of these conditions.
- For physical traits like Bone Mineral Density or Restless Leg Syndrome, the "Traditional Accountant" (PRS) often did a better job. These traits seem to follow clearer, more predictable rules.
The Losers: One specific tool called PRSice2 (a type of PRS calculator) struggled significantly, often performing worse than the others. It's like bringing a bicycle to a car race.

The "Motorbike" Surprise

One of the most interesting findings was about traits that aren't really "biological" in the traditional sense.

The algorithms did a great job predicting things like Eye Color or Diseases.
But they failed miserably at predicting things like "Do you enjoy riding a motorbike?" or "Do you like fishing?"
The Lesson: This makes sense! Your genes might influence whether you have the muscles to ride a bike, but they have little say in whether you like it. Those are choices and preferences shaped mostly by your environment, not your DNA, which is why every method scored poorly on them.

Why This Matters

Think of this study as a guidebook for future doctors and researchers.

It's not "One Size Fits All": You can't just use one tool for every disease. If you are trying to predict a complex disease, you might need the "Deep Thinker" (Deep Learning). If you are looking at a simpler physical trait, the "Traditional Accountant" (PRS) might be faster and just as good.
Data is King: The study used a relatively small library (openSNP) compared to massive medical databases. Even with limited data, the algorithms found useful patterns for some traits. This suggests that predictions for rare diseases or smaller populations may be possible without millions of participants, although results for traits with only a handful of samples should be treated with caution.
The Future: The researchers are now thinking about combining these tools, using the "Detective" to find the clues and the "Accountant" to calculate the final risk, to create the ultimate prediction engine for precision medicine.

In short: We are getting better at reading our genetic library. Sometimes the old maps work best, but sometimes we need a super-computer to find the hidden treasure. And sometimes, the computer just has to admit, "I can't guess your hobbies!"

Here is a detailed technical summary of the paper "Benchmarking 80 binary phenotypes from the openSNP dataset using deep learning algorithms and polygenic risk score tools."

1. Problem Statement

The prediction of phenotypes from genotypes is a cornerstone of precision medicine, aiming to identify disease risks and biological traits based on Single Nucleotide Polymorphisms (SNPs). While Genome-Wide Association Studies (GWAS) and Polygenic Risk Scores (PRS) have been traditional methods, they often struggle with rare mutations, complex non-linear interactions, and population-specific biases. Conversely, Machine Learning (ML) and Deep Learning (DL) offer the potential to capture high-dimensional, non-linear relationships but require extensive benchmarking to determine their efficacy compared to established PRS tools.

The study addresses the gap in comprehensive comparisons between ML/DL algorithms and PRS tools across a wide range of phenotypes using the openSNP dataset. A specific challenge highlighted is the "Direct-to-Consumer" (DTC) nature of openSNP data, which often lacks detailed demographic metadata (e.g., the ancestry information needed to model population stratification, gender) and suffers from missing values, making standard PRS calculations difficult.

2. Methodology

Data Source and Pre-processing

Dataset: The study utilized the openSNP dataset, containing genotype data from DTC providers (23andMe, AncestryDNA, FamilyTreeDNA) and phenotype self-reports.
Cleaning: The authors processed 6,401 genotype files and 668 phenotypes. They focused on binary phenotypes (Case/Control).
- Transformation: Phenotype labels were manually unified (e.g., mapping "Right-handed," "Right," and "R" to a single "Case" value).
- Filtering: Phenotypes with excessive missing data or unbalanced class distributions were discarded.
- Final Scope: 80 binary phenotypes were retained for analysis, with sample sizes ranging from as few as 14 (Hypertension) to over 290 (Colour Blindness).
Format Conversion: All genotype files were converted to the PLINK format (bed, bim, fam).
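The label-unification step can be sketched as follows. This is a minimal illustration, not the authors' code: the alias sets and the `unify_label` helper are hypothetical, and only the "Right-handed" / "Right" / "R" mapping comes from the text above. The control aliases are invented for the example.

```python
from typing import Optional

# Hypothetical alias sets for a handedness phenotype. Only the case aliases
# are taken from the summary; the control aliases are illustrative guesses.
CASE_ALIASES = {"right-handed", "right", "r"}
CONTROL_ALIASES = {"left-handed", "left", "l"}


def unify_label(raw: str) -> Optional[int]:
    """Map a free-text self-report to 1 (case), 0 (control) or None (unusable)."""
    key = raw.strip().lower()
    if key in CASE_ALIASES:
        return 1
    if key in CONTROL_ALIASES:
        return 0
    return None  # ambiguous or empty answers are treated as missing


reports = ["Right-handed", " right ", "R", "Left", "ambidextrous", ""]
labels = [unify_label(r) for r in reports]
```

Answers that map to `None` would count toward the missing-data total used by the filtering step.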

Quality Control (QC)

Genotype QC: Applied standard filters: Minor Allele Frequency (MAF) > 0.01, Hardy-Weinberg Equilibrium (HWE) p-value > 1e-6, genotype rate > 0.99, and individual missingness < 0.3.
Data Splitting: A stratified 5-fold cross-validation approach was used, so each fold trains on 80% of the data and tests on the remaining 20%.
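A stratified 5-fold split can be sketched in plain Python as below. This is an illustrative sketch rather than the authors' implementation; the helper names `stratified_folds` and `train_test_splits`, the round-robin assignment, and the toy labels are all assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices); each fold serves as the test set once."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Toy phenotype: 20 cases (1) and 30 controls (0).
y = [1] * 20 + [0] * 30
splits = list(train_test_splits(y))
```

With 50 samples, every split holds 40 training and 10 test indices, and each test fold keeps the 2:3 case/control ratio of the full set.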

Modeling Approaches

The study benchmarked three approaches (ML, DL, and PRS), grouped here into two pipelines:

A. Machine Learning (ML) & Deep Learning (DL)

Feature Selection: Instead of using all SNPs, the authors used p-value thresholding on the training set (via Fisher's exact test) to select subsets of SNPs (50, 100, 200, 500, 1,000, 5,000, and 10,000).
ML Algorithms: 29 algorithms were tested through the scikit-learn interface, including tree-based methods (Random Forest, AdaBoost, and XGBoost, which is a separate library with a scikit-learn-compatible API), SVMs, and linear models (SGD, Logistic Regression).
DL Architectures: 80 variants were generated using four base architectures:
- Artificial Neural Networks (ANN)
- Gated Recurrent Units (GRU)
- Long Short-Term Memory networks (LSTM)
- Bidirectional LSTM
Hyperparameter Tuning: Variations included Dropout (0.2, 0.5), Optimizer (Adam), Batch Size (1, 5), and Epochs (50, 200), giving 8 hyperparameter sets per architecture.
Input Adaptation: The number of neurons in hidden layers was dynamically adjusted based on the number of input SNPs ( $S$ ) using formulas such as $128 + 2\sqrt{S}$, to handle varying feature dimensions.
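The feature-selection step described in this pipeline (ranking SNPs by Fisher's exact test p-value on the training set and keeping the top k) can be sketched as follows. The summary does not say whether the test was run on allele counts or genotype counts, so the allelic 2x2 table used here is an assumption, as are the helper names and the toy SNP identifiers.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def snp_p_value(genotypes, labels):
    """Allelic test: alternate vs reference allele counts in cases and controls."""
    case_alt = case_ref = ctrl_alt = ctrl_ref = 0
    for g, y in zip(genotypes, labels):
        if g is None:  # skip missing genotype calls
            continue
        if y == 1:
            case_alt += g
            case_ref += 2 - g
        else:
            ctrl_alt += g
            ctrl_ref += 2 - g
    return fisher_exact_two_sided(case_alt, case_ref, ctrl_alt, ctrl_ref)


def select_top_snps(train_genotypes, train_labels, k):
    """Return the names of the k SNPs with the smallest training-set p-values."""
    p_values = {snp: snp_p_value(g, train_labels) for snp, g in train_genotypes.items()}
    return sorted(p_values, key=p_values.get)[:k]


# Toy training set: 4 cases followed by 4 controls, three hypothetical SNPs
# coded as alternate-allele counts (0, 1 or 2).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
genotypes = {
    "rs_strong": [2, 2, 2, 2, 0, 0, 0, 0],  # perfectly separates the classes
    "rs_weak": [1, 2, 1, 1, 0, 1, 1, 0],
    "rs_null": [1, 0, 1, 0, 1, 0, 1, 0],    # identical allele counts in both groups
}
top = select_top_snps(genotypes, labels, k=2)
```

Computing the p-values on the training fold only, as the summary describes, keeps information from the test fold out of the SNP ranking.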

B. Polygenic Risk Scores (PRS)

Tools: Three tools were benchmarked: PLINK, PRSice2, and Lassosum.
Base File Generation: Unlike studies using external GWAS summary statistics, the authors generated GWAS summary statistics (p-values, Odds Ratios) directly from the openSNP training data to mitigate ancestry mismatches.
Parameter Grid: A massive grid search was performed with 675 combinations of clumping and pruning parameters:
- Pruning: Window size (200, 500, 1000), Shift (50, 100, 150), LD threshold (0.1, 0.3, 0.5).
- Clumping: P-value threshold (1), $r^2$ threshold (0.1–0.9), Physical distance (200–1000 kb).
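The size of this grid can be checked with a short enumeration. The three pruning lists give 3 x 3 x 3 = 27 combinations, so a total of 675 implies 25 clumping combinations. The summary gives only ranges for the clumping $r^2$ and distance, so the five evenly spaced values per range below are an assumption chosen to reproduce that total, not values confirmed by the source. The dictionary keys are likewise illustrative.

```python
from itertools import product

# Pruning values as listed in the summary.
PRUNE_WINDOWS = [200, 500, 1000]
PRUNE_SHIFTS = [50, 100, 150]
PRUNE_LD = [0.1, 0.3, 0.5]

# Clumping values: only the ranges are given in the summary. The intermediate
# values are an ASSUMPTION (five evenly spaced steps per range).
CLUMP_P = [1]
CLUMP_R2 = [0.1, 0.3, 0.5, 0.7, 0.9]
CLUMP_KB = [200, 400, 600, 800, 1000]

pruning = list(product(PRUNE_WINDOWS, PRUNE_SHIFTS, PRUNE_LD))
clumping = list(product(CLUMP_P, CLUMP_R2, CLUMP_KB))
grid = [
    {"window": w, "shift": s, "ld": ld, "clump_p": p, "clump_r2": r2, "clump_kb": kb}
    for (w, s, ld), (p, r2, kb) in product(pruning, clumping)
]
print(len(pruning), len(clumping), len(grid))  # 27 25 675
```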

Evaluation Metric

Area Under the Curve (AUC): Used as the primary metric due to class imbalance in many phenotypes.
Normalization: PRS scores were normalized (Min-Max) and converted to binary predictions (threshold 0.5) to calculate AUC comparable to ML/DL outputs.
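The scoring and evaluation steps can be sketched end to end: an additive PRS, Min-Max normalization, thresholding at 0.5, and AUC. This is a sketch under stated assumptions, not the tools' internals: weighting each allele count by the log odds ratio is the conventional additive form, and the odds ratios, genotypes, and helper names are hypothetical.

```python
from math import log


def prs(genotype, odds_ratios):
    """Additive score: allele count times log odds ratio, summed over SNPs."""
    return sum(count * log(odds_ratios[snp]) for snp, count in genotype.items())


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def auc(y_true, y_score):
    """Probability that a random case outranks a random control (ties count 0.5)."""
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Hypothetical odds ratios and allele counts for four people.
odds_ratios = {"rs1": 1.5, "rs2": 0.8, "rs3": 2.0}
people = [
    {"rs1": 2, "rs2": 0, "rs3": 1},
    {"rs1": 1, "rs2": 1, "rs3": 1},
    {"rs1": 0, "rs2": 2, "rs3": 0},
    {"rs1": 2, "rs2": 1, "rs3": 0},
]
y_true = [1, 1, 0, 0]

scores = min_max([prs(p, odds_ratios) for p in people])
y_pred = [1 if s >= 0.5 else 0 for s in scores]

auc_scores = auc(y_true, scores)  # ranking quality of the continuous score
auc_binary = auc(y_true, y_pred)  # AUC after thresholding at 0.5
```

In this toy case the continuous scores rank both cases above both controls, while the thresholded predictions misclassify one control. AUC computed on binary predictions equals the average of sensitivity and specificity, so it can be lower than the AUC of the underlying scores.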

3. Key Contributions

Comprehensive Benchmarking: The first large-scale study to compare 29 ML algorithms, 80 DL variants, and 3 PRS tools across 80 distinct phenotypes.
Hyperparameter Optimization: Systematic exploration of 675 PRS parameter combinations and 8 DL hyperparameter sets per architecture, identifying optimal configurations for specific traits.
Data Pipeline for DTC Data: Developed a robust pipeline to handle the specific limitations of openSNP (missing metadata, inconsistent labeling, low sample sizes) by generating internal GWAS summary statistics rather than relying on external European-centric datasets.
Phenotype-Specific Insights: Demonstrated that no single method dominates all phenotypes; performance is highly trait-dependent.

4. Key Results

Overall Performance

ML/DL Superiority: ML/DL algorithms outperformed PRS tools for 44 phenotypes.
PRS Superiority: PRS tools outperformed ML/DL for 36 phenotypes.
Best Performers:
- ML: XGBoost was the top performer (best for 11 phenotypes), followed by Decision Trees and Passive Aggressive Classifiers.
- DL: Artificial Neural Networks (ANN) were the most successful DL architecture (best for 26 phenotypes). Recurrent networks (LSTM/GRU) showed mixed results, excelling in specific cases like Hypertension and Type 2 Diabetes.
- PRS: PLINK was the best PRS tool (best for 25 phenotypes). PRSice2 generally performed the worst, likely due to its handling of missing data in low-genotype-rate datasets.

Specific Phenotype Findings

High-Performance ML/DL: Fibromyalgia (96.6% AUC), Plantar Fasciitis (93.2%), Craves Sugar (84.8%), and Eczema (83.2%) were best predicted by ML/DL.
High-Performance PRS: Bone Mineral Density (87.44% AUC), Wide feet (89.17%), and Restless leg syndrome (83.81%) were best predicted by PLINK.
Complex Traits: Traits like Type 2 Diabetes and Migraine required thousands of SNPs for optimal ML performance, suggesting complex polygenic architectures.
Low-Genetic Influence: Preferences like "Enjoy riding a motorbike" and "Sport interest" showed low AUCs across all methods, suggesting these are driven more by environmental factors than genetics.

Parameter Sensitivity

PRS: Optimal PRS performance often relied on specific clumping thresholds (e.g., $r^2$ = 0.1) and pruning window sizes (200).
DL: The best DL performance was achieved with Dropout = 0.2, Optimizer = Adam, Batch Size = 1, and Epochs = 50.
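The DL search space from the methodology can be enumerated to confirm the count of 8 hyperparameter sets, together with the hidden-layer width formula $128 + 2\sqrt{S}$ for each SNP subset size. This is a sketch only: the dictionary keys are illustrative, rounding the width to the nearest integer is an assumption, and only the one width formula quoted in the summary is shown.

```python
from itertools import product
from math import sqrt

# Hyperparameter values as listed in the methodology.
DROPOUTS = [0.2, 0.5]
OPTIMIZERS = ["adam"]
BATCH_SIZES = [1, 5]
EPOCHS = [50, 200]

configs = [
    {"dropout": d, "optimizer": o, "batch_size": b, "epochs": e}
    for d, o, b, e in product(DROPOUTS, OPTIMIZERS, BATCH_SIZES, EPOCHS)
]


def hidden_units(num_snps):
    """Hidden-layer width for S input SNPs: 128 + 2*sqrt(S), rounded (assumption)."""
    return round(128 + 2 * sqrt(num_snps))


SNP_SUBSETS = [50, 100, 200, 500, 1000, 5000, 10000]
widths = {s: hidden_units(s) for s in SNP_SUBSETS}

# Best-performing setting reported in the summary.
best = {"dropout": 0.2, "optimizer": "adam", "batch_size": 1, "epochs": 50}
```

The width grows slowly with the input size, from 142 units for 50 SNPs to 328 units for 10,000 SNPs.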

5. Significance and Limitations

Significance:

Precision Medicine: The study suggests that for populations with limited data (like openSNP), brute-force hyperparameter tuning of ML/DL models can yield competitive or superior results compared to traditional PRS, which could aid early disease screening.
Methodological Guidance: It provides a "best practice" guide: use ANN with 5 layers for general DL tasks, XGBoost for ML, and PLINK with default clumping for PRS when starting a new phenotype analysis.
Transfer Learning Potential: The work supports the use of transfer learning and model adaptation for under-studied populations where large-scale GWAS summary statistics are unavailable.

Limitations:

Data Quality: The openSNP dataset lacks reliable ancestry and gender metadata, so population stratification cannot be properly controlled, which can lead to false positives or reduced model generalizability.
Sample Size: Many phenotypes had very small sample sizes (e.g., <20 cases), limiting the statistical power and reliability of the results for those specific traits.
Interpretability: While DL models performed well, they lack the biological interpretability of PRS, making it difficult to identify specific causal SNPs.
Confounding Factors: The absence of age and other covariates in the openSNP data may have biased the predictions.

Conclusion:
The paper concludes that genotype-phenotype prediction is a complex problem where the optimal algorithm depends heavily on the specific phenotype's genetic architecture and data quality. While PRS remains powerful for highly polygenic traits, ML and DL offer a robust alternative for complex, non-linear traits, particularly in datasets where traditional GWAS summary statistics are unavailable or mismatched.