Benchmarking Heritability Estimation Strategies Across 86 Configurations and Their Downstream Effect on Polygenic Risk Score Performance

This study benchmarks 86 heritability estimation configurations across six tool families and ten method groups, revealing that while SNP heritability estimates vary substantially with methodological choices, this upstream variability has a negligible impact on downstream polygenic risk score performance.

Muhammad Muneeb, David B. Ascher

Published 2026-04-06

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to bake the perfect cake (a Polygenic Risk Score, or PRS) to predict how likely someone is to get a specific disease. To bake this cake, you need a key ingredient: a precise measurement of how much "genetic spice" contributes to the disease. This measurement is called Heritability (h²).
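
To make the "genetic spice" idea concrete: heritability is the fraction of trait variance attributable to genetics, h² = Var(genetic) / Var(phenotype). Here is a minimal Python sketch (my illustration, not the paper's code) that simulates a trait from two ingredients and recovers that fraction:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # simulated individuals

# A trait built from two ingredients: a genetic value and environmental noise.
true_h2 = 0.30                                          # genetic share of variance
genetic = rng.normal(0.0, np.sqrt(true_h2), n)          # Var ≈ 0.30
environment = rng.normal(0.0, np.sqrt(1 - true_h2), n)  # Var ≈ 0.70
phenotype = genetic + environment

# Heritability: the fraction of total trait variance explained by genetics.
h2 = genetic.var() / phenotype.var()
print(f"h^2 ≈ {h2:.2f}")  # recovers roughly 0.30
```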

For years, scientists have been arguing about the best way to measure this "genetic spice." Some use a digital scale, others use a spring scale, and some use a balance beam. The problem? They all give you different numbers. One tool might say the spice makes up 10% of the cake, while another says 20%, and a third might even say "-5%" (which sounds impossible, like a negative amount of sugar!).

This paper, titled "Benchmarking Heritability Estimation Strategies," is like a massive, scientific taste test. The researchers wanted to answer two big questions:

  1. Why do these tools give such different numbers?
  2. Does it actually matter if you use the "wrong" number when you bake the final cake?

Here is the breakdown of their findings, explained simply:

1. The "Ruler" Problem: Why the numbers are all over the place

The researchers tested 86 different ways (configurations) to measure heritability using data from 10 different health conditions (like asthma, depression, and high cholesterol) from the UK Biobank.

Think of these 86 ways as 86 different rulers. Some rulers are stretched, some are shrunk, some measure in inches, and some in centimeters.

  • The Result: The numbers they got were wild. They ranged from -0.86 to +2.73, even though a true heritability, being a proportion of variance, must sit between 0 (no genetic influence) and 1 (entirely genetic).
  • The "Negative" Mystery: About 16% of the time, the tools gave a "negative" heritability. In the real world, you can't have negative spice. But in statistics, this just means the tool was looking at a very weak signal and got confused, essentially saying, "I can't find any spice here, and my math is so shaky it looks like there's less than nothing."
  • The Cause: The biggest reason for the differences wasn't the data itself, but how the scientists set up their tools. Did they include extra variables? Did they clean the data in a specific way? Did they use a specific algorithm? Changing these settings was like changing the ruler's calibration.
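
The paper's tools are far more sophisticated, but a toy Haseman-Elston-style regression (a simple, unconstrained heritability estimator; everything below is simulated, not the paper's data) shows how "less than nothing" happens. When the true signal is zero, the estimate just scatters around zero, and roughly half the runs land below it:

```python
import numpy as np

def estimate_h2_unconstrained(seed, n=300, m=1_000):
    """Toy Haseman-Elston-style h^2 estimate; small n keeps the signal weak."""
    rng = np.random.default_rng(seed)

    # Standardized genotypes -> genetic relatedness matrix (GRM).
    geno = rng.binomial(2, 0.5, size=(n, m)).astype(float)
    geno = (geno - geno.mean(axis=0)) / geno.std(axis=0)
    grm = geno @ geno.T / m

    # True heritability is 0 here: the phenotype is pure noise.
    y = rng.standard_normal(n)
    y = (y - y.mean()) / y.std()

    # Regress phenotype cross-products on relatedness over distinct pairs;
    # the OLS slope is an unconstrained h^2 estimate and can go negative.
    i, j = np.triu_indices(n, k=1)
    slope = np.polyfit(grm[i, j], y[i] * y[j], deg=1)[0]
    return slope

estimates = [estimate_h2_unconstrained(s) for s in range(10)]
print([f"{e:+.2f}" for e in estimates])  # several estimates fall below zero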

The Analogy: It's like asking 86 different people to measure the height of a building. If one person stands on a chair, another uses a tape measure that is stretched, and a third guesses based on shadows, they will all get different numbers. The building didn't change; the method changed.

2. The Big Surprise: The Cake Still Tastes the Same

This is the most important part of the paper. Usually, if you use the wrong amount of sugar in a cake, the cake tastes terrible. The researchers expected that if they used a "wrong" heritability number to build their risk prediction model (the cake), the prediction would fail.

They were wrong.

  • The Finding: Even though the "spice measurement" (heritability) varied wildly, the final cake (the risk prediction) tasted almost the same.
  • The Evidence: When they tested the predictions against real people, the accuracy barely changed, regardless of whether they used a heritability number of 0.05 or 0.50 (see the sketch after this list).
  • The Takeaway: The system is surprisingly robust. It's like baking a cake where the recipe is flexible. Whether you use a cup of sugar or a cup and a half, the cake still turns out delicious. The "downstream" result (predicting disease risk) is not very sensitive to the "upstream" error in measuring heritability.
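
The paper does not tie this robustness to one particular PRS method, but a minimal sketch suggests why it is plausible. Assuming independent SNPs and an LDpred-inf-style shrinkage formula (both simplifications I am adding, not details from the paper), the heritability parameter only rescales every SNP weight by the same factor, so it cannot reorder who looks high-risk and who looks low-risk:

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, m = 4_000, 1_000, 1_000

# Infinitesimal model: every SNP carries a small true effect.
true_h2 = 0.25
beta = rng.normal(0.0, np.sqrt(true_h2 / m), m)

def simulate(n):
    geno = rng.binomial(2, 0.5, size=(n, m)).astype(float)
    geno = (geno - geno.mean(axis=0)) / geno.std(axis=0)
    y = geno @ beta + rng.normal(0.0, np.sqrt(1 - true_h2), n)
    return geno, y

g_train, y_train = simulate(n_train)
g_test, y_test = simulate(n_test)

# Marginal GWAS effect estimates from the training cohort.
beta_hat = g_train.T @ y_train / n_train

def prs(h2_param):
    # LDpred-inf-style posterior mean with no LD: one uniform shrinkage
    # factor, so h2_param rescales all SNP weights by the same amount.
    shrink = 1.0 / (1.0 + m / (n_train * h2_param))
    return g_test @ (shrink * beta_hat)

for h2_param in (0.05, 0.50):
    r = np.corrcoef(prs(h2_param), y_test)[0, 1]
    print(f"h2 parameter = {h2_param:.2f} -> test correlation = {r:.3f}")
```

Because the shrinkage here is a single scale factor, swapping h² = 0.05 for h² = 0.50 leaves the test-set correlation identical. Real LD-aware methods break this exact invariance, but the dependence stays weak, which is consistent with what the authors observed.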

3. What Should We Do Now?

The paper concludes with some practical advice for scientists and doctors:

  • Don't treat Heritability as a "Fact": Stop thinking of heritability as a single, unchangeable number like the speed of light. It is more like a setting on a camera. If you change the ISO, the aperture, and the shutter speed (the configuration), you get a different photo (a different number), even if the subject is the same.
  • Report Your Settings: If you publish a heritability number, you must also publish exactly how you got it. Saying "Heritability is 0.2" is meaningless without saying "We used Tool X, with Method Y, and cleaned the data this way."
  • Don't Panic Over Negative Numbers: If a tool gives a negative heritability, don't throw the tool in the trash. It just means the tool is "unconstrained" and the signal was weak. It's a valid mathematical output, not necessarily a broken tool.

Summary

Imagine you are trying to navigate a ship (predicting disease risk). You have a compass (heritability estimation) that spins to a different heading depending on how you hold it.

  • Old thinking: "Oh no! The compass is spinning! We can't navigate!"
  • This paper's finding: "Actually, the ship is so sturdy and the ocean so calm that even with a spinning compass, we still arrive at the right destination."

The Bottom Line: The way we measure genetic influence is messy and depends heavily on our tools, but luckily, our ability to predict disease risk is tough enough to handle that messiness. We just need to be honest about which "ruler" we used.
