Causal variant capture in genotype discovery approaches drives polygenic prediction performance across traits and populations

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to predict how tall a person will be, or how likely they are to get a specific disease, just by looking at their DNA. Scientists use a tool called a Polygenic Score (PGS) to do this. Think of a PGS as a "genetic weather forecast." It doesn't tell you exactly what will happen, but it gives you a probability based on thousands of tiny genetic clues scattered throughout your DNA.

For years, scientists have used Genotyping Arrays to read these clues. You can think of an array like a standardized multiple-choice test. It has a pre-printed list of about 500,000 to 1 million specific questions (genetic spots) that it asks everyone. It's cheap, fast, and works well if the "test" was designed for the specific group of people taking it. However, it misses a lot of the story because it only looks at the questions it was programmed to ask.

Recently, we have started using Whole Genome Sequencing (WGS). This is like reading the entire book of life instead of just taking a multiple-choice test. It looks at every single letter in the DNA code (about 3 billion of them), catching rare and unique genetic variations that the multiple-choice test would completely miss.

The Big Question:
Does reading the whole book (WGS) actually give you a better "weather forecast" than the multiple-choice test (Array), or is the test good enough? And does it matter whether the trait is shaped by thousands of tiny genetic effects (like height) or by relatively few (like certain cancers)?

What the Researchers Did

The authors of this paper took a huge dataset from the "All of Us" research program, which includes nearly 96,000 people from diverse backgrounds (European, African American, and Latino/Admixed American). These people had both the multiple-choice test (Array) and the full book (WGS) done on them.

They tried to predict 10 different traits (like height, cholesterol levels, and type 2 diabetes) using both methods to see which one was more accurate.

The Surprising Findings

1. It depends on the "Recipe" (The Method)
Imagine you are baking a cake.

Method A (Clumping): This is like a strict baker who says, "We can only use one ingredient from every shelf." If you have a whole pantry (WGS), this baker throws away about 95% of your ingredients because they are too similar to others. In this case, the Array (which had fewer ingredients to begin with) worked just as well, and sometimes even better, because it lost less useful information.
Method B (LD-informed/PRS-CS): This is a smart baker who knows how ingredients work together and can use far more of the pantry without throwing most of it away. When the researchers used this smarter method, WGS (the full book) generally won, giving a more accurate forecast for highly polygenic traits like height.

2. The "Missing Pieces" Problem
The study found that the main reason WGS is better is that it captures the "causal variants."

Analogy: Imagine trying to solve a mystery. The Array is like a detective who only interviews suspects who live on Main Street. WGS is a detective who interviews everyone in the city.
If the real culprit (the causal genetic variant) lives on a side street, the Array detective misses them. The WGS detective finds them.
However, the study also found a twist: Just having more clues doesn't always help if those clues are just "noise" (irrelevant information). Sometimes, having too much data without a smart filter (like the PRS-CS method) can actually confuse the prediction.

3. The Cost vs. Benefit

Arrays are like buying a newspaper: Cheap, quick, and good for the headlines.
WGS is like buying a library: Expensive, takes a long time to read, and requires a lot of storage space.
The researchers found that while WGS is often more accurate, the extra cost and computing power might not be worth it for every trait. For traits with a sparse genetic architecture (driven by relatively few variants, like the cancers studied), the newspaper (Array) was often just as good. But for highly polygenic traits (like height), the library (WGS) gave a clearer picture.

4. The Diversity Factor
Historically, genetic tests were designed mostly for people of European ancestry. Because people of African and Admixed ancestry carry more genetic diversity, the "multiple-choice test" (Array) can miss some of their unique genetic markers, and WGS, which reads everything, is one promising way to catch them. In this study, though, WGS did not automatically close the accuracy gap between groups, partly because the newer array used here was designed with diverse populations in mind.

The Bottom Line

This study tells us that Whole Genome Sequencing may well be the future, but we need to be smart about how we use it.

If you want the most accurate prediction for complex traits, WGS is the winner, but you need a sophisticated computer program (like PRS-CS) to make sense of all that data.
If you are looking at traits driven by relatively few genetic variants (like some cancers), or need a quick, cheap answer, the standard Array is still a very strong contender.
Most importantly, the ability to find the actual "culprit" genetic variants is what drives accuracy. Whether you use a cheap test or an expensive book, if you can't find the specific genetic clues that influence the trait, your prediction won't be very good.

In short: We are moving from taking a multiple-choice test to reading the whole book, but we need to learn how to read it efficiently to get the best results for everyone.

1. Problem Statement

Polygenic Scores (PGS) are critical tools for precision medicine, estimating genetic risk for complex traits. However, their accuracy is heavily influenced by the genotype discovery technology used.

Current State: Most PGS rely on genotyping arrays, which are cost-effective but limited to pre-selected common variants, requiring imputation to fill gaps. Imputation accuracy varies significantly across populations, often disadvantaging non-European groups.
The Gap: Whole-Genome Sequencing (WGS) offers comprehensive variant coverage (including rare variants) but is computationally expensive, and its specific advantage over arrays for PGS prediction across diverse traits and ancestries remains unclear.
Core Question: Does WGS-based PGS consistently outperform array-based PGS? How do trait architecture (polygenicity vs. sparsity), population ancestry, and PGS construction methods (e.g., Clumping & Thresholding vs. LD-informed Bayesian methods) influence this performance?
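To make the comparison concrete, it helps to recall that a polygenic score is conventionally computed as a weighted sum: each variant's effect-allele dosage (0, 1, or 2) is multiplied by its estimated effect size, and the products are added up. The minimal sketch below assumes that standard additive form; the variant IDs, weights, and dosages are hypothetical and are not taken from the study. It illustrates how a variant that one technology does not observe simply drops out of the sum.

```python
from typing import Dict


def polygenic_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum dosage * effect size over variants present in both inputs."""
    shared = dosages.keys() & weights.keys()
    return sum(dosages[v] * weights[v] for v in shared)


# Hypothetical variant IDs and effect sizes, for illustration only.
weights = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 1.0}

# Hypothetical dosages for one person under each technology.
array_dosages = {"rs_a": 2, "rs_b": 1}           # rs_c is not on the array
wgs_dosages = {"rs_a": 2, "rs_b": 1, "rs_c": 1}  # sequencing also observes rs_c

array_pgs = polygenic_score(array_dosages, weights)
wgs_pgs = polygenic_score(wgs_dosages, weights)
print(array_pgs, wgs_pgs)
```

Whether the extra term improves prediction depends on whether the added variant carries real signal, which is the question the study examines.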

2. Methodology

The study utilized a large-scale, paired dataset from the All of Us Research Program (v6), comprising 95,562 individuals who possess both high-coverage (30x) WGS and genotyping array data.

Target Cohorts: European (EUR), African American (AFR), and Admixed American (AMR) populations.
Traits: 10 complex traits with varying genetic architectures (heritability and polygenicity), including height, blood cell traits (RBC count, leukocyte count), lipids (HDL, total cholesterol [TC]), and diseases (asthma, type 2 diabetes [T2D], breast and colorectal cancer).
Discovery Data: Multi-ancestry GWAS meta-analyses from the Pan-UK Biobank (Pan-UKBB) were used to derive effect sizes.
PGS Construction Methods:
1. Clumping and Thresholding (C+T): A standard, non-Bayesian approach.
2. PRS-CS: A Bayesian, LD-informed method that continuously shrinks effect sizes without aggressive clumping.
3. Pre-trained Models: Evaluation of models from the PGS Catalog to test generalizability.
Comparative Analysis:
- Benchmarked PGS performance ($R^2$ for continuous traits, Nagelkerke's $R^2$ for binary traits) between Array and WGS data.
- Simulations: Modeled scenarios where the proportion of captured causal variants was systematically reduced to isolate the impact of variant capture on accuracy.
- Fine-mapping: Used SuSiE to identify likely causal SNPs and tested whether restricting PGS to only these variants improved performance.
- Computational Cost: Analyzed CPU time and data generation costs.
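For readers unfamiliar with the binary-trait metric named above, the sketch below shows one common way to compute Nagelkerke's $R^2$ from the log-likelihoods of a null model and a model that includes the score. It assumes an intercept-only null model; in practice the null often also contains covariates, and the study's exact evaluation pipeline is not described here. The case/control labels and fitted probabilities are hypothetical.

```python
import math
from typing import Sequence


def bernoulli_log_likelihood(y: Sequence[int], p: Sequence[float]) -> float:
    """Log-likelihood of binary outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def nagelkerke_r2(ll_null: float, ll_full: float, n: int) -> float:
    """Cox-Snell R^2 rescaled by its maximum attainable value."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_full) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


# Hypothetical case/control labels and fitted probabilities from a model
# that includes the polygenic score (not real study data).
y = [0, 0, 0, 1, 1, 1, 0, 1]
p_full = [0.2, 0.3, 0.1, 0.8, 0.7, 0.9, 0.4, 0.6]

# Assumed null model: intercept only, so every person gets the prevalence.
prevalence = sum(y) / len(y)
p_null = [prevalence] * len(y)

ll_null = bernoulli_log_likelihood(y, p_null)
ll_full = bernoulli_log_likelihood(y, p_full)
r2 = nagelkerke_r2(ll_null, ll_full, len(y))
print(round(r2, 3))
```

The rescaling is what lets the statistic reach 1 for a perfect model, which the unscaled Cox-Snell version cannot do for binary outcomes.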

3. Key Contributions

Systematic Benchmarking: First large-scale comparison of PGS derived from paired array and WGS data across multiple ancestries and traits within a single cohort (All of Us).
Method-Specific Insights: Demonstrated that the superiority of WGS is highly dependent on the PGS algorithm used (C+T vs. PRS-CS).
Causal Variant Hypothesis: Provided empirical and simulation evidence that the proportion of causal variants captured is the primary driver of prediction accuracy, but also highlighted the "signal-to-noise" trade-off where including too many non-informative variants can degrade performance.
LD Panel Efficiency: Showed that for PRS-CS, using a restricted set of ~1.1 million HapMap3 variants is often sufficient and computationally superior to using full WGS-derived LD matrices (~7.3 million variants), particularly for non-European populations where full LD panels may introduce noise.

4. Key Results

A. Impact of PGS Method (C+T vs. PRS-CS)

C+T Method: WGS did not consistently outperform arrays. In many cases, array-based PGS performed better, particularly for sparse traits (e.g., cancers) and in specific populations (e.g., HDL in AMR/AFR).
- Reasoning: C+T relies on LD clumping. WGS variants (~9M) were drastically reduced (~94.5%) during clumping, potentially removing informative variants, whereas array variants (~1M) saw a more modest reduction (~71%).
PRS-CS Method: WGS-based PGS generally outperformed array-based PGS for highly polygenic traits (e.g., Height, Leukocyte count) across most populations.
- Reasoning: PRS-CS utilizes LD information to shrink effect sizes, allowing it to leverage the denser variant coverage of WGS without the aggressive pruning of C+T.
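The clumping behaviour described above can be sketched with a greedy procedure: keep variants that pass a p-value threshold, walk through them from most to least significant, and drop any variant that is in high LD with one already kept. This is a simplified illustration, not the study's pipeline: it ignores the physical distance window that real clumping tools apply, uses squared genotype correlation as the LD measure, and all genotypes, p-values, and thresholds are hypothetical.

```python
from typing import Dict, List


def ld_r2(x: List[float], y: List[float]) -> float:
    """Squared Pearson correlation between two genotype vectors (LD proxy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic variant carries no LD information
    return (sxy * sxy) / (sxx * syy)


def clump_and_threshold(genotypes: Dict[str, List[float]],
                        p_values: Dict[str, float],
                        p_max: float = 0.05,
                        r2_max: float = 0.1) -> List[str]:
    """Greedy C+T: threshold on p-value, then keep one variant per LD clump."""
    passing = sorted((v for v in genotypes if p_values[v] <= p_max),
                     key=lambda v: p_values[v])
    kept: List[str] = []
    for variant in passing:
        # Keep only if not in high LD with any more significant kept variant.
        if all(ld_r2(genotypes[variant], genotypes[k]) <= r2_max for k in kept):
            kept.append(variant)
    return kept


# Hypothetical dense (WGS-like) panel: two signals plus correlated neighbours.
dense = {
    "v1": [0, 1, 2, 0, 1, 2], "v1_proxy": [0, 1, 2, 0, 1, 2],
    "v1_near": [0, 1, 2, 0, 1, 1],
    "v2": [0, 2, 0, 2, 0, 2], "v2_proxy": [0, 2, 0, 2, 0, 2],
    "v3": [1, 1, 0, 2, 1, 1],
}
p_values = {"v1": 1e-8, "v1_proxy": 1e-6, "v1_near": 1e-5,
            "v2": 1e-4, "v2_proxy": 1e-3, "v3": 0.5}
# Hypothetical sparse (array-like) panel: one variant per signal.
sparse = {"v1": dense["v1"], "v2": dense["v2"]}

kept_dense = clump_and_threshold(dense, p_values)
kept_sparse = clump_and_threshold(sparse, p_values)
print(len(dense), "->", len(kept_dense), "|", len(sparse), "->", len(kept_sparse))
```

In this toy case both panels end up with the same two index variants, so the denser panel loses a much larger share of its variants without gaining anything, which mirrors the pattern reported for C+T.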

B. Trait Architecture and Population

Polygenic Traits: WGS showed clear advantages for highly polygenic traits (like Height) where capturing more variants improves signal.
Sparse Traits: For traits with sparse genetic architecture (e.g., Breast Cancer, Colorectal Cancer), array-based PGS often performed comparably or better, likely due to the "noise" introduced by the vast number of non-causal variants in WGS when not perfectly filtered.
Ancestry: While EUR populations generally had the highest accuracy (due to discovery cohort bias), the performance gap between WGS and arrays varied. WGS did not universally close the accuracy gap for AFR populations, suggesting that newer array designs (like the Global Diversity Array used in All of Us) are less biased toward EUR ancestry than is often assumed.

C. Simulation and Fine-Mapping

Causal Capture: Simulations confirmed a dose-response relationship: prediction accuracy drops as the proportion of captured causal variants decreases.
Fine-Mapping Paradox: Restricting PGS to only statistically inferred causal SNPs (via SuSiE) decreased prediction accuracy. This suggests that current fine-mapping misses many true causal variants, and that variants in Linkage Disequilibrium (LD) with causal SNPs (tagging variants) are essential for robust prediction.
Signal-to-Noise: Simply increasing the number of variants (as in WGS) does not guarantee better accuracy; the signal-to-noise ratio is critical.
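The dose-response idea behind the causal-capture simulations can be illustrated with a toy model. The sketch below is not the authors' simulation design: it assumes independent variants (no LD), an allele frequency of 0.5, unit-magnitude effects with random sign, perfectly known effect sizes, and a heritability of 0.5, all chosen for simplicity. It does not model estimation error or noise variants, so it only speaks to the capture effect, not the signal-to-noise trade-off.

```python
import random
from typing import Dict, List, Sequence


def pearson_r2(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between a score and a phenotype."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def simulate_capture(fractions: Sequence[float], n_people: int = 2000,
                     n_causal: int = 100, h2: float = 0.5,
                     seed: int = 0) -> Dict[float, float]:
    """Return prediction R^2 when only a fraction of causal variants is scored."""
    rng = random.Random(seed)
    # Independent biallelic variants with allele frequency 0.5 (assumption).
    genotypes: List[List[int]] = [
        [rng.randint(0, 1) + rng.randint(0, 1) for _ in range(n_causal)]
        for _ in range(n_people)
    ]
    # Unit-magnitude effects with random sign (assumption).
    effects = [rng.choice([-1.0, 1.0]) for _ in range(n_causal)]
    genetic = [sum(g * b for g, b in zip(row, effects)) for row in genotypes]
    mean_g = sum(genetic) / n_people
    var_g = sum((g - mean_g) ** 2 for g in genetic) / n_people
    # Scale environmental noise so genetic variance is about h2 of the total.
    noise_sd = (var_g * (1.0 - h2) / h2) ** 0.5
    phenotype = [g + rng.gauss(0.0, noise_sd) for g in genetic]

    results: Dict[float, float] = {}
    for frac in fractions:
        k = int(round(frac * n_causal))  # nested subsets: first k causal variants
        scores = [sum(g * b for g, b in zip(row[:k], effects[:k]))
                  for row in genotypes]
        results[frac] = pearson_r2(scores, phenotype)
    return results


capture_r2 = simulate_capture([1.0, 0.5, 0.25])
for frac, r2 in capture_r2.items():
    print(f"captured {frac:.0%} of causal variants: R^2 = {r2:.3f}")
```

Under these assumptions the expected $R^2$ is roughly the captured fraction times the heritability, so accuracy falls steadily as fewer causal variants are scored.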

D. Computational and Economic Trade-offs

Cost: Genotyping arrays (~$100/sample) are significantly cheaper than 30x WGS (~$600/sample).
Computation: WGS-based PGS construction is computationally intensive.
- C+T: WGS required ~2x the CPU time of arrays.
- PRS-CS: WGS required ~2.2x to 18x more CPU time depending on the LD matrix size used.
Imputation: In the UK Biobank, imputed arrays performed comparably to or better than unimputed arrays, and sometimes rivaled WGS, suggesting high-quality imputation is a powerful alternative to WGS for many applications.

5. Significance and Conclusion

This study provides a nuanced framework for selecting genotype discovery technologies for polygenic risk prediction:

Context is King: There is no single "best" technology. WGS is superior for highly polygenic traits when using LD-informed methods (PRS-CS). Arrays remain competitive or superior for sparse traits and offer massive advantages in cost and computational efficiency.
Causal Capture is Key: The primary driver of PGS accuracy is the ability to capture causal variants, but this must be balanced against the inclusion of non-informative noise.
Future Directions: As WGS costs decline, it may become the standard. However, for immediate clinical application, especially in diverse populations, optimizing array designs and imputation reference panels remains a high-yield strategy. The study underscores the need for diverse, multi-ancestry GWAS to ensure that the "causal variants" captured by any technology are relevant across all populations.

In summary: While WGS offers a more comprehensive genetic view, its advantage in polygenic prediction is conditional. It excels with Bayesian methods for polygenic traits but faces diminishing returns and higher costs for sparse traits, where well-imputed arrays remain a highly efficient and accurate alternative.
