This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
The Big Problem: The "Unknown" Variants
Imagine your DNA is a massive instruction manual for building a human body. Sometimes, a typo happens in this manual. Most typos are harmless (like a misspelled word that doesn't change the meaning), but some are dangerous (like changing "stop" to "go").
Doctors use a system called ACMG/AMP to decide if a typo is dangerous. They look for clues. However, there are millions of typos that are so rare or confusing that doctors can't decide if they are good or bad. These are called Variants of Uncertain Significance (VUS). It's like having a traffic light that is stuck blinking yellow—you don't know if you should stop or go, which is terrifying for patients waiting for a diagnosis.
The Old Way: The "Sorting Hat" (AUROC)
To help solve this, scientists have built two types of tools:
- Computational Predictors (VEPs, Variant Effect Predictors): Super-smart AI computers that guess if a typo is bad based on patterns.
- Multiplexed Assays (MAVEs, Multiplexed Assays of Variant Effect): High-tech lab experiments that actually test thousands of typos in a petri dish to see what they do.
For years, we judged these tools using a metric called AUROC (Area Under the Receiver Operating Characteristic curve). Think of AUROC as a "Sorting Hat" test. It asks: "How well can this tool separate the 'bad' typos from the 'good' typos?"
- If the tool puts all the bad ones in one pile and all the good ones in another, it gets a high score.
- The Flaw: Just because a tool is good at sorting doesn't mean it's good at helping doctors. A tool might sort perfectly but only give vague answers like "maybe," which doesn't help a doctor make a life-or-death decision.
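To make the "Sorting Hat" idea concrete: AUROC has a simple interpretation as the probability that a randomly chosen "bad" typo gets a higher score than a randomly chosen "good" one. The toy function below (not from the paper; scores are invented) shows why a tool can score a perfect 1.0 even when all of its answers are timid "maybes" clustered around the middle:

```python
def auroc(pathogenic_scores, benign_scores):
    """AUROC = probability a random pathogenic variant outscores a random
    benign one (ties count as half). Pairwise definition, for illustration."""
    wins = ties = 0
    for p in pathogenic_scores:
        for b in benign_scores:
            if p > b:
                wins += 1
            elif p == b:
                ties += 1
    total = len(pathogenic_scores) * len(benign_scores)
    return (wins + 0.5 * ties) / total

# A "timid" tool: every score hovers near 0.5, yet sorting is flawless.
print(auroc([0.51, 0.52], [0.49, 0.48]))  # 1.0 -- perfect AUROC, vague answers
```

This is exactly the flaw the paper points at: the pairwise ranking behind AUROC never asks how far apart the piles are, only which pile each typo lands in.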
The New Idea: The "Evidence Yield" (MES)
This paper introduces a new way to judge these tools called Mean Evidence Strength (MES).
Instead of asking, "How well does it sort?", MES asks: "How much proof does this tool actually give us?"
The Analogy: The Detective's Case File
Imagine a detective trying to solve a crime.
- The Old Way (AUROC): We judge the detective by how well they can tell the difference between a "guilty" suspect and an "innocent" suspect in a lineup.
- The New Way (MES): We judge the detective by how much hard evidence they bring to the courtroom.
- Did they find a smoking gun? (Strong Evidence)
- Did they find a fingerprint? (Moderate Evidence)
- Did they just say, "It looks suspicious"? (Weak Evidence)
- Or did they say, "I have no idea"? (No Evidence)
MES calculates the average amount of "proof" a tool provides across all the typos it looks at. It converts the tool's score into standard "evidence points" that doctors can actually use in their guidelines.
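A rough sketch of that idea in code (illustrative only: the score thresholds below are invented, and the point values follow the general ACMG-style points convention where Supporting = 1, Moderate = 2, Strong = 4, Very Strong = 8; the paper's actual calibration will differ):

```python
def evidence_points(score):
    """Map a tool's pathogenicity score to ACMG-style evidence points.
    Thresholds here are made up for illustration, not the paper's calibration."""
    if score >= 0.99:
        return 8   # Very Strong
    if score >= 0.95:
        return 4   # Strong
    if score >= 0.85:
        return 2   # Moderate
    if score >= 0.70:
        return 1   # Supporting
    return 0       # No usable evidence

def mean_evidence_strength(scores):
    """MES as sketched here: average evidence points over all variants scored."""
    return sum(evidence_points(s) for s in scores) / len(scores)

confident_tool = [0.99, 0.96, 0.90]  # confident calls -> real evidence points
timid_tool = [0.60, 0.62, 0.65]      # sorts fine, but every call is a "maybe"
print(mean_evidence_strength(confident_tool))  # (8 + 4 + 2) / 3
print(mean_evidence_strength(timid_tool))      # 0.0
```

Note that the "timid" tool could still rank these variants perfectly, so the two metrics really do measure different things: AUROC rewards sorting, MES rewards usable proof.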
What They Discovered
The researchers tested 12 different AI computers and 15 different lab experiments. Here is what they found:
- Sorting ≠ Proving: Some tools were great at sorting (high AUROC) but terrible at providing proof (low MES). They were like a sorting hat that puts everyone in the right pile but refuses to tell you why.
- The Lab Experiments (MAVEs) Won on Proof: Even though the lab experiments were sometimes worse at sorting than the AI, they provided more actual evidence (higher MES). It's like a lab test that might make a few mistakes in sorting, but when it does give an answer, it comes with a mountain of hard data.
- The Winner (CPT-1): Among the AI computers, one called CPT-1 was the best. It didn't just sort well; it provided the strongest, most usable evidence for the greatest number of "unknown" variants.
Why This Matters
This new framework (MES) changes the game for geneticists.
- Before: They might pick a tool because it had the highest "sorting score," only to find out later that the tool couldn't actually help them diagnose a patient.
- Now: They can pick the tool that generates the most clinical evidence.
The Bottom Line:
This paper tells us to stop just looking at how well a tool guesses the answer. Instead, we should look at how much proof the tool gives us to help doctors solve the mystery of the "unknown" genetic typos. It's the difference between a tool that says "It's probably bad" and a tool that says "Here is the evidence proving it is bad."