Language models reveal evidence gaps in variants of uncertain significance

This study presents a language model pipeline that transforms unstructured ClinVar and ClinGen variant summaries into structured evidence data. The approach identifies evidence gaps in Variants of Uncertain Significance (VUS) and shows that approximately 17% of these variants can be reclassified as likely benign or likely pathogenic once external evidence is aggregated.

Li, W., Bhat, V., Yu, T., Lebo, M., Zitnik, M., Cassa, C. A.

Published 2026-03-02

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a mystery: Is a specific genetic change (a "variant") a harmless glitch or a dangerous villain?

In the world of genetics, scientists have found millions of these changes. For many, they have enough clues to say, "This is definitely bad" (Pathogenic) or "This is definitely safe" (Benign). But for thousands of others, the evidence is missing or unclear. These are called Variants of Uncertain Significance (VUS). They are like suspects who haven't been cleared or charged yet. Because doctors can't be sure, they often can't use this information to help patients, leaving a huge gap in medical care.

The problem? The clues are buried in messy, unorganized notebooks.

The Problem: Messy Notebooks

When a lab submits a genetic variant to a public database (like ClinVar), they write a summary explaining why they think it's safe or dangerous. Sometimes they say, "We tested this in a lab, and it broke the protein." Other times they say, "We looked at 10,000 people, and no one had this change."

But these notes are written in free text. One lab might write a paragraph; another might use bullet points. Some say "functional evidence" clearly; others just hint at it. It's like having a library where every book is written in a different language and style. A human expert trying to find all the books that mention "functional tests" would have to read every single page manually. It's too slow, too expensive, and impossible to scale.

The Solution: The AI Detective

The authors of this paper built a digital detective using Large Language Models (AI) to read these messy notes and organize them.

Think of their system as a two-step sorting machine:

  1. Step 1: The "What?" Detector.
    The AI reads a summary and asks: "Does this text mention a lab test? Does it mention how common this is in the population? Does it mention a computer prediction?"

    • Analogy: Imagine a librarian scanning a book's table of contents. If the book has a chapter on "Population Stats," the librarian puts a green sticker on it. If it has "Lab Tests," they put a blue sticker.
  2. Step 2: The "Good or Bad?" Detector.
    Once the AI knows what kind of evidence is there, it asks: "Does this evidence say the variant is dangerous or safe?"

    • Analogy: The librarian now reads the "Lab Tests" chapter. If it says "The test failed," they mark it Dangerous. If it says "The test passed," they mark it Safe.
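The two-step sorting machine above can be sketched in a few lines of Python. This is purely illustrative: the simple keyword rules below stand in for the paper's trained language model, and every function name, keyword list, and label is a hypothetical choice of ours, not the authors' actual implementation.

```python
# Illustrative two-stage evidence extractor. The keyword rules are a
# stand-in for the paper's trained model; all names here are hypothetical.

# Stage 1: detect WHICH evidence types a free-text summary mentions.
EVIDENCE_KEYWORDS = {
    "functional": ["functional assay", "lab test", "protein activity"],
    "population": ["allele frequency", "gnomad", "population"],
    "computational": ["in silico", "predicted", "computational"],
}

def detect_evidence_types(summary: str) -> set[str]:
    text = summary.lower()
    return {
        etype
        for etype, keywords in EVIDENCE_KEYWORDS.items()
        if any(kw in text for kw in keywords)
    }

# Stage 2: for each detected type, judge which DIRECTION the evidence points.
def classify_direction(summary: str) -> str:
    text = summary.lower()
    if any(w in text for w in ["abolished", "damaging", "loss of function"]):
        return "pathogenic"
    if any(w in text for w in ["normal activity", "benign", "tolerated"]):
        return "benign"
    return "unclear"

def extract_evidence(summary: str) -> dict[str, str]:
    """Run both stages: evidence type -> direction, per detected type."""
    direction = classify_direction(summary)
    return {etype: direction for etype in detect_evidence_types(summary)}
```

For example, `extract_evidence("Functional assay showed protein activity was abolished.")` would put a "blue sticker" on the summary (functional evidence) and mark it "pathogenic".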

The Training: Teaching the AI

To teach this AI, the researchers didn't just guess. They created a massive training set called VETA.

  • They took thousands of existing, high-quality summaries from experts.
  • They used other AI models to double-check the work, ensuring the "green stickers" and "blue stickers" were placed correctly.
  • They trained their AI (based on a model called BioBERT, which is like a doctor who has read every medical textbook) to recognize these patterns.
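The cross-checking step in the training recipe above could look something like this. The data structure and field names are our own illustrative guesses at what a VETA-style labeled example might contain; the paper's actual schema may differ.

```python
from dataclasses import dataclass

# Hypothetical shape of one labeled example in a VETA-style training set.
# Field names and label values are illustrative, not the paper's schema.
@dataclass
class LabeledSummary:
    text: str                  # free-text submitter summary
    evidence_types: set        # e.g. {"functional", "population"}
    direction: str             # "pathogenic", "benign", or "unclear"

def labels_agree(a: LabeledSummary, b: LabeledSummary) -> bool:
    """Cross-check two independent annotations of the same summary."""
    return a.evidence_types == b.evidence_types and a.direction == b.direction

def build_consensus_set(pairs):
    """Keep only examples where both annotation passes agree, mimicking
    the 'double-check with other AI models' quality-control step."""
    return [a for a, b in pairs if labels_agree(a, b)]
```

The design intuition is simple: if two independent annotators (human or model) place the same "stickers" on a summary, that label is trusted enough to train on; disagreements are discarded or sent for review.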

The Results: Finding the Hidden Clues

Once the AI was trained, they let it loose on about 6,000 "unsolved cases" (VUS) whose written summaries contained no clearly structured evidence.

The AI found something amazing: many of these "unsolved" cases actually contained the clues; they just weren't written down clearly enough for a human to spot quickly.

By combining the AI's findings with new data (like updated population numbers from the UK Biobank or new lab test results), they could re-evaluate these variants.

  • The Big Reveal: About 17% of these "uncertain" variants could now be confidently classified as either Likely Safe or Likely Dangerous.
  • The Impact: This affects thousands of people. For example, in a gene called LDLR (related to cholesterol), the AI found 124 variants that were stuck in "Uncertain" limbo. With the new evidence, 23 of them could be moved out of limbo and given a clear answer.
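The re-evaluation step (combining AI-extracted evidence with fresh external data) might be sketched as below. This is a deliberately oversimplified rule for illustration: real variant curation follows the ACMG/AMP criteria, which weigh many more evidence categories with far more nuance, and the threshold used here is invented.

```python
# Toy reclassification rule, for illustration only. Real curation uses
# the ACMG/AMP framework; the "two agreeing pieces" threshold is invented.

def reclassify(extracted: dict, external: dict) -> str:
    """Combine AI-extracted evidence directions with new external evidence
    (e.g. updated population frequencies) and return a classification."""
    directions = list(extracted.values()) + list(external.values())
    pathogenic = directions.count("pathogenic")
    benign = directions.count("benign")
    if pathogenic >= 2 and benign == 0:
        return "Likely pathogenic"
    if benign >= 2 and pathogenic == 0:
        return "Likely benign"
    return "VUS"  # conflicting or insufficient evidence stays uncertain
```

The key idea this captures: a variant only leaves "uncertain" limbo when multiple independent lines of evidence point the same way, with none pointing the other way.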

Why This Matters

Imagine a traffic jam where cars are stuck because the traffic light is broken.

  • Before: Experts had to stand on the corner manually checking every car to see if it was safe to move. It was slow, and many cars stayed stuck.
  • Now: The AI is a smart traffic camera system. It instantly scans every car, checks the database for new info, and tells the experts: "Hey, these 17% of cars have all the paperwork they need. Let's move them!"

This doesn't replace the human experts (the traffic cops). Instead, it gives them a priority list. It tells them, "Don't waste time checking these 10,000 cars; focus on these 1,000 that are ready to be solved."

The Bottom Line

This paper shows how AI can turn messy, unstructured medical notes into a clean, organized list of evidence. It helps doctors find the "missing links" in genetic diagnoses faster, potentially turning thousands of "unknowns" into clear answers that can save lives. It's not about replacing the doctor; it's about giving the doctor a super-powered magnifying glass.
