LLM-PathwayCurator transforms enrichment terms into audit-gated decision-grade claims

LLM-PathwayCurator is a framework that converts pathway enrichment outputs into auditable, evidence-linked, decision-grade claims. It uses an audit-gated abstention mechanism to make omics interpretation reproducible and quality-assured, although its performance remains sensitive to context changes and to the availability of gene-level support.

Original authors: Furudate, K., Takahashi, K.

Published 2026-02-19

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a massive crime scene. You have a huge pile of clues (genetic data) and a list of potential suspects (biological pathways). Your job is to figure out which suspects are actually guilty.

Traditionally, scientists have done this by running a computer program that spits out a list of "likely suspects" with some statistics. But here's the problem: The computer doesn't explain why it thinks they are guilty. It just gives a list. A human analyst then has to read the list, guess which clues matter, and write a story. This is like asking a detective to write a report based on a hunch, without checking if the evidence actually holds up. It's hard to repeat, and if you ask a different detective, they might write a completely different story.

Enter "LLM-PathwayCurator."

Think of this new tool as a super-strict, rule-following Judge who works alongside a Creative Assistant (the AI). Here is how it works, using a simple analogy:

1. The Evidence Locker (The "EvidenceTable")

First, the tool takes the messy list of clues and organizes it into a strict, digital evidence locker. Every single clue is tagged with exactly which suspect it points to. Nothing is left to guesswork.
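To make the locker concrete, here is a minimal sketch of what such a table could look like as a data structure. All class and field names here (EvidenceRow, pathway_id, effect_size) are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceRow:
    """One clue: a single gene explicitly tied to a single pathway."""
    pathway_id: str     # the "suspect", e.g. a pathway database ID
    gene: str           # the "clue", e.g. a gene symbol
    effect_size: float  # how strongly the clue points at the suspect
    p_value: float      # statistical strength from the enrichment run

@dataclass
class EvidenceTable:
    """The evidence locker: every clue tagged to exactly one suspect."""
    rows: list[EvidenceRow] = field(default_factory=list)

    def genes_for(self, pathway_id: str) -> set[str]:
        """All genes on record for a given pathway."""
        return {r.gene for r in self.rows if r.pathway_id == pathway_id}
```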

2. The "Stress Test" (The Audit)

Before the Judge makes a decision, they put the evidence through a stress test.

  • The "What-If" Game: The tool asks, "What if we removed 5% of the clues? Does the suspect still look guilty?"
  • The "Wrong Context" Test: The tool asks, "What if we tried to pin this crime on a suspect from a different city? Does the evidence still make sense?"

If the evidence falls apart when you tweak it slightly, the tool knows the conclusion is fragile.
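Here is a rough sketch of the first test, the "What-If" game, written as a resampling loop. The 5% drop fraction comes from the analogy above; the trial count, the significance threshold, and the enrichment_pvalue callback are assumptions for illustration, not the paper's exact procedure:

```python
import random

def leave_out_audit(genes: list[str], enrichment_pvalue,
                    drop_frac: float = 0.05, trials: int = 100,
                    alpha: float = 0.05) -> float:
    """Repeatedly drop a small fraction of the supporting genes,
    re-run the enrichment test, and report how often the pathway
    stays significant (1.0 = rock solid, 0.0 = falls apart)."""
    n_keep = len(genes) - max(1, int(len(genes) * drop_frac))
    survived = 0
    for _ in range(trials):
        subset = random.sample(genes, n_keep)
        if enrichment_pvalue(subset) < alpha:  # still looks "guilty"?
            survived += 1
    return survived / trials
```

The returned stability score (the fraction of trials the conclusion survives) is what the Judge consults in the next step.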

3. The Creative Assistant vs. The Strict Judge

This is where the "LLM" (Large Language Model) part comes in.

  • The Assistant (LLM): The AI is allowed to look at the evidence and say, "Hey, I think this suspect is guilty because of these specific clues." It writes a draft claim.
  • The Judge (The Audit Gates): The AI cannot make the final decision. Instead, a set of rigid, unbreakable rules (the Judge) checks the AI's work against three yes/no gates, sketched in code after this list:
    • Did the AI link the claim to the actual evidence in the locker? (Yes/No)
    • Did the evidence survive the stress test? (Yes/No)
    • Is the story consistent with the context? (Yes/No)
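Expressed as code, that checklist could reduce to three boolean gates. This is a minimal sketch reusing the assumed names from the earlier snippets; the paper's actual gate definitions and thresholds may differ:

```python
def gate_linked(claim, table) -> bool:
    """Gate 1: every gene the claim cites must exist in the locker."""
    return set(claim.cited_genes) <= table.genes_for(claim.pathway_id)

def gate_stable(stability_score: float, threshold: float = 0.95) -> bool:
    """Gate 2: the claim must have survived the leave-out stress test.
    (0.95 is an assumed threshold, not the paper's.)"""
    return stability_score >= threshold

def gate_in_context(claim, study_context: str) -> bool:
    """Gate 3: the claimed biology must match the study's context,
    so no pinning the crime on a suspect from a different city."""
    return claim.context == study_context
```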

4. The Verdict: Pass, Abstain, or Fail

The Judge doesn't just say "Guilty" or "Not Guilty." It has three specific outcomes, tied together in the code sketch after this list:

  • PASS: The evidence is solid, the context fits, and the AI's claim is backed by unshakeable proof. We can publish this.
  • FAIL: The evidence is broken, contradictory, or the AI made a mistake. Throw this out.
  • ABSTAIN (The most important one): This is the tool's superpower. If the evidence is weak or the context is unclear, the tool says, "I don't know, and I won't guess." It refuses to make a decision rather than making a risky one.
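Putting the gates and the verdict together, the triage logic might look like the sketch below. The way FAIL and ABSTAIN are separated here (broken evidence fails; fragile or off-context evidence triggers abstention) is one plausible reading of the analogy, not the paper's exact rule:

```python
def verdict(linked: bool, stable: bool, in_context: bool) -> str:
    """Three-way triage instead of a forced yes/no decision."""
    if linked and stable and in_context:
        return "PASS"     # solid evidence, correct context: publish
    if not linked:
        return "FAIL"     # the claim cites evidence that isn't there
    return "ABSTAIN"      # evidence exists but is fragile or off-context
```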

Why is this a big deal?

Imagine a doctor diagnosing a patient.

  • Old Way: "The computer says 'maybe cancer,' so I'll write a report saying 'likely cancer' based on my gut feeling." (High risk of error, hard to check).
  • New Way (LLM-PathwayCurator): "The computer found some clues. I stress-tested them. They failed the stress test. Therefore, I abstain from diagnosing cancer until we get better evidence."

The Bottom Line

This paper introduces a system that turns messy, subjective biological interpretation into auditable, decision-grade claims. It forces the AI to be honest: if the evidence isn't strong enough, it admits it doesn't know, rather than making up a story. It's like upgrading from a detective who guesses to a courtroom where every claim must be proven beyond a reasonable doubt before it's accepted.
