Structured Schemas for LLM-Modeler Collaboration in Quantitative Systems Pharmacology Model Calibration


This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to build a hyper-realistic video game simulation of how a specific type of cancer (pancreatic cancer) grows and how drugs fight it. This isn't just a simple game; it's a "Quantitative Systems Pharmacology" (QSP) model. It's a massive, complex engine with hundreds of moving parts (equations) that need to be tuned perfectly to match reality.

To tune this engine, you need calibration data: real numbers from scientific studies about how cells grow, how drugs kill them, and how the body reacts.

The Problem: The "Hallucinating Librarian"

Traditionally, scientists had to manually read thousands of research papers, find the right numbers, and write them down. This is slow, boring, and prone to human error.

Recently, scientists tried using AI (Large Language Models) to do this reading and writing for them. It's like hiring a super-fast, super-smart librarian who can read a million books in a second.

But there's a catch: This AI librarian is prone to "hallucinations." Sometimes, it confidently makes up numbers that don't exist, invents fake research papers, or misquotes a study. In a video game, a wrong number might make a character move too fast. In a medical model, a wrong number could lead to a drug failing in real life. You can't trust the AI blindly.

The Solution: MAPLE (The "Strict Foreman")

The authors of this paper created a system called MAPLE. Think of MAPLE not as a replacement for the human scientist, but as a strict construction foreman who manages the AI librarian.

Here is how MAPLE works, using simple analogies:

1. The Blueprint (Structured Schemas)

Instead of letting the AI just write a paragraph of text, MAPLE forces the AI to fill out a very specific, rigid digital form (a schema).

Analogy: Imagine the AI is a contractor building a house. Instead of saying, "I'll build a nice room," the foreman hands them a blueprint that says: "Wall must be exactly 10 feet high, made of brick, with a window here. If you don't follow these exact specs, the house is rejected."
This ensures the AI doesn't just guess; it has to fit the data into a strict format.

2. The "Receipt" Rule (Value-in-Snippet)

The most important rule in MAPLE is: "Show me the receipt."

If the AI says, "The cancer cell growth rate is 0.5," it must also paste the exact sentence from the original paper where that number appears.
Analogy: It's like a cashier who won't let you buy anything unless you show the price tag on the shelf. If the AI makes up a number, it can't find the "price tag" (the text snippet) to prove it. The system immediately catches the lie.

3. The Double-Check (Validators)

MAPLE has a team of automated inspectors (validators) that check the AI's work before a human ever sees it.

The DOI Detective: Checks if the research paper the AI cited actually exists.
The Math Police: Checks if the units make sense (e.g., making sure you didn't mix up "grams" and "kilograms").
The Code Inspector: Runs the math to make sure the formulas actually work.
Analogy: It's like a security checkpoint at an airport. If your ID (citation) is fake or your bag (data) has prohibited items (wrong units), you don't get through. The AI has to try again until it passes.

4. The Human-in-the-Loop (The Expert Pilot)

The paper found that the AI is great at finding information and filling out forms, but it's terrible at making scientific judgment calls.

Analogy: The AI is the autopilot that can fly the plane and navigate the map. But the human scientist is the pilot who has to decide where to fly, interpret the weather, and fix the engine if the autopilot gets confused.
In the study, the human scientist replaced the AI's choice of simplified model in about 65% of the "single part" files and revised its judgment of how relevant each source was in every one of them. These were not typo fixes; they were decisions such as, "This study was done on mice, but we need human data, so we need to adjust the numbers to account for the difference."

The Two Types of "Forms"

MAPLE uses two different types of forms for two different jobs:

The "Single Part" Form (SubmodelTarget): Used for simple, isolated experiments (like testing how fast a single cell grows in a petri dish). This helps tune specific parts of the engine.
The "Whole System" Form (CalibrationTarget): Used for complex, real-world scenarios (like how a whole tumor shrinks in a patient after treatment). This checks if the entire engine runs smoothly together.

The Result

By using MAPLE, the researchers were able to calibrate a model of pancreatic cancer whose fitted parameters came out biologically plausible.

The AI did the heavy lifting of reading papers and finding numbers.
The Validators caught the AI's lies and mistakes automatically.
The Human Scientist applied their expertise to interpret the data and make the final scientific decisions.

Why This Matters

This isn't about replacing scientists with robots. It's about teamwork.

Before: Scientists spent weeks reading papers and worrying they missed a detail.
Now: The AI acts as a tireless research assistant that gathers the raw materials, but the scientist acts as the architect who ensures the building is safe and sound.

The result is a model that is reproducible (you can see exactly where every number came from), trustworthy (we know the numbers are real), and efficient (we didn't waste time on fake data). It turns the chaotic process of "finding data" into a structured, reliable assembly line.

1. Problem Statement

Quantitative Systems Pharmacology (QSP) models, which consist of complex systems of ordinary differential equations (ODEs), require rigorous calibration against diverse literature data (from isolated in vitro assays to clinical endpoints).

The Bottleneck: Manual curation of this data is labor-intensive, prone to inconsistent documentation, and leads to a loss of institutional knowledge when personnel change.
The LLM Limitation: Large Language Models (LLMs) offer a flexible alternative for extracting data, but they hallucinate (report values the source never states) and fabricate citations (cite papers that do not exist). These errors are unacceptable for quantitative modeling, where precision and provenance are critical.
The Gap: Existing automated extraction pipelines lack the specific validation mechanisms and structural rigor required to handle the diverse parameter types, uncertainty quantification, and traceability needs of QSP modeling.

2. Methodology: The MAPLE Framework

The authors present MAPLE (Model-Aware Parameterization from Literature Evidence), a framework designed to facilitate collaboration between LLMs and human modelers using structured validation schemas.

A. Core Components

Model-Aware Literature Search:
- The system injects mechanistic context (parameter names, units, reaction networks, species) into LLM prompts to guide web searches.
- It supports multiple LLM providers (OpenAI, Anthropic, Google) and uses specific models (GPT-5.1 for batch, Claude Opus 4.6 for interactive) to find relevant literature.
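As a rough illustration of what "injecting mechanistic context" could look like, the sketch below assembles a search prompt from a parameter's name, units, reactions, and species. The `ParameterContext` class, the `build_search_prompt` function, the prompt wording, and the example parameter are all assumptions made for this illustration; they are not taken from the MAPLE code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterContext:
    """Hypothetical container for the model context injected into a prompt."""
    name: str
    units: str
    description: str
    reactions: tuple[str, ...] = ()
    species: tuple[str, ...] = ()


def build_search_prompt(ctx: ParameterContext, indication: str) -> str:
    """Turn model context into plain-text instructions for a literature search."""
    lines = [
        f"Find literature values for the model parameter '{ctx.name}' ({ctx.units}).",
        f"Meaning in the model: {ctx.description}",
        f"Disease context: {indication}",
    ]
    if ctx.reactions:
        lines.append("Reactions that use this parameter:")
        lines.extend(f"  - {reaction}" for reaction in ctx.reactions)
    if ctx.species:
        lines.append("Model species involved: " + ", ".join(ctx.species))
    lines.append("For every number, quote the exact sentence it came from and give the DOI.")
    return "\n".join(lines)


# Illustrative parameter; the name and reaction are invented for this example.
ctx = ParameterContext(
    name="k_growth",
    units="1/day",
    description="first-order proliferation rate of tumor cells",
    reactions=("C -> 2 C",),
    species=("C",),
)
prompt = build_search_prompt(ctx, "pancreatic ductal adenocarcinoma")
print(prompt)
```

Keeping the context in a typed record rather than free text means the same fields can later be reused to check the units of whatever the search returns.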
Dual-Schema Architecture:
The framework defines two complementary YAML-based schemas to separate data extraction (what the paper reports) from model specification (how the data is used).
- SubmodelTarget Schema: For isolated experiments (e.g., in vitro assays) that constrain individual parameters. It uses simplified forward models (e.g., exponential growth ODEs, algebraic equations) to derive priors.
- CalibrationTarget Schema: For clinical and in vivo endpoints that constrain the full model. It defines observables as Python functions mapping model states to measurable quantities and includes Monte Carlo simulations to derive summary statistics from raw literature data.
- Shared Features: Both enforce "data-first" principles, requiring verbatim text snippets for all numeric values, structured source documentation (DOI, authors), and explicit "Source Relevance Assessment" (e.g., species translation, indication match).
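The two schema roles can be pictured with a small standard-library sketch. It assumes a provenance record (`ExtractedValue`), an exponential-growth forward model that converts a reported doubling time into a growth rate, a Monte Carlo step that summarizes the resulting log-rate as a prior, and an observable that maps a model state to a measurable volume. The field names, the normal sampling of the doubling time, the per-cell volume, and every number below are illustrative assumptions, not values or structures from the paper.

```python
import math
import random
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedValue:
    """A reported number plus its provenance (field names are hypothetical)."""
    name: str
    value: float
    units: str
    snippet: str  # verbatim sentence from the paper
    doi: str


def rate_from_doubling_time(doubling_time_days: float) -> float:
    """Exponential growth N(t) = N0 * exp(k * t) implies k = ln(2) / Td."""
    return math.log(2.0) / doubling_time_days


def log_rate_prior(mean_td: float, sd_td: float, n_draws: int = 20_000, seed: int = 0):
    """Monte Carlo: sample doubling times, push each through the forward model,
    and summarize log(k) by its mean and standard deviation."""
    rng = random.Random(seed)
    log_rates = []
    while len(log_rates) < n_draws:
        td = rng.gauss(mean_td, sd_td)
        if td > 0:  # discard non-physical draws
            log_rates.append(math.log(rate_from_doubling_time(td)))
    return statistics.fmean(log_rates), statistics.stdev(log_rates)


def tumor_volume_mm3(state: dict, cell_volume_mm3: float = 1e-6) -> float:
    """Observable: map a model state (cell count) to a measurable volume.
    The default per-cell volume is an assumption for illustration."""
    return state["tumor_cells"] * cell_volume_mm3


# Illustrative record; the numbers, snippet, and DOI are invented placeholders.
doubling = ExtractedValue(
    name="doubling_time",
    value=2.0,
    units="day",
    snippet="Cells doubled every 2.0 ± 0.2 days.",
    doi="10.0000/placeholder",
)
mu, sigma = log_rate_prior(doubling.value, 0.2)
print(f"log-rate prior: mean={mu:.3f}, sd={sigma:.3f}")
print(tumor_volume_mm3({"tumor_cells": 1e9}))
```

The record holds only what the paper reports, while the functions hold how the model uses it, which mirrors the separation between extraction and specification described above.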
Targeted Validation Framework:
A multi-layered validation system using Pydantic catches characteristic LLM errors before human review:
- Value-in-Snippet Matching: Ensures extracted numbers appear verbatim in the provided source text snippet.
- DOI Resolution: Verifies citations against CrossRef metadata to detect fabricated references.
- Code Execution: Runs observation functions with mock data to catch logic errors and unit mismatches.
- External Validation: Fetches full papers (via Europe PMC/Unpaywall) to verify that snippets actually exist in the source document.
- Unit & Dimensional Checks: Uses the Pint library to ensure dimensional consistency.
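A minimal sketch of the value-in-snippet idea, using only the standard library rather than Pydantic, might look like the following. It compares numbers numerically (so "0.50" matches 0.5), which is looser than a strictly verbatim match. The DOI check is only a format pre-check based on a commonly used pattern; real resolution against CrossRef needs a network lookup that is omitted here. The function names and regular expressions are assumptions for illustration, and notations such as "5 × 10^4" are not handled.

```python
import math
import re

# Digits not glued to a preceding letter, digit, or dot (so "CD8" and "v1.2" do not count).
_NUMBER = re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")
# A comma that sits between a digit and a group of exactly three digits.
_THOUSANDS = re.compile(r"(?<=\d),(?=\d{3}\b)")
# Common DOI shape: "10.", a registrant code, a slash, then a suffix.
_DOI = re.compile(r"^10\.\d{4,9}/\S+$")


def numbers_in(text: str) -> list[float]:
    """Return every standalone number found in the text."""
    cleaned = _THOUSANDS.sub("", text)
    return [float(token) for token in _NUMBER.findall(cleaned)]


def value_in_snippet(value: float, snippet: str, rel_tol: float = 1e-9) -> bool:
    """True if the extracted value appears as a number in the quoted snippet."""
    return any(math.isclose(value, found, rel_tol=rel_tol) for found in numbers_in(snippet))


def looks_like_doi(doi: str) -> bool:
    """Cheap format check; it does not prove that the DOI resolves."""
    return bool(_DOI.match(doi))


snippet = "The growth rate was 0.50 per day (n = 3)."
print(value_in_snippet(0.5, snippet))
print(value_in_snippet(0.7, snippet))
print(looks_like_doi("10.1000/xyz123"))
```

A value that fails this check is not necessarily wrong, but it has no quoted evidence behind it, which is enough to send the extraction back for another attempt.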
Code Generation:
Validated SubmodelTarget specifications are automatically translated into Julia scripts using Turing.jl for Bayesian inference. The system handles joint inference for shared parameters across multiple targets.
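To make the idea of joint inference over shared parameters concrete, here is a hedged sketch of a generator that renders a Turing.jl model as text. The `Prior` and `Target` records, the naming of the data arguments, and the choice of a Normal likelihood are assumptions for this illustration and do not reproduce MAPLE's actual generator. The point it shows is that a parameter used by several targets is declared once, so all of those targets constrain the same quantity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prior:
    parameter: str
    distribution: str  # Julia expression, e.g. "LogNormal(-1.0, 0.5)"


@dataclass(frozen=True)
class Target:
    name: str
    parameters: tuple[str, ...]
    prediction: str  # Julia expression written in terms of the parameters
    sigma: float     # assumed observation noise


def render_turing_model(priors, targets, model_name="joint_model") -> str:
    """Render one Turing.jl model in which shared parameters appear exactly once."""
    declared = {}
    for prior in priors:
        if declared.get(prior.parameter, prior.distribution) != prior.distribution:
            raise ValueError(f"conflicting priors for {prior.parameter}")
        declared[prior.parameter] = prior.distribution
    used = {name for target in targets for name in target.parameters}
    missing = used - declared.keys()
    if missing:
        raise ValueError(f"no prior for: {sorted(missing)}")
    args = ", ".join(f"y_{target.name}" for target in targets)
    lines = ["using Turing", "", f"@model function {model_name}({args})"]
    for name in sorted(used):
        lines.append(f"    {name} ~ {declared[name]}")
    for target in targets:
        lines.append(f"    y_{target.name} ~ Normal({target.prediction}, {target.sigma})")
    lines.append("end")
    return "\n".join(lines)


# Two invented targets that both depend on the same growth rate.
priors = [Prior("k_growth", "LogNormal(-1.0, 0.5)")]
targets = [
    Target("doubling", ("k_growth",), "log(2) / k_growth", 0.2),
    Target("fold", ("k_growth",), "exp(3 * k_growth)", 0.3),
]
script = render_turing_model(priors, targets)
print(script)
```

Generating the script from validated specifications, rather than writing it by hand, keeps the inference code in step with the curated data.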

B. Workflow Modes

Batch Extraction: Automated LLM extraction followed by a self-correcting retry loop (triggered by validators) and subsequent human curation.
Interactive Extraction: Real-time collaboration where the modeler guides the LLM (e.g., via Claude Code) during the extraction process, embedding expert judgment directly into the output.
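The self-correcting retry loop in batch mode can be sketched as follows. The validators return error messages, and those messages are passed back to the extractor on the next attempt. The extractor here is a stub standing in for an LLM call; the function names, the draft layout, and the default cap of 7 retries are assumptions chosen for illustration.

```python
from typing import Callable

Draft = dict
Validator = Callable[[Draft], list]


def check_units(draft: Draft) -> list:
    """Flag a missing or empty units field."""
    return [] if draft.get("units") else ["units: missing or empty"]


def check_snippet(draft: Draft) -> list:
    """Flag a value whose text form is absent from the quoted snippet."""
    found = str(draft["value"]) in draft.get("snippet", "")
    return [] if found else ["snippet: value not found in quoted text"]


def extract_with_retries(extract, validators, max_retries: int = 7):
    """Call the extractor until every validator passes or retries run out.
    Returns the accepted draft and the number of retries that were needed."""
    feedback = []
    for retries_used in range(max_retries + 1):
        draft = extract(feedback)
        feedback = [msg for check in validators for msg in check(draft)]
        if not feedback:
            return draft, retries_used
    raise RuntimeError(f"still failing after {max_retries} retries: {feedback}")


def stub_llm(feedback: list) -> Draft:
    """Stand-in for an LLM call: repairs the units once told about them."""
    draft = {"value": 0.5, "units": "", "snippet": "The rate was 0.5 per day."}
    if any(msg.startswith("units") for msg in feedback):
        draft["units"] = "1/day"
    return draft


draft, retries_used = extract_with_retries(stub_llm, [check_units, check_snippet])
print(draft, retries_used)
```

Only drafts that clear every validator reach the human curator, so the modeler's time goes to scientific judgment rather than to catching format and provenance errors.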

3. Key Contributions

Structured Collaboration Interface: Demonstrates that structured schemas can effectively bridge the gap between LLMs and domain experts, clarifying where automation ends and human judgment begins.
Error Detection Mechanisms: Introduces specific validators (snippet matching, DOI resolution, code execution) that systematically detect and correct LLM hallucinations and fabrication.
Provenance-Rich Calibration: Ensures every data point is traceable to a specific source snippet, with explicit documentation of translation uncertainties (e.g., mouse-to-human scaling).
Dual-Scale Calibration: Provides a unified framework for calibrating both individual parameters (via simplified submodels) and full-system endpoints (via complex observables).

4. Results

The framework was evaluated on a Pancreatic Ductal Adenocarcinoma (PDAC) QSP model involving 87 calibration targets (18 SubmodelTarget and 59 CalibrationTarget).

Validation Performance:
- Zero First-Pass Success: None of the 18 initial batch extractions passed validation on the first attempt.
- Retry Efficiency: Most targets passed after 1–2 automated retries, though complex targets (e.g., CCL2 secretion) required up to 7 iterations.
- Error Categories: The most common errors caught were unit inconsistencies (38%), followed by prior specification issues (21%) and fabrication (17%).
Human Curation Impact:
- Substantial Modeler Input: Even with LLM assistance, modelers made significant changes:
  - Changed forward model types in 65% of SubmodelTarget files.
  - Adjusted prior parameters in 46% of files.
  - Revised source relevance assessments in 100% of files.
- Interactive vs. Batch: Interactive extraction (where modelers guided the LLM in real-time) resulted in near-final outputs with minimal post-hoc revision, whereas batch extraction required extensive manual correction.
Inference Success:
- All 18 SubmodelTarget parameters were successfully translated into a joint Julia inference script.
- The Bayesian inference converged successfully ($\hat{R} < 1.01$, ESS > 3000), producing biologically plausible estimates (e.g., tumor doubling time ~130 days).
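For readers unfamiliar with the convergence statistic, the sketch below computes the classic split $\hat{R}$: each chain is cut in half, and the variance between the half-chain means is compared with the variance within them. Values near 1 indicate that the chains agree. This is the textbook formula applied to synthetic draws; it is not necessarily the exact variant computed by the paper's toolchain, and the chains below are simulated, not results from the study.

```python
import math
import random
import statistics


def split_rhat(chains: list) -> float:
    """Classic split R-hat for equal-length chains of draws of one parameter."""
    halves = []
    for chain in chains:
        half = len(chain) // 2
        halves.append(chain[:half])
        halves.append(chain[half:2 * half])
    n = len(halves[0])
    means = [statistics.fmean(h) for h in halves]
    within = statistics.fmean([statistics.variance(h) for h in halves])  # W
    between = n * statistics.variance(means)                             # B
    var_plus = (n - 1) / n * within + between / n
    return math.sqrt(var_plus / within)


# Synthetic draws: four chains sampling the same distribution, then two that disagree.
rng = random.Random(0)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]
stuck = [[rng.gauss(0.0, 1.0) for _ in range(2000)],
         [rng.gauss(5.0, 1.0) for _ in range(2000)]]
print(split_rhat(mixed))
print(split_rhat(stuck))
```

The first call should land very close to 1, while the second is far above the 1.01 threshold quoted above, which is the kind of signal that would mark an inference run as not converged.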

5. Significance and Implications

Redefining the Human-in-the-Loop: The study argues that LLMs should not replace modelers but rather restructure the workflow. The schema acts as a contract where the LLM handles text processing and code generation, while the modeler focuses on high-level scientific decisions (model structure, uncertainty quantification, relevance assessment).
Reproducibility: By enforcing strict provenance (snippets, DOIs, translation justifications), MAPLE creates a reproducible audit trail for QSP calibration, addressing a major weakness in current manual practices.
Scalability: The framework is applicable to any parameter-rich mechanistic model, not just QSP. It provides a pathway to scale model calibration as the volume of scientific literature grows.
Limitations & Future Work: The system currently relies on LLM web search capabilities (which may miss closed-access papers) and struggles with extracting data from figures (requiring manual digitization). Future work aims to integrate multimodal vision capabilities and institutional authentication for broader paper access.

In conclusion, MAPLE demonstrates that combining structured schemas, rigorous automated validators, and human expertise creates a robust pipeline for QSP model calibration, one in which the hallucination-prone output of LLMs is caught and corrected rather than trusted.