The Big Idea: Why "Real Life" Doesn't Always Match the "Textbook"
Imagine you buy a high-end recipe book (a Randomized Clinical Trial or RCT). The book promises that if you bake a specific cake with exact ingredients, it will rise perfectly and taste amazing. This is the "gold standard" of cooking.
Now, imagine you try to bake that same cake in your own kitchen (a Health System using Electronic Health Records or EHR). You use your own oven, your own brand of flour, and your own cooking style. Sometimes, the cake turns out great. But often, it's a little flat, or it burns on the edges, or it tastes slightly different.
The Problem: For a long time, doctors and scientists have looked at these "flat cakes" (real-world data) and thought, "We must have done something wrong. We messed up the recipe." They assumed the textbook was perfect and the real-world attempt was flawed.
The New Insight: This paper argues that the "flat cake" isn't necessarily a mistake. It's actually a signature of your specific kitchen. Your oven runs hotter, your flour is different, and your neighborhood has different humidity. The difference between the textbook and your kitchen isn't just "error"—it's data about how your specific system works.
The Solution: The "AI Sous-Chef" (Biomni)
The researchers built an AI agent named Biomni (think of it as a super-smart, tireless AI Sous-Chef) to solve this.
- The Task: Instead of baking one cake once, the AI was told to bake each recipe five separate times, trying three different methods on each attempt. It did this for five famous recipes (landmark trials of blood thinners).
- The Magic: The AI didn't just bake; it kept a detailed log of why the cake turned out the way it did. It compared its "kitchen results" against the "textbook results" every single time.
- The Pattern: By doing this over and over, the AI started to notice a pattern. "Hey, every time we bake this specific type of cake in the Mount Sinai kitchen, it comes out 10% flatter than the book says, no matter who bakes it." (A rough sketch of this replicate-and-log loop follows below.)
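Dropping the analogy for a moment, the loop might look something like this. Everything in the sketch is a placeholder: the trial names, the three method names, the simulated "flatness" shift, and the `emulate_trial` function all stand in for whatever target-trial emulations Biomni actually runs against the EHR.

```python
import random

# Hypothetical sketch of the replicate-and-log loop described above.
# Trial names, methods, and the simulated local shift are illustrative
# placeholders, not values or code from the paper.

RCT_EFFECTS = {  # the "textbook" results from five published trials
    "trial_A": 0.79, "trial_B": 0.88, "trial_C": 0.70,
    "trial_D": 0.85, "trial_E": 0.91,
}
METHODS = ["propensity_matching", "inverse_prob_weighting", "outcome_regression"]
N_REPLICATES = 5

def emulate_trial(trial, rct_effect, method, replicate):
    """Stand-in for one EHR emulation: simulate a result that is
    systematically a bit 'flatter' than the RCT, plus noise."""
    rng = random.Random(f"{trial}-{method}-{replicate}")
    return rct_effect - 0.08 + rng.gauss(0, 0.03)

discrepancy_log = []
for trial, rct_effect in RCT_EFFECTS.items():
    for method in METHODS:
        for replicate in range(N_REPLICATES):
            ehr_effect = emulate_trial(trial, rct_effect, method, replicate)
            discrepancy_log.append({
                "trial": trial, "method": method, "replicate": replicate,
                "rct_effect": rct_effect, "ehr_effect": ehr_effect,
                "gap": ehr_effect - rct_effect,  # the kitchen's "signature"
            })
```

The point is the log: the gaps themselves, not just the final estimates, become the training data for the calibration step.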
The "Calibration" Machine
Once the AI noticed these patterns, the researchers used a special mathematical tool (a Bayesian Model) to act like a translator.
- Old Way: "The book says the cake is perfect. Your kitchen made it flat. You failed."
- New Way: "The book says the cake is perfect. We know your kitchen makes cakes 10% flatter on average. So, when we see a flat cake here, we know it's actually a 'perfect' cake for this kitchen."
The AI learned to calibrate the results. It took the "textbook" promise and adjusted it to fit the "local reality" of the hospital.
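In math terms, the simplest version of this translator is a one-parameter Bayesian update: treat every past textbook-vs-kitchen gap as a noisy reading of one underlying site bias, then shift the next textbook number by the posterior estimate of that bias. The sketch below is a deliberately minimal normal-normal stand-in for the paper's model; the prior, noise variance, and all numbers are assumptions for illustration.

```python
def calibrate(gaps, prior_mean=0.0, prior_var=1.0, noise_var=0.05):
    """Conjugate normal-normal update for a single site-level bias.
    The prior and noise variance here are illustrative assumptions."""
    post_precision = 1.0 / prior_var + len(gaps) / noise_var
    post_mean = (prior_mean / prior_var + sum(gaps) / noise_var) / post_precision
    return post_mean, 1.0 / post_precision

def predict_local_effect(rct_effect, gaps):
    """Translate a 'textbook' effect into a calibrated local expectation."""
    bias, _ = calibrate(gaps)
    return rct_effect + bias

# Past emulations ran roughly 8% "flatter" than the trials reported...
past_gaps = [-0.07, -0.09, -0.08, -0.10, -0.06]
# ...so a new trial reporting an effect of 0.80 is expected locally near 0.72.
print(predict_local_effect(0.80, past_gaps))  # ~0.721
```

A richer model would let the bias vary by drug class or analysis method, but the core move is the same: learn the offset from past disagreements, then apply it forward.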
The Results: A "Local Truth"
When they tested this system:
- Before Calibration: The AI's predictions were often way off from the textbook (like guessing the cake would be a pancake).
- After Calibration: The AI's predictions became incredibly accurate. It could look at a new recipe it had never seen before (a different drug comparison) and say, "Based on how our kitchen handles other recipes, here is exactly how this new cake will turn out in our hospital."
They even tested it on a completely different type of recipe (Aspirin vs. Warfarin) that the AI hadn't practiced on, and it still got the prediction right. This proved the AI learned the "personality" of the hospital, not just the specific recipes.
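That "never seen before" test is essentially leave-one-trial-out validation. A hypothetical version, with invented effect numbers and the simplest possible calibrator (shift by the average gap), looks like this:

```python
# Hypothetical leave-one-trial-out check. The (RCT, local) effect pairs
# are made up for illustration; the paper's actual numbers will differ.
observed = {
    "trial_A": (0.79, 0.71), "trial_B": (0.88, 0.80),
    "trial_C": (0.70, 0.63), "trial_D": (0.85, 0.76),
    "trial_E": (0.91, 0.82),
}

def predict_local(rct_effect, train_gaps):
    # Simplest calibrator: shift the textbook effect by the mean gap.
    return rct_effect + sum(train_gaps) / len(train_gaps)

for held_out, (rct, local) in observed.items():
    gaps = [ehr - r for t, (r, ehr) in observed.items() if t != held_out]
    print(f"{held_out}: predicted {predict_local(rct, gaps):.2f}, observed {local:.2f}")
```

If the predictions track the held-out local results, as the paper reports for the unpracticed Aspirin-vs-Warfarin comparison, the model has learned the kitchen rather than memorized the recipes.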
Why This Matters to You
Think of this as a GPS for Medical Decisions.
- Without this: A doctor looks at a study and says, "This drug works 90% of the time." They give it to a patient.
- With this: The doctor looks at the study, then checks the "Hospital GPS." The GPS says, "In our specific hospital, with our specific patients and doctors, this drug actually works 85% of the time because of how we manage care."
The Takeaway:
This paper shows that when real-world results don't match clinical trials, it's not always a failure. It's a feature. By using AI to study why they differ, we can create a "local truth" that helps doctors make better, safer decisions for the specific patients in their own hospitals. It turns "disagreement" into "learning."