From Study Design to Executable Code: Automating Target Trial Emulation with Large Language Models


This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a chef trying to recreate a famous dish from a food critic's review. The review says, "The soup was simmered for three hours with fresh basil and a pinch of salt."

In the world of medical research, this review is a study design. The "soup" is the analysis of patient data to see if a drug works. The problem is that translating that simple sentence into a working recipe (code) is incredibly hard. If one chef uses a gas stove and another uses an electric one, or if one measures salt in grams and the other in teaspoons, the final soup tastes different. In medical research, this means two teams studying the same drug might get different results just because they wrote their computer code differently.

This paper introduces THESEUS, a new tool that acts like a super-smart, robotic sous-chef to solve this problem.

Here is how it works, broken down into simple steps:

1. The Problem: The "Translation Gap"

Medical researchers often write their study plans in plain English (like the food critic's review). But to run the study on a computer, they need to translate that English into very specific, rigid code for Strategus (an R package in the OHDSI ecosystem).

The Old Way: A human researcher has to read the English plan, understand the complex math, and then manually type out hundreds of lines of code. This is slow, prone to typos, and hard to reproduce consistently across different hospitals.
The Goal: We want to type a sentence in English and quickly get a working, error-free computer program.

2. The Solution: The "Two-Step Robot" (THESEUS)

The researchers built a system called THESEUS that uses Large Language Models (LLMs)—the same AI technology behind chatbots—to do the translation. It works in two distinct phases:

Step A: The "Architect" (Standardization)

First, the AI reads the messy, free-text description of the study (e.g., "We watched patients for one year after they started the drug").

The Analogy: Imagine the AI is an architect reading a client's vague sketch. It doesn't just guess; it forces the sketch into a strict blueprint (a JSON file).
It asks: "Did you mean 365 days? Did you mean to start counting the day the drug was given?"
It fills out a standardized form where every field has a specific rule. This ensures that "one year" is always understood as "365 days" and "start date" is always in the same format.
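To make the "strict blueprint" idea concrete, here is a minimal Python sketch of a form where every field has a rule. It is an illustration only: the field names, the allowed anchors, and the `to_days` helper are hypothetical and are not the actual THESEUS or Strategus schema. In the real system an LLM fills in the form; this sketch only shows the kind of checking that keeps the form consistent.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical anchor vocabulary, assumed for this sketch only.
ALLOWED_ANCHORS = {"cohort start", "cohort end"}


@dataclass(frozen=True)
class TimeAtRisk:
    """One section of an illustrative blueprint; every field has a rule."""
    start_offset_days: int
    start_anchor: str
    end_offset_days: int
    end_anchor: str

    def __post_init__(self):
        for anchor in (self.start_anchor, self.end_anchor):
            if anchor not in ALLOWED_ANCHORS:
                raise ValueError(f"unknown anchor: {anchor!r}")
        for days in (self.start_offset_days, self.end_offset_days):
            if not isinstance(days, int) or days < 0:
                raise ValueError(f"offset must be a non-negative whole number of days: {days!r}")


def to_days(amount: int, unit: str) -> int:
    """Convert a duration to days with one fixed convention (1 year = 365 days)."""
    days_per_unit = {"day": 1, "week": 7, "year": 365}
    key = unit.lower().rstrip("s")  # accept "year" and "years"
    if key not in days_per_unit:
        raise ValueError(f"unsupported unit: {unit!r}")
    return amount * days_per_unit[key]


# "We watched patients for one year after they started the drug"
blueprint = TimeAtRisk(
    start_offset_days=0,
    start_anchor="cohort start",
    end_offset_days=to_days(1, "year"),
    end_anchor="cohort start",
)
blueprint_json = json.dumps(asdict(blueprint), indent=2)
print(blueprint_json)
```

Because the conversion rule lives in one place, "one year" becomes 365 days every time, and a value that breaks a rule (such as an unknown anchor) is rejected before any code is written.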

Step B: The "Builder" (Code Generation)

Once the blueprint is perfect, the AI switches roles to become a construction worker.

The Analogy: Now that the architect has the perfect blueprint, the builder doesn't need to guess where the walls go. They just follow the blueprint to build the house.
The AI takes the standardized blueprint and automatically writes the R code (the computer program) needed to run the study.
The Safety Net: The AI has a "self-check" feature. If the code it wrote has a typo or an error, the AI reads the error message, fixes the code, and tries again until the code runs.
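The self-check loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not the THESEUS implementation: `generate` and `run` are hypothetical stand-ins for the LLM call and the R execution step, the cap on attempts is an assumption added here so the loop always ends, and the "generated code" is a Python snippet rather than R so the example is self-contained.

```python
from typing import Callable, Optional, Tuple


def generate_with_self_check(
    generate: Callable[[Optional[str], Optional[str]], str],
    run: Callable[[str], Optional[str]],
    max_attempts: int = 3,  # assumed cap; the source does not state one
) -> Tuple[str, int]:
    """Draft code, run it, and feed any error message back for a repair.

    generate(previous_code, error) returns a draft or a repaired version.
    run(code) returns None on success or an error message on failure.
    """
    code: Optional[str] = None
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate(code, error)
        error = run(code)
        if error is None:
            return code, attempt
    raise RuntimeError(f"still failing after {max_attempts} attempts: {error}")


def fake_generate(previous_code: Optional[str], error: Optional[str]) -> str:
    """Stand-in for the LLM: the first draft has a typo, the repair closes it."""
    if error is None:
        return "result = (1 + 2"
    return previous_code + ")"


def fake_run(code: str) -> Optional[str]:
    """Stand-in for running the script: only checks that the code parses."""
    try:
        compile(code, "<generated>", "exec")
    except SyntaxError as exc:
        return str(exc)
    return None


code, attempts = generate_with_self_check(fake_generate, fake_run)
print(attempts, code)
```

The key design point is that the error message is passed back into the next generation step, so each retry is a targeted repair and not a fresh guess.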

3. The "Human-in-the-Loop" (The Taste Test)

The researchers didn't just let the robot run wild. They built a Graphical User Interface (GUI)—a visual dashboard that looks like a control panel.

The Analogy: Think of it like a video game character creator. The AI suggests the settings based on your text, but you can look at the screen, see the "Time at Risk" or "Drug Match" settings, and say, "Actually, change that to 2 years," or "No, keep it as is."
This ensures a human expert always has the final say before the code is generated.
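In code terms, the review step amounts to merging a reviewer's corrections over the AI's suggestions before anything is generated. The sketch below is an assumption-laden illustration: the setting names and the `apply_review` helper are hypothetical and do not describe the actual THESEUS dashboard.

```python
def apply_review(suggested: dict, overrides: dict) -> dict:
    """Return the final settings: AI suggestions with reviewer overrides applied.

    Only settings that already exist in the form may be changed, so a
    mistyped field name is caught instead of being silently added.
    """
    unknown = set(overrides) - set(suggested)
    if unknown:
        raise KeyError(f"not a known setting: {sorted(unknown)}")
    return {**suggested, **overrides}


# Settings suggested by the AI from the study text (illustrative values).
suggested = {"time_at_risk_days": 365, "matching_ratio": 1}

# The reviewer says: "Actually, change that to 2 years."
final = apply_review(suggested, {"time_at_risk_days": 2 * 365})
print(final)
```

The original suggestions are left unchanged, which makes it possible to show the reviewer exactly what they altered.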

4. The Results: Does the Robot Cook Good Soup?

The team tested this system on 15 real medical studies and 5 studies from outside their network.

Accuracy: The AI was about 90-98% accurate at turning English text into the correct "blueprint" for studies already using their system.
Code Success: When the AI wrote the code, it worked on the first try about 80-100% of the time. If it failed, the "self-check" feature fixed it, bringing the success rate to nearly 100%.
Generalization: It even worked well on studies that didn't originally use their system, proving it can translate ideas from different "kitchens."
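This summary does not say how blueprint accuracy was scored. One common approach, assumed here purely for illustration, is to compare each field of the AI's blueprint against a human-written reference and report the share that match exactly. The field names and values below are toy examples, not data from the study.

```python
def field_accuracy(predicted: dict, reference: dict) -> float:
    """Share of reference fields that the prediction reproduces exactly."""
    if not reference:
        raise ValueError("reference must not be empty")
    matches = sum(
        1 for key, value in reference.items()
        if key in predicted and predicted[key] == value
    )
    return matches / len(reference)


# Toy blueprints: the prediction gets three of four fields right.
reference = {
    "time_at_risk_days": 365,
    "start_anchor": "cohort start",
    "matching_ratio": 1,
    "washout_days": 365,
}
predicted = {
    "time_at_risk_days": 365,
    "start_anchor": "cohort start",
    "matching_ratio": 1,
    "washout_days": 180,
}

score = field_accuracy(predicted, reference)
print(score)  # 0.75
```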

Why This Matters

Before this, only experts who knew both medical statistics and advanced computer programming could easily run these studies.

The Impact: THESEUS lowers the barrier to entry. It allows more researchers (and potentially more diverse teams) to run high-quality, reproducible studies just by describing what they want to do in plain English.
The Future: It turns the chaotic, messy process of writing research code into a standardized, reliable assembly line.

In short: THESEUS is a translator that turns your "I want to study this drug" into a "Here is a working computer program to study that drug," so that research teams can cook from the same recipe, no matter which kitchen they are in.

