From Protocol to Analysis Plan: Development and Validation of a Large Language Model Pipeline for Statistical Analysis Plan Generation using Artificial Intelligence (SAPAI)

This study developed and validated a Large Language Model pipeline that successfully drafts high-quality Statistical Analysis Plans from clinical trial protocols, demonstrating strong performance on descriptive content while highlighting the continued necessity for human oversight in complex statistical reasoning tasks.

Jafari, H., Chu, P., Lange, M., Maher, F., Glen, C., Pearson, O. J., Burges, C., Martyn, M., Cross, S., Carter, B., Emsley, R., Forbes, G.

Published 2026-03-19

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a master chef preparing a high-stakes banquet for a panel of exacting food critics (the scientific community). Before you even start cooking, you must write a Statistical Analysis Plan (SAP). This isn't just a recipe; it's a binding commitment, declared in advance, that says exactly how you will measure the taste, texture, and safety of every dish. If you change the plan halfway through because the food tastes "okay" but not "great," the critics will reject your entire meal.

Writing these plans is hard, time-consuming, and requires a brilliant chef (a statistician) to translate a vague idea ("We want to see if this new drug helps") into a rigid, mathematical blueprint.

This paper is about testing a new kitchen assistant: Artificial Intelligence (AI). Specifically, the researchers asked: Can an AI write this complex blueprint for us, saving the human chef time, without making dangerous mistakes?

Here is the story of their experiment, broken down simply:

1. The Setup: Teaching the AI to "Think" Like a Statistician

The researchers didn't just ask the AI, "Write a plan." That's like asking a robot to "cook dinner" without giving it a recipe; it would probably just burn toast.

Instead, they built a structured pipeline. Think of this as giving the AI a very specific, step-by-step instruction manual.

  • The "Vanilla" Attempt: First, they tried just giving the AI the trial protocol (the idea) and asking for a plan. The AI got confused, mixed up sections, and missed crucial details. It was like a student who read the assignment but didn't understand the rubric.
  • The "Structured" Approach: They then broke the task down. They told the AI: "First, write only the section about who can eat the food. Next, write only the section about how we measure the temperature." They gave it strict rules: "Don't invent facts," "Stick to the protocol," and "If a section doesn't apply, say 'Not planned'."
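In code, the "structured" approach might look something like the sketch below. This is a hypothetical illustration only: the section names, the exact wording of the rules, and the `build_prompts` helper are made up for this explainer, not taken from the authors' actual pipeline. The idea it shows is the one in the paper: one focused prompt per SAP section, each carrying the same strict guardrails.

```python
# Hypothetical sketch of section-by-section prompting.
# Section names, rule wording, and build_prompts are illustrative,
# not the authors' actual pipeline.

SECTIONS = [
    "Trial design",
    "Eligibility",
    "Outcomes",
    "Sample size",
    "Analysis methods",
]

RULES = (
    "Do not invent facts. "
    "Use only information stated in the protocol. "
    "If a section does not apply, write 'Not planned'."
)

def build_prompts(protocol_text):
    """Build one focused prompt per SAP section, each repeating the rules."""
    prompts = []
    for section in SECTIONS:
        prompts.append(
            f"{RULES}\n\n"
            f"Protocol:\n{protocol_text}\n\n"
            f"Write ONLY the '{section}' section of the "
            f"Statistical Analysis Plan."
        )
    return prompts

prompts = build_prompts("...protocol text...")
print(len(prompts))  # one prompt per section
```

Breaking the job into small, rule-bound prompts is what turned the confused "vanilla" student into a reliable worker: each call has one narrow task and the same explicit guardrails.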

2. The Test: The AI vs. The Human Experts

The team took 9 real-world clinical trials (like testing a new therapy for depression or a device for ADHD). They asked three different top-tier AI models (GPT-5, Claude Sonnet, and Gemini) to write the analysis plans for these trials using their new structured instructions.

This resulted in 27 draft plans.

Then, they brought in human experts (experienced statisticians) to grade these drafts. They used a strict checklist with 46 items, scoring each one from 0 to 3:

  • 0: The AI missed it completely.
  • 1: It mentioned it, but got the details wrong (a dangerous hallucination).
  • 2: It was mostly right, but had small errors.
  • 3: Perfect. Accurate, clear, and ready to use.
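The arithmetic of the grading exercise can be sketched like this. Only the structure comes from the paper (3 models × 9 trials = 27 drafts, each graded on 46 checklist items from 0 to 3); the random scores and the `percent_score` helper are stand-ins to show how a draft's grades roll up into a percentage like the ones reported below.

```python
# Illustrative scoring sketch: 27 drafts, each graded on a 46-item
# checklist from 0 (missed) to 3 (perfect). The random scores are
# made-up stand-ins; only the rubric structure comes from the paper.
import random

random.seed(0)
N_ITEMS = 46  # checklist items per draft
MAX_SCORE = 3  # top mark per item

def grade_draft():
    """A reviewer's 0-3 score for each checklist item (random stand-in)."""
    return [random.randint(0, MAX_SCORE) for _ in range(N_ITEMS)]

def percent_score(scores):
    """Express a draft's total as a percentage of the maximum possible."""
    return 100 * sum(scores) / (MAX_SCORE * N_ITEMS)

# 3 AI models x 9 trials = 27 graded drafts
drafts = [grade_draft() for _ in range(3 * 9)]
print(len(drafts), round(percent_score(drafts[0]), 1))
```

Scoring every item the same way across all 27 drafts is what lets the authors compare models and section types on a common percentage scale.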

3. The Results: The "Good Cop, Bad Cop" Dynamic

The results were a mix of "Wow!" and "Whoa, hold on."

The Good News (The "Good Cop"):
The AI was fantastic at the boring, administrative stuff. It could perfectly summarize the trial design, list the dates, and describe the participants. It was like a super-efficient secretary who never misses a detail on a form.

  • Accuracy: About 81–83% for these descriptive parts.
  • Conclusion: All three AI models performed equally well here. It doesn't matter which brand you use; they are all good at copying and organizing facts.

The Bad News (The "Bad Cop"):
When the AI had to do actual math and logic, it stumbled.

  • Accuracy: Dropped to 67–72% for complex statistical reasoning.
  • The Problem: The AI would sometimes suggest a math method that sounded professional but was actually wrong for the specific trial.
    • Analogy: Imagine the AI suggesting you measure the soup's temperature with a ruler instead of a thermometer. It is a real measuring tool; it's just the wrong tool for the job.
    • It might invent a "sensitivity analysis" (a backup plan) that the researchers never intended, or miss a crucial variable that makes the data less precise.

4. The Verdict: The AI is a Draftsman, Not the Architect

The paper concludes with a very important metaphor:

"LLMs are powerful accelerators for the trial statistician, but they remain, for now, the draftsman rather than the architect."

  • The Architect (The Human): You need a human to design the building, decide where the load-bearing walls go, and ensure the structure won't collapse. This requires deep understanding and judgment.
  • The Draftsman (The AI): The AI is excellent at drawing the blueprints, labeling the doors, and formatting the text. It saves the Architect hours of tedious work.

Why This Matters

If researchers start using AI to write these plans without a human checking the math, they risk publishing flawed science. The AI might write a plan that looks perfect on paper but leads to incorrect conclusions about whether a drug works.

The Takeaway:
AI is ready to be your super-fast assistant for writing the boring parts of clinical trial plans. It can save you days of work. But you cannot let it work alone. A human expert must always review the final document, especially the math parts, to ensure the "building" doesn't collapse.

The future isn't "AI replaces statisticians"; it's "Statisticians using AI to do their best work faster."
