From Protocol to Analysis Plan: Development and Validation of a Large Language Model Pipeline for Statistical Analysis Plan Generation using Artificial Intelligence (SAPAI)

This study developed and validated a Large Language Model pipeline that successfully drafts high-quality Statistical Analysis Plans from clinical trial protocols, demonstrating strong performance on descriptive content while highlighting the continued necessity for human oversight in complex statistical reasoning tasks.

Jafari, H., Chu, P., Lange, M., Maher, F., Glen, C., Pearson, O. J., Burges, C., Martyn, M., Cross, S., Carter, B., Emsley, R., Forbes, G.

Published 2026-03-19

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a master chef preparing a high-stakes banquet for a panel of exacting food critics (the scientific community). Before you even start cooking, you must write a Statistical Analysis Plan (SAP). This isn't just a recipe; it's a binding commitment, declared in advance, that says exactly how you will measure the taste, texture, and safety of every dish. If you change the plan halfway through because the food tastes "okay" but not "great," the critics will reject your entire meal.

Writing these plans is hard, time-consuming, and requires a brilliant chef (a statistician) to translate a vague idea ("We want to see if this new drug helps") into a rigid, mathematical blueprint.

This paper is about testing a new kitchen assistant: Artificial Intelligence (AI). Specifically, the researchers asked: Can an AI write this complex blueprint for us, saving the human chef time, without making dangerous mistakes?

Here is the story of their experiment, broken down simply:

1. The Setup: Teaching the AI to "Think" Like a Statistician

The researchers didn't just ask the AI, "Write a plan." That's like asking a robot to "cook dinner" without giving it a recipe; it would probably just burn toast.

Instead, they built a structured pipeline. Think of this as giving the AI a very specific, step-by-step instruction manual.

  • The "Vanilla" Attempt: First, they tried just giving the AI the trial protocol (the idea) and asking for a plan. The AI got confused, mixed up sections, and missed crucial details. It was like a student who read the assignment but didn't understand the rubric.
  • The "Structured" Approach: They then broke the task down. They told the AI: "First, write only the section about who can eat the food. Next, write only the section about how we measure the temperature." They gave it strict rules: "Don't invent facts," "Stick to the protocol," and "If a section doesn't apply, say 'Not planned'."
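In code, the "structured" approach might look something like the sketch below. This is a hypothetical illustration only: the section names, the exact wording of the rules, and the `build_prompts` helper are made up for this explainer, not taken from the authors' actual pipeline. The idea it shows is the one in the paper: one focused prompt per SAP section, each carrying the same strict guardrails.

```python
# Hypothetical sketch of section-by-section prompting.
# Section names, rule wording, and build_prompts are illustrative,
# not the authors' actual pipeline.

SECTIONS = [
    "Trial design",
    "Eligibility",
    "Outcomes",
    "Sample size",
    "Analysis methods",
]

RULES = (
    "Do not invent facts. "
    "Use only information stated in the protocol. "
    "If a section does not apply, write 'Not planned'."
)

def build_prompts(protocol_text):
    """Build one focused prompt per SAP section, each repeating the rules."""
    prompts = []
    for section in SECTIONS:
        prompts.append(
            f"{RULES}\n\n"
            f"Protocol:\n{protocol_text}\n\n"
            f"Write ONLY the '{section}' section of the "
            f"Statistical Analysis Plan."
        )
    return prompts

prompts = build_prompts("...protocol text...")
print(len(prompts))  # one prompt per section
```

Breaking the job into small, rule-bound prompts is what turned the confused "vanilla" student into a reliable worker: each call has one narrow task and the same explicit guardrails.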

2. The Test: The AI vs. The Human Experts

The team took 9 real-world clinical trials (like testing a new therapy for depression or a device for ADHD). They asked three different top-tier AI models (GPT-5, Claude Sonnet, and Gemini) to write the analysis plans for these trials using their new structured instructions.

This resulted in 27 draft plans.

Then, they brought in human experts (experienced statisticians) to grade these drafts. They used a strict checklist with 46 items, scoring each one from 0 to 3:

  • 0: The AI missed it completely.
  • 1: It mentioned it, but got the details wrong (a dangerous hallucination).
  • 2: It was mostly right, but had small errors.
  • 3: Perfect. Accurate, clear, and ready to use.
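The arithmetic of the grading exercise can be sketched like this. Only the structure comes from the paper (3 models × 9 trials = 27 drafts, each graded on 46 checklist items from 0 to 3); the random scores and the `percent_score` helper are stand-ins to show how a draft's grades roll up into a percentage like the ones reported below.

```python
# Illustrative scoring sketch: 27 drafts, each graded on a 46-item
# checklist from 0 (missed) to 3 (perfect). The random scores are
# made-up stand-ins; only the rubric structure comes from the paper.
import random

random.seed(0)
N_ITEMS = 46  # checklist items per draft
MAX_SCORE = 3  # top mark per item

def grade_draft():
    """A reviewer's 0-3 score for each checklist item (random stand-in)."""
    return [random.randint(0, MAX_SCORE) for _ in range(N_ITEMS)]

def percent_score(scores):
    """Express a draft's total as a percentage of the maximum possible."""
    return 100 * sum(scores) / (MAX_SCORE * N_ITEMS)

# 3 AI models x 9 trials = 27 graded drafts
drafts = [grade_draft() for _ in range(3 * 9)]
print(len(drafts), round(percent_score(drafts[0]), 1))
```

Scoring every item the same way across all 27 drafts is what lets the authors compare models and section types on a common percentage scale.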

3. The Results: The "Good Cop, Bad Cop" Dynamic

The results were a mix of "Wow!" and "Whoa, hold on."

The Good News (The "Good Cop"):
The AI was fantastic at the boring, administrative stuff. It could perfectly summarize the trial design, list the dates, and describe the participants. It was like a super-efficient secretary who never misses a detail on a form.

  • Accuracy: About 81–83% for these descriptive parts.
  • Conclusion: All three AI models performed equally well here. It doesn't matter which brand you use; they are all good at copying and organizing facts.

The Bad News (The "Bad Cop"):
When the AI had to do actual math and logic, it stumbled.

  • Accuracy: Dropped to 67–72% for complex statistical reasoning.
  • The Problem: The AI would sometimes suggest a math method that sounded professional but was actually wrong for the specific trial.
    • Analogy: Imagine the AI suggesting you measure the soup's temperature with a ruler instead of a thermometer. It is a real measuring tool; it's just the wrong tool for the job.
    • It might invent a "sensitivity analysis" (a backup plan) that the researchers never intended, or miss a crucial variable that makes the data less precise.

4. The Verdict: The AI is a Draftsman, Not the Architect

The paper concludes with a very important metaphor:

"LLMs are powerful accelerators for the trial statistician, but they remain, for now, the draftsman rather than the architect."

  • The Architect (The Human): You need a human to design the building, decide where the load-bearing walls go, and ensure the structure won't collapse. This requires deep understanding and judgment.
  • The Draftsman (The AI): The AI is excellent at drawing the blueprints, labeling the doors, and formatting the text. It saves the Architect hours of tedious work.

Why This Matters

If researchers start using AI to write these plans without a human checking the math, they risk publishing flawed science. The AI might write a plan that looks perfect on paper but leads to incorrect conclusions about whether a drug works.

The Takeaway:
AI is ready to be your super-fast assistant for writing the boring parts of clinical trial plans. It can save you days of work. But you cannot let it work alone. A human expert must always review the final document, especially the math parts, to ensure the "building" doesn't collapse.

The future isn't "AI replaces statisticians"; it's "Statisticians using AI to do their best work faster."
