Early Risk Stratification of Dosing Errors in Clinical Trials Using Machine Learning

This study presents a reproducible machine learning framework that integrates structured trial data and unstructured protocol text to enable early, calibrated risk stratification of clinical trials for potential dosing errors prior to trial initiation.

Félicien Hêche, Sohrab Ferdowsi, Anthony Yazdani, Sara Sansaloni-Pastor, Douglas Teodoro

Published 2026-02-27

Imagine you are the captain of a massive fleet of ships (clinical trials) about to set sail to discover new medicines. Your goal is to get everyone to the destination safely. But there's a dangerous storm ahead: dosing errors. These are like giving the crew the wrong amount of food or medicine, which can make them sick or ruin the whole voyage.

In the past, you only found out about these mistakes after the ship had already sailed and people got hurt. This paper introduces a Crystal Ball powered by Artificial Intelligence that lets you see the storm before you even leave the harbor.

Here is how the researchers built this crystal ball, explained in simple terms:

1. The Data: Reading the Ship's Blueprints

The researchers didn't look at the actual ships (the patients) because the ships hadn't left yet. Instead, they looked at the blueprints and logs (the trial protocols) available on a public website called ClinicalTrials.gov.

They gathered information on 42,000+ past voyages (completed trials). They looked at two types of clues:

  • The Hard Numbers: How many people are on board? Is the ship a small boat or a giant liner? Is it a new type of engine (drug)?
  • The Captain's Notes: The free-text descriptions written by the scientists explaining how the trip will work.
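The two kinds of clues can be sketched in code. This is a minimal illustration of splitting one trial record into structured features and free text; the field names are made up for the example, not the study's actual ClinicalTrials.gov schema.

```python
# Illustrative only: field names are hypothetical, not the study's schema.
def split_clues(trial: dict) -> tuple[dict, str]:
    """Separate the Hard Numbers from the Captain's Notes."""
    hard_numbers = {
        "enrollment": trial["enrollment"],        # how many people on board
        "phase": trial["phase"],                  # small boat or giant liner
        "num_arms": trial["num_arms"],
        "is_novel_drug": trial["is_novel_drug"],  # new type of engine
    }
    captains_notes = " ".join(
        [trial["brief_summary"], trial["detailed_description"]]
    )
    return hard_numbers, captains_notes

example = {
    "enrollment": 240,
    "phase": 2,
    "num_arms": 3,
    "is_novel_drug": True,
    "brief_summary": "A dose-escalation study of drug X.",
    "detailed_description": "Participants receive 10 mg daily.",
}
features, text = split_clues(example)
print(features["enrollment"])  # 240
```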

2. The Training: Teaching the AI to Spot Trouble

They taught a computer (Machine Learning) to look at these blueprints and say, "This ship looks risky," or "This ship looks safe."

To do this, they had to teach the computer what a "dosing error" actually looks like. They used a special medical dictionary (MedDRA) to find reports of "overdoses" or "wrong medicine" in the past logs. They found that about 4.6% of the past voyages had these errors.
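The labelling step boils down to keyword-style matching against a controlled vocabulary. Here is a toy sketch: the term list is a tiny illustrative subset in the spirit of MedDRA preferred terms, not the study's actual query.

```python
# Hypothetical mini-vocabulary; the real study uses MedDRA terminology.
DOSING_ERROR_TERMS = {
    "overdose",
    "accidental overdose",
    "drug administration error",
    "wrong drug administered",
}

def has_dosing_error(reported_events: list[str]) -> bool:
    """Flag a past trial if any reported adverse event is a dosing error."""
    return any(event.lower() in DOSING_ERROR_TERMS for event in reported_events)

print(has_dosing_error(["Headache", "Accidental overdose"]))  # True
print(has_dosing_error(["Nausea", "Fatigue"]))                # False
```

Applied across 42,000+ past trials, a check like this yields the positive labels (about 4.6% of voyages) the AI learns from.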

They trained three different types of AI detectives:

  • Detective A (XGBoost): Only looked at the hard numbers (the stats).
  • Detective B (ClinicalModernBERT): Only read the Captain's Notes (the text).
  • Detective C (The Late-Fusion Team): A super-team that combined the verdicts (risk scores) of Detective A and Detective B.

The Result: The Super-Team (Detective C) was the best at spotting trouble, with an accuracy score of 86%.
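"Late fusion" just means each detective files its own risk score and the team blends them at the end. A weighted average, as sketched below, is one simple blending recipe; the paper's exact fusion rule may differ.

```python
def late_fusion(p_tabular: float, p_text: float, w: float = 0.5) -> float:
    """Blend the hard-numbers score and the captain's-notes score.

    w is the weight given to the tabular (hard-numbers) detective.
    """
    return w * p_tabular + (1.0 - w) * p_text

# Detective A says 0.30, Detective B says 0.10 -> the team settles on 0.20
print(late_fusion(0.30, 0.10))
```

The appeal of late fusion is that each detective can be trained and tuned on its own kind of clue before the scores are merged.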

3. The Calibration: Turning "Maybe" into "Risk Levels"

Here is the most important part. AI is often bad at giving exact percentages. It might say, "There's an 85% chance of rain," when it's actually only 50%. If you trust that blindly, you might bring an umbrella when you don't need one, or forget it when you do.

The researchers added a "Calibration Filter" (like a translator). This filter took the AI's vague guesses and turned them into reliable risk categories:

  • Low Risk: "The weather looks clear." (Less than 2% chance of error).
  • Moderate Risk: "Clouds are gathering." (2% to 5% chance).
  • High Risk: "Storm clouds are heavy." (5% to 10% chance).
  • Very High Risk: "Hurricane warning!" (Over 10% chance).
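Once the Calibration Filter has turned a raw score into a trustworthy probability, sorting it into the four bands above is a straightforward lookup. The thresholds below come from the article; the helper function itself is illustrative, not the study's code.

```python
def risk_band(p: float) -> str:
    """Map a calibrated dosing-error probability onto a risk band."""
    if p < 0.02:
        return "Low Risk"        # the weather looks clear
    if p < 0.05:
        return "Moderate Risk"   # clouds are gathering
    if p < 0.10:
        return "High Risk"       # storm clouds are heavy
    return "Very High Risk"      # hurricane warning

print(risk_band(0.012))  # Low Risk
print(risk_band(0.14))   # Very High Risk
```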

They tested this, and the bands held up: trials flagged "Very High Risk" really did have a much higher rate of real errors than those flagged "Low Risk". The AI wasn't just guessing; its stated danger levels matched reality.

4. Why This Matters: The "Pre-Flight Check"

Why do we care? Because in the past, if a trial was going to have dosing errors, we often didn't know until people got hurt or the trial failed. This wastes billions of dollars and, more importantly, endangers lives.

This new system acts like a Pre-Flight Safety Check for clinical trials.

  • Before the trial starts: The AI looks at the plan.
  • The Verdict: "Hey, this plan has a 'Very High Risk' of dosing errors."
  • The Action: The scientists can go back, fix the plan, double-check the dosing instructions, or add extra safety guards before a single patient is enrolled.

The Big Takeaway

This paper shows that mistakes in medicine trials aren't just random bad luck. Often, the risk is baked into the design of the trial itself. By using AI to read the plans early, we can fix the blueprint before we build the building.

It's like realizing that a bridge design has a flaw in the math before you pour the concrete. You save money, time, and lives by catching the error when it's just a drawing on a piece of paper.

In short: They built a smart tool that reads the fine print of medical trials to predict which ones are likely to mess up the medicine dosage, allowing doctors to fix the problems before anyone gets hurt.
