Imagine you are a chef trying to create the perfect recipe for a new dish. You want to make sure that when you serve this dish to thousands of people, it tastes great every single time. But here's the catch: How many times do you need to practice cooking this dish before you are confident it will work?
If you only cook it once or twice, you might get lucky, or you might burn it. You won't know if your recipe is truly good or just a fluke. This is exactly the problem doctors and researchers face when building clinical prediction models. These are "recipes" (mathematical formulas) that predict whether a patient will get sick, recover, or respond to a treatment, based on their symptoms and other characteristics.
This paper is about solving the mystery of "How much data do we need to cook up a reliable medical prediction?"
Here is the breakdown in simple terms:
1. The Problem: Guessing the Wrong Amount of Ingredients
For years, researchers used a simple rule of thumb, like saying, "You need 10 eggs for every cup of flour." In medicine, this was the "10 Events Per Variable" (EPV) rule: for every predictor you track, you need at least 10 patients who actually experienced the outcome (the "events"). So if you are tracking 10 symptoms, you need data containing at least 100 events.
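The rule of thumb above is just arithmetic, which is part of the criticism. A minimal sketch (the function name and the example numbers are illustrative, not from the paper): since only a fraction of patients experience the outcome, you divide the required events by the outcome prevalence to get a total sample size.

```python
import math

def epv_sample_size(n_predictors, outcome_prevalence, epv=10):
    """Minimum total sample size under the (now-criticized) 10-EPV rule.

    events needed = epv * n_predictors; total patients = events / prevalence.
    """
    events_needed = epv * n_predictors
    return math.ceil(events_needed / outcome_prevalence)

# 10 predictors, outcome seen in 20% of patients:
print(epv_sample_size(10, 0.20))  # 100 events -> 500 patients total
```

Note how nothing in this calculation depends on how complex the model is, which is exactly the paper's objection.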
The Problem: This rule is too simple. It's like saying "all cakes need the same amount of sugar." Some recipes are complex (like a multi-layered wedding cake), and some are simple (like a mug cake). If you use the simple rule for a complex machine-learning model (a very fancy, complex recipe), you might end up with a "flat" cake that tastes bad. The model might memorize the few patients you studied (overfitting) but fail miserably when it meets a new patient.
2. The Two Ways to Measure Success
The paper explains that researchers have been asking the wrong question. They usually ask: "On average, will this model work?"
The authors say we should ask a stricter question: "Can we guarantee that this model will work most of the time?"
- The "Average" Approach: Imagine flipping a coin 100 times. On average, you get 50 heads. But if you only flip it 10 times, you might get 8 heads or 2 heads. If you build a model based on that small sample, it might be a fluke.
- The "Assurance" Approach (The New Way): This is like saying, "I want to be 80% sure that if I flip this coin 100 times, I'll get close to 50 heads." The paper introduces a method to calculate the sample size needed to reach that 80% confidence level. It's not just about the average; it's about making sure the model is stable and reliable, not just lucky.
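The coin-flip intuition above can be checked directly by simulation. This sketch (names and tolerances are my own, not the paper's) estimates the "assurance": the fraction of repeated studies of a given size whose estimate lands close to the truth.

```python
import random

def assurance(n_flips, p=0.5, tol=0.05, sims=10_000, seed=0):
    """Fraction of simulated studies of size n_flips whose estimate
    of p lands within tol of the true value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        heads = sum(rng.random() < p for _ in range(n_flips))
        if abs(heads / n_flips - p) <= tol:
            hits += 1
    return hits / sims

print(f"assurance with n=10:  {assurance(10):.2f}")
print(f"assurance with n=400: {assurance(400):.2f}")
```

With 10 flips the estimate is only occasionally within 5% of the truth; with 400 flips it almost always is. The "assurance" question asks for the smallest sample size where that fraction clears your chosen bar (say, 80%).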
3. The Solution: The "pmsims" Simulator
The authors built a new tool called pmsims (which you can think of as a "Virtual Cooking Simulator").
Instead of just using a formula, this tool runs thousands of virtual experiments on a computer:
- It creates fake patients: It generates thousands of made-up medical records that look just like real ones.
- It tests different group sizes: It tries training the model on 100 fake patients, then 500, then 1,000, then 5,000.
- It draws a "Learning Curve": Imagine a graph where the X-axis is "How much data we have" and the Y-axis is "How good the model is." The tool draws this curve to see exactly where the line levels off.
- It finds the sweet spot: It tells you the exact number of patients you need to collect so that the model hits your target performance with high confidence.
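The four steps above can be sketched in a few lines. This is not the pmsims tool itself, just a toy version of the idea, with a made-up data-generating model: simulate patients, train at several sample sizes, and evaluate each model on a large held-out "population" to trace the learning curve.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_patients(n, n_features=5):
    """Generate synthetic 'patients' from a known outcome model."""
    X = rng.normal(size=(n, n_features))
    logits = X @ np.linspace(0.5, 1.5, n_features) - 1.0
    y = rng.random(n) < 1 / (1 + np.exp(-logits))
    return X, y.astype(int)

# A large held-out set stands in for "new patients in the real world".
X_test, y_test = make_patients(20_000)

results = {}
for n in (100, 500, 1000, 5000):
    X, y = make_patients(n)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    results[n] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"n={n:5d}  test AUC={results[n]:.3f}")
```

Plotting `results` gives the learning curve: performance climbs with sample size and then levels off, and the "sweet spot" is the smallest n that reliably clears your target.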
Why is this cool?
- It's flexible: It works for simple math models and super-complex AI models (like the ones used in self-driving cars).
- It's efficient: Instead of running millions of slow simulations, it uses a "Gaussian process" (think of it as a principled guesser that also knows how uncertain it is) to interpolate the learning curve from a handful of simulation runs and find the answer faster.
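To make the "smart guesser" idea concrete, here is a sketch of fitting a Gaussian process to a few simulated performance points and reading off where the curve clears a target. The numbers, kernel choice, and target are all invented for illustration; this is not pmsims's actual procedure.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical simulation results: model performance at a few sample sizes.
n_sim = np.array([100.0, 250.0, 500.0, 1000.0, 2500.0]).reshape(-1, 1)
perf = np.array([0.74, 0.79, 0.82, 0.84, 0.85])  # e.g. a C-statistic

# Model performance as a smooth function of log(sample size).
gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-4),
    normalize_y=True,
).fit(np.log(n_sim), perf)

# Predict the curve on a fine grid and find the smallest n whose
# lower-confidence bound clears the target.
grid = np.linspace(100, 10_000, 500).reshape(-1, 1)
mean, sd = gp.predict(np.log(grid), return_std=True)
target = 0.83
ok = grid[(mean - sd) >= target]
if ok.size:
    print(f"Smallest n clearing the target with margin: {int(ok[0, 0])}")
```

The payoff is that only five expensive simulation runs were needed; the Gaussian process fills in the rest of the curve, uncertainty included.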
4. What They Found (The Taste Test)
The authors tested their new simulator against old methods using three different "recipes" (case studies).
- The Result: The old methods gave wildly different answers. One said you needed 200 patients; another said 15,000!
- The Reality: The new pmsims tool gave answers that were in the middle but much more reliable. It showed that for complex AI models, you often need 5 to 10 times more data than the old simple rules suggested.
5. The Future: What's Missing?
The paper admits that while their tool is great, the real world is messy.
- Missing Ingredients: Real medical data often has missing pieces (like a patient forgetting to fill out a form). The tool needs to get better at handling that.
- Group Dynamics: Sometimes patients are related (like families) or seen over time. The tool needs to handle these complex connections better.
- Fairness: The tool needs to ensure the model works equally well for everyone, regardless of their background, to avoid bias.
The Bottom Line
This paper is a call to stop guessing how much data we need for medical AI. It provides a smart, flexible simulator that helps researchers figure out the exact amount of data required to build a model that is not just "okay on average," but reliable and trustworthy for real patients.
In short: Don't just bake a cake once and hope it works. Use a simulator to figure out exactly how much practice you need to guarantee a perfect cake every time.