Design of Bayesian Clinical Trials with Clustered Data

Imagine you are a chef planning a massive banquet for a new community. You want to test a new recipe (a new medicine) against the old standard recipe to see if the new one is just as safe.

But here's the catch: You aren't cooking for individuals one by one. You are cooking for groups (like families or households). If one family member gets sick from the food, the whole family might react similarly because they share the same kitchen, the same ingredients, and the same environment. In statistics, we call this clustered data.

This paper, written by Luke Hagar and Shirin Golchi, tackles a very expensive and time-consuming problem: How do you figure out how many groups (clusters) you need to invite to your banquet to be sure your new recipe works, without actually cooking the whole meal a thousand times?

Here is the breakdown using simple analogies:

1. The Problem: The "Taste-Test" Bottleneck

In the past, to design a clinical trial, statisticians had to play a game of "What If?" thousands of times on a computer.

The Old Way: They would simulate the trial with 100 groups, then 101, then 102, all the way up to 200. For each number, they had to run a complex computer simulation (like a high-end video game) to see if the new medicine looked safe.
The Pain: This is like trying to find the perfect amount of salt in a soup by cooking the entire pot from scratch, tasting it, throwing it away, and starting over again for every single pinch of salt you want to test. It takes forever and costs a fortune in computing power.

2. The Solution: The "Two-Point" Shortcut

The authors discovered a mathematical magic trick. They realized that if you look at the results of your "taste tests" (the computer simulations) at just two specific group sizes, you can draw a straight line through them and predict the results for any other group size in between.

The Analogy: Imagine you are trying to guess how tall a tree will be in 10 years.
- The Old Way: You measure the tree every single day for 10 years.
- The New Way: You measure the tree today (at 100 groups) and measure it again in 4 years (at 140 groups). Because trees grow in a predictable, steady way, you can draw a straight line between those two points and accurately guess how tall it will be at 115 groups, 120 groups, or 130 groups. You don't need to wait or measure the middle points.

3. The "Magic Line" (The Theory)

The paper proves mathematically that for these types of group-based trials, the relationship between the number of groups and the "safety score" of the medicine is almost a straight line.

The Logit Line: They use a specific mathematical transformation (called a "logit") that turns the messy, curved probability numbers into an approximately straight line.
The Result: Instead of running a full batch of simulations at every candidate group size, they only need to run simulations at two sizes (e.g., 100 groups and 140 groups). They draw the line and get a close estimate of the answer for every number in between.
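
As a minimal sketch of the two-point idea, the Python snippet below interpolates a single probability between two group sizes on the logit scale. The function names and the example probabilities (0.95 at 100 groups, 0.98 at 140 groups) are made up for illustration and are not taken from the paper.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the whole real line."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Map a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def interpolate_probability(c0, p0, c1, p1, c):
    """Draw a straight line between the two logits and read it off at c."""
    slope = (logit(p1) - logit(p0)) / (c1 - c0)
    return inv_logit(logit(p0) + slope * (c - c0))

# Hypothetical inputs: one probability simulated at 100 groups and at 140 groups.
estimate = interpolate_probability(100, 0.95, 140, 0.98, 126)
print(round(estimate, 4))
```

The line is drawn between the logits rather than the raw probabilities because probabilities flatten out as they approach 1, while their logits keep growing at a steadier rate.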

4. Why This Matters (The "Banquet" Outcome)

In the real world, this saves a massive amount of time and money.

Speed: In their example, calculating the results for a whole range of group sizes used to take 35 minutes of heavy computer time. With their new method, it took only 8 minutes.
Accuracy: They showed that their "straight line" guess was almost identical to the "cook everything from scratch" method.
Confidence: They also built a "safety net" (called a bootstrap confidence interval) to tell the researchers, "We are 95% sure the right number of groups is between 114 and 116."

5. The Real-World Example: Tuberculosis

The authors tested this on a real-world scenario involving a trial for Tuberculosis (TB) prevention.

The Setup: Families (clusters) were given either a new TB drug or the old one.
The Goal: To show the new drug was "non-inferior", meaning not worse than the old one by more than a pre-specified margin.
The Challenge: Because families share environments, their health outcomes are linked. This makes the math very hard.
The Win: Using their shortcut, they quickly determined that they needed about 115 to 129 families (depending on how much the families influenced each other) to be confident in the results. Without this method, designing this trial would have been a computational nightmare.

Summary

This paper is like giving a chef a predictive ruler. Instead of tasting the soup 100 times to find the perfect recipe, the chef tastes it twice, draws a line, and gets a close estimate of how much salt to add for any crowd size in between. It makes designing medical trials faster and cheaper, so that new medicines can be evaluated with less wasted computing effort.

1. Problem Statement

The design of Bayesian clinical trials requires the assessment of operating characteristics (e.g., statistical power and Type I error rates) to satisfy regulatory requirements (e.g., FDA guidelines).

The Bottleneck: Standard practice involves Monte Carlo simulation to estimate the sampling distribution of posterior summaries. For each design configuration (specifically, different sample sizes), thousands of trial repetitions must be simulated.
The Challenge with Clustered Data: In trials with clustered data (e.g., cluster-randomized trials or longitudinal studies), the analysis often requires complex, high-dimensional models with random effects. Furthermore, when marginal estimands (population-average effects) are required, additional computational steps (like Bayesian G-computation) are needed to marginalize over random effects and covariates.
Consequence: Repeating these computationally intensive simulations across a wide range of cluster counts ( $c$ ) to find the optimal sample size is prohibitively expensive and time-consuming. Existing efficient methods (e.g., Hagar & Stevens, 2025) were limited to independent observations and could not handle the dependencies inherent in clustered data.

2. Methodology

The authors propose a novel, computationally efficient framework to determine the required number of clusters ( $c$ ) for Bayesian trials with clustered data. The core innovation relies on proving that the logits of posterior probabilities behave as linear functions of the number of clusters in large samples.

A. Theoretical Foundation

The method is grounded in the Bernstein-von Mises (BvM) theorem, which establishes the asymptotic normality of the posterior distribution.

Proxy Distribution: The authors define a proxy to the sampling distribution of posterior probabilities ( $\tau(D_c)$ ). They show that under regularity conditions, the posterior distribution of the estimand $\delta(\theta)$ approximates a normal distribution $N(\hat{\delta}^{(c)}, c^{-1}\Lambda(\theta))$ .
Linearity of Logits: They prove Theorem 1, which states that for a sufficiently large number of clusters $c$, the logit of the posterior probability that the alternative hypothesis is true ( $\text{logit}(\tau^{(c)}_r)$ ) is an approximately linear function of $c$.
- Mathematically: $\lim_{c \to \infty} \frac{d}{dc} \text{logit}(\tau^{(c)}_r) = \text{constant}$ .
Implication: This linearity implies that the quantiles of the sampling distribution of the logit-transformed posterior probabilities change linearly with the number of clusters.
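
A quick numerical check can make the linearity claim more concrete. The sketch below assumes the simplest possible case, which is a simplification of the paper's setting: the point estimate is held fixed, so the normal approximation gives a posterior probability of $\Phi(z\sqrt{c})$, where $z$ is a hypothetical standardized effect per cluster (not a quantity from the paper). For large $x$, the normal tail approximation gives $\text{logit}(\Phi(x)) \approx x^2/2 + \log x + \tfrac{1}{2}\log(2\pi)$, so the slope in $c$ should settle near $z^2/2$.

```python
import math

def logit_normal_cdf(x):
    """logit(Phi(x)), computed through erfc so the upper tail does not round to 1."""
    if x < 0:
        return -logit_normal_cdf(-x)                    # logit(Phi(-x)) = -logit(Phi(x))
    upper_tail = 0.5 * math.erfc(x / math.sqrt(2.0))    # 1 - Phi(x)
    return math.log1p(-upper_tail) - math.log(upper_tail)

def logit_slope(c, z):
    """Change in logit(Phi(z * sqrt(c))) when one more cluster is added."""
    return logit_normal_cdf(z * math.sqrt(c + 1)) - logit_normal_cdf(z * math.sqrt(c))

z = 0.3  # hypothetical standardized effect per cluster (assumption for this sketch)
for c in (100, 400, 1600):
    print(c, round(logit_slope(c, z), 4))
print("limit z^2 / 2 =", z * z / 2)
```

The printed slopes should shrink toward the limiting value as $c$ grows, which is the behaviour the limit statement above describes.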

B. The Sample Size Determination (SSD) Algorithm

Based on the theoretical linearity, the authors propose Algorithm 1, which drastically reduces the simulation burden:

Select Two Points: Choose two cluster counts, $c_0$ (e.g., based on budget) and $c_1$ (chosen to bracket the target power).
Simulate: Perform full Monte Carlo simulations (generating data, fitting models, calculating posterior probabilities) only at $c_0$ and $c_1$ for both the null ( $\Psi_0$ ) and alternative ( $\Psi_1$ ) hypothesis scenarios.
Linear Interpolation:
- Calculate the logit of the posterior probabilities for each simulation repetition at $c_0$ and $c_1$ .
- Construct linear approximations (lines) connecting the order statistics of these logits between $c_0$ and $c_1$ .
- Extrapolate/interpolate these lines to estimate the logits (and thus the probabilities) for any other cluster count $c$ .
Determine Optimal $c$ : Find the smallest integer $c$ where the estimated power meets the target (e.g., $1-\beta$ ) while maintaining the Type I error rate below $\alpha$.
Uncertainty Quantification: Use bootstrap resampling of the simulation results at $c_0$ and $c_1$ to construct confidence intervals for the recommended cluster count, quantifying the variability introduced by the simulation process.
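
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation. The "simulation" is a toy normal approximation standing in for data generation and model fitting. The decision threshold `gamma` is fixed in advance. The bootstrap resamples each stored set of logits independently. Every numeric value (margin, effect sizes, `gamma`, `alpha`, target power, repetitions) is hypothetical.

```python
import math
import random

def logit_normal_cdf(x):
    """logit(Phi(x)) via erfc, stable in both tails."""
    if x < 0:
        return -logit_normal_cdf(-x)
    upper_tail = 0.5 * math.erfc(x / math.sqrt(2.0))
    return math.log1p(-upper_tail) - math.log(upper_tail)

def simulate_logits(c, delta_true, n_reps, rng, margin=-0.3, sigma=1.0):
    """Toy stand-in for one Monte Carlo run: sorted logits of the
    posterior probabilities P(delta > margin | data) at cluster count c."""
    logits = []
    for _ in range(n_reps):
        delta_hat = rng.gauss(delta_true, sigma / math.sqrt(c))  # simulated estimate
        logits.append(logit_normal_cdf(math.sqrt(c) * (delta_hat - margin) / sigma))
    return sorted(logits)

def exceed_rate(logits0, logits1, c0, c1, c, gamma):
    """Share of repetitions whose interpolated logit exceeds logit(gamma)."""
    cut = math.log(gamma / (1.0 - gamma))
    w = (c - c0) / (c1 - c0)
    # Pair the r-th order statistic at c0 with the r-th at c1 and move along the line.
    return sum(a + w * (b - a) > cut for a, b in zip(logits0, logits1)) / len(logits0)

def recommend(alt0, alt1, null0, null1, c0, c1, gamma, alpha, power_target):
    """Smallest c in [c0, c1] with estimated power >= target and type I error <= alpha."""
    for c in range(c0, c1 + 1):
        if (exceed_rate(alt0, alt1, c0, c1, c, gamma) >= power_target
                and exceed_rate(null0, null1, c0, c1, c, gamma) <= alpha):
            return c
    return None

def bootstrap_interval(samples, c0, c1, gamma, alpha, power_target, n_boot, rng):
    """Percentile interval from resampling the stored logits with replacement."""
    picks = []
    for _ in range(n_boot):
        resampled = [sorted(rng.choices(s, k=len(s))) for s in samples]
        c = recommend(*resampled, c0, c1, gamma, alpha, power_target)
        if c is not None:
            picks.append(c)
    picks.sort()
    return picks[int(0.025 * len(picks))], picks[int(0.975 * len(picks)) - 1]

rng = random.Random(2025)
c0, c1, n_reps = 100, 140, 2000
alt0 = simulate_logits(c0, 0.0, n_reps, rng)     # alternative: no true difference
alt1 = simulate_logits(c1, 0.0, n_reps, rng)
null0 = simulate_logits(c0, -0.3, n_reps, rng)   # null: true effect sits on the margin
null1 = simulate_logits(c1, -0.3, n_reps, rng)

c_hat = recommend(alt0, alt1, null0, null1, c0, c1,
                  gamma=0.99, alpha=0.025, power_target=0.8)
ci = bootstrap_interval([alt0, alt1, null0, null1], c0, c1,
                        0.99, 0.025, 0.8, n_boot=100, rng=rng)
print("recommended clusters:", c_hat, "bootstrap 95% interval:", ci)
```

Under this toy model the exact answer can be worked out by hand: power equals $\Phi(0.3\sqrt{c} - 2.326)$, which first reaches 0.8 at $c = 112$, so the printed recommendation should land near that value. In a real design, `simulate_logits` would be replaced by the full data-generation and model-fitting pipeline, run only at $c_0$ and $c_1$.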

3. Key Contributions

Extension to Clustered Data: The paper extends previous efficient SSD methods (which were limited to independent data) to cluster-randomized trials and longitudinal studies, accommodating random effects and marginal estimands.
Theoretical Proof: It provides the first rigorous proof that the logit of posterior probabilities in clustered Bayesian designs follows a linear trend with respect to the number of clusters, justifying the use of two-point estimation.
Computational Efficiency: The method reduces the computational cost from simulating across a grid of many sample sizes to simulating at only two cluster counts.
Robustness: The framework handles marginal estimands (requiring G-computation) and accounts for simulation variability via bootstrap confidence intervals.

4. Results

The authors validated the method using an illustrative example inspired by the SSTARLET trial (a cluster-randomized trial for latent tuberculosis treatment).

Setup: They evaluated non-inferiority across four scenarios (clearly acceptable, acceptable, barely acceptable, and unacceptable treatments) and three Intraclass Correlation Coefficient (ICC) settings (low, moderate, high).
Performance:
- Accuracy: The operating characteristics (power and Type I error) estimated using the linear approximation (based on $c_0=100$ and $c_1=140$ ) showed excellent alignment with "true" operating characteristics obtained by simulating 9 different cluster counts ( $c=80$ to $160$).
- Efficiency:
  - Standard Method: Simulating 9 cluster counts took ~35 minutes.
  - Proposed Method: Estimating the curve using only two points took ~8 minutes.
- Recommendations: The method successfully identified optimal cluster counts (e.g., $c=115$ for low ICC, $c=129$ for high ICC) with narrow bootstrap confidence intervals (e.g., $[114, 116]$ ).
- Sensitivity: The results demonstrated that optimal cluster counts are highly sensitive to ICC settings, highlighting the necessity of exploring these parameters efficiently.
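
As a rough sanity check on the timings above, the ratio of run times can be compared with the ratio of simulated cluster counts. The comparison assumes that simulation cost is roughly proportional to the number of cluster counts simulated, which the source does not state.

$$
\frac{8\ \text{min}}{35\ \text{min}} \approx 0.23,
\qquad
\frac{2\ \text{cluster counts}}{9\ \text{cluster counts}} \approx 0.22,
\qquad
\frac{35}{8} \approx 4.4\times\ \text{speed-up}.
$$

The two ratios are close, which is what one would expect if nearly all of the time is spent in the full simulations and the interpolation step itself is cheap.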

5. Significance

Practical Impact: This methodology makes the design of complex Bayesian cluster-randomized trials feasible within realistic timeframes and computational budgets. It allows researchers to explore a wider range of design scenarios (different ICCs, covariates, and effect sizes) without prohibitive costs.
Regulatory Acceptance: By providing a rigorous, simulation-backed method to estimate frequentist operating characteristics (power/Type I error) for Bayesian designs, it addresses a key hurdle in regulatory approval for Bayesian trials.
Future Directions: The paper outlines extensions to adaptive designs (group sequential trials) and platform trials with multiple endpoints, suggesting that the underlying theory of linear trends in sampling distributions can be further generalized to more complex trial structures.

In summary, Hagar and Golchi provide a mathematically sound and computationally superior alternative to traditional "brute-force" simulation for designing Bayesian clinical trials with clustered data, enabling more robust and efficient trial planning.