Designing clinical trials for the comparison of single and multiple quantiles with right-censored data

Imagine you are a doctor trying to decide between two treatments for a serious illness. Traditionally, researchers have used a complex metric called the "Hazard Ratio" to compare them. Think of the Hazard Ratio like a weather forecast that says, "There is a 20% higher chance of rain tomorrow." It's abstract, and it doesn't tell you how much longer you will live if you take the new drug.

This paper introduces a much more human-friendly way to look at the data: Quantiles.

The Core Idea: "How Long Will I Live?"

Instead of abstract probabilities, the authors suggest looking at specific time milestones.

The Median (50th percentile): "How long until half of the patients have passed away?"
The 90th percentile: "How long until 90% of patients have passed away?"

If Treatment A gives a median survival of 12 months and Treatment B gives 18 months, the answer is simple: with Treatment B, the point at which half of the patients have died arrives 6 months later. This is easy for patients and doctors to understand.

The Problem: The "Censored" Mystery

In clinical trials, not everyone dies during the study. Some patients drop out, move away, or the study ends before they pass away. In statistics, this is called right-censored data. It's like a race where some runners are still running when the whistle blows. You know they ran at least that long, but you don't know their final time.

Because of these "incomplete" runners, it's very hard to calculate the exact "density" (how crowded the finish line is at any specific moment) needed to prove that one treatment is truly better than the other.

The Old Way vs. The New Way

The Old Way (Kernel Density Estimation):
Imagine trying to guess the shape of a mountain by taking a photo of the whole landscape and smoothing it out with a blurry filter. You have to estimate the height of the mountain at every single point on the map, even the parts you don't care about. This is slow, requires a lot of guesswork (tuning a "bandwidth" knob), and can get messy.

The New Way (Resampling/LS Method):
The authors propose a smarter trick. Instead of mapping the whole mountain, they say, "Let's just zoom in on the specific peak we care about."
They use a technique called Resampling. Imagine you have a bag of marbles representing your data. You shake the bag, pull out a handful, measure the height of the peak, put them back, and repeat this thousands of times. By looking at the pattern of these thousands of "mini-experiments," they can estimate the height of the peak (the density) right where they need it, without mapping the whole mountain or tuning any blurry filter.

The "Power" of the Formula

The biggest contribution of this paper is a new calculator (a power formula).
Before this, if a researcher wanted to design a new clinical trial, they had to guess: "How many patients do we need to recruit to prove our drug works?" They often had to run expensive computer simulations to guess the answer.

Now, thanks to this paper, researchers have a blueprint. They can plug in the effect they want to detect (e.g., "We want to detect a 3-month difference in survival") and the formula tells them directly: "You need about 500 patients." This saves time and money, and helps ensure the trial is big enough to find the truth.

Real-World Test: The OAK Study

The authors tested their new method on a real lung cancer trial (the OAK study). In this trial, the new drug (immunotherapy) didn't work immediately; it took time to kick in. This broke the rules of traditional statistics (which assume the benefit is constant).

The Result: Their new method successfully detected that the immunotherapy group lived significantly longer.
The Comparison: When they compared their "zoom-in" resampling method against the old "blurry photo" method, the new method gave stronger, more reliable evidence, especially when looking at multiple time points at once.

The Takeaway

This paper is like giving doctors and researchers a GPS, a ruler, and a zoom lens for clinical trials.

The GPS: It helps them design the trial correctly from the start so they don't waste resources.
The Ruler: It measures success in "months of life gained" rather than abstract numbers, making the results clear for patients.
The Zoom Lens: It uses a clever resampling trick to get accurate measurements even when the data is messy or incomplete.

In short, it makes clinical trials more efficient, more accurate, and much easier for humans to understand.

Here is a detailed technical summary of the paper "Designing clinical trials for the comparison of single and multiple quantiles with right-censored data."

1. Problem Statement

In clinical trials involving right-censored survival data, the Hazard Ratio (HR) is the standard metric for comparing treatment arms. However, the HR relies on the proportional hazards (PH) assumption, which is frequently violated in modern therapies, particularly in immuno-oncology where treatment effects are delayed (late separation of survival curves). Furthermore, the HR is a relative measure that can be difficult for clinicians and patients to interpret intuitively.

Quantiles of survival time (e.g., median survival time) offer a robust alternative because they:

Are expressed in the time scale (e.g., "months gained"), making them clinically intuitive.
Do not require the proportional hazards assumption.
Are robust to outliers.

The Gap: While methods exist to compare quantiles (e.g., Brookmeyer-Crowley, Tang-Jeong), they suffer from limitations:

They often rely on strong assumptions (e.g., identical survival distributions under the null) that inflate Type I error.
They typically address only single quantiles, lacking a framework for comparing multiple quantiles simultaneously.
Crucially, no explicit power formulas or sample size calculation methods existed for these tests, making trial design difficult.
Existing methods for estimating the variance of the test statistic (which depends on the probability density function at the quantile) often use kernel density estimators (KDE), which require bandwidth selection and have slow convergence rates.

2. Methodology

The authors extend the nonparametric two-sample test for quantile equality originally proposed by Kosorok (1999). The methodology is divided into theoretical derivation, variance estimation, and power calculation.

A. Statistical Framework

The study considers a two-arm trial with arm sizes $n_1$ and $n_2$ and right-censored data. Let $F_k$ be the cumulative distribution function of failure times for group $k$ ($k = 1, 2$). The goal is to test:

Univariate: $H_0: F_1^{-1}(p) = F_2^{-1}(p)$ vs. $H_1: F_1^{-1}(p) - F_2^{-1}(p) = \Delta$ .
Multivariate: $H_0: F_1^{-1}(p_j) = F_2^{-1}(p_j)$ for $j=1, \dots, J$ vs. $H_1$ : at least one difference exists.

The test statistic relies on the asymptotic normality of the Kaplan-Meier quantile estimator. The variance of this estimator depends on the probability density function (PDF), $f_k$ , evaluated at the quantile of interest.
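As an illustration of the estimator the framework builds on, the following is a minimal sketch of a Kaplan-Meier curve and the quantile read off from it. The function names and the toy data are hypothetical and not taken from the paper; the quantile is defined here as the first event time at which the estimated $F_k$ reaches $p$, which is one common convention.

```python
def kaplan_meier(times, events):
    """Return the distinct event times and the Kaplan-Meier survival
    estimate S(t) just after each of them.

    times  : observed follow-up times
    events : 1 if the failure was observed, 0 if right-censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    event_times, surv_values = [], []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            event_times.append(t)
            surv_values.append(surv)
        n_at_risk -= removed
    return event_times, surv_values


def km_quantile(times, events, p):
    """Smallest event time t with 1 - S(t) >= p, or None when the
    curve never reaches p because of censoring."""
    event_times, surv_values = kaplan_meier(times, events)
    for t, s in zip(event_times, surv_values):
        if 1.0 - s >= p:
            return t
    return None


# Hypothetical toy data (months); 0 marks a censored patient.
toy_times = [2, 3, 3, 5, 6, 8, 9, 12]
toy_events = [1, 1, 0, 1, 0, 1, 1, 0]
median = km_quantile(toy_times, toy_events, 0.5)      # 8
upper = km_quantile(toy_times, toy_events, 0.9)       # None: not reached
```

The `None` case matters in practice: with heavy censoring, high quantiles may not be estimable at all, which limits the set of $p_j$ that can enter a multivariate test.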

B. Density Estimation Innovation

A critical bottleneck in applying Kosorok's test is estimating the density $f_k$ at the specific quantile.

Original Approach (Kosorok): Uses Kernel Density Estimation (KDE). This requires estimating density over the entire support and selecting a bandwidth parameter, which is computationally intensive and prone to variability.
Proposed Approach (Resampling/LS): The authors propose a resampling-based method inspired by Lin et al. This method generates multiple realizations of a centered Gaussian variable and uses least squares estimation to directly estimate the density only at the specific quantile of interest.
- Advantage: It avoids bandwidth selection and estimates the parameter directly, leading to a faster rate of convergence and lower Mean Squared Error (MSE) compared to KDE.
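The resampling idea can be sketched as follows. This is one plausible reading of the description above, not the authors' implementation: it assumes standard normal perturbations $Z_b$, a perturbation scale of $n^{-1/2}$, and a no-intercept least-squares fit of $\sqrt{n}\,[\hat F(\hat q + Z_b/\sqrt{n}) - \hat F(\hat q)]$ on $Z_b$, whose slope is taken as the density estimate at $\hat q$. All function names are hypothetical.

```python
import bisect
import math
import random


def km_cdf(times, events):
    """Kaplan-Meier estimate of F = 1 - S as (jump times, F values)."""
    data = sorted(zip(times, events))
    at_risk, surv = len(data), 1.0
    jump_times, cdf_values = [], []
    i = 0
    while i < len(data):
        t, deaths, removed = data[i][0], 0, 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            jump_times.append(t)
            cdf_values.append(1.0 - surv)
        at_risk -= removed
    return jump_times, cdf_values


def step_eval(jump_times, cdf_values, t):
    """Evaluate the right-continuous step function F_hat at t."""
    k = bisect.bisect_right(jump_times, t)
    return cdf_values[k - 1] if k > 0 else 0.0


def ls_density_at_quantile(times, events, p, n_draws=500, seed=0):
    """Return (q_hat, f_hat): the p-th quantile of F_hat and the
    least-squares slope of F_hat around it (assumed scheme, see text)."""
    rng = random.Random(seed)
    root_n = math.sqrt(len(times))
    jump_times, cdf_values = km_cdf(times, events)
    k = bisect.bisect_left(cdf_values, p)   # first jump where F_hat >= p
    if k == len(cdf_values):
        raise ValueError("quantile not reached because of censoring")
    q_hat, f_at_q = jump_times[k], cdf_values[k]
    num = den = 0.0
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        shifted = step_eval(jump_times, cdf_values, q_hat + z / root_n)
        y = root_n * (shifted - f_at_q)     # approx. f(q) * z
        num += z * y
        den += z * z
    return q_hat, num / den                 # no-intercept LS slope


# Sanity check on an evenly spaced grid, i.e. a uniform(0, 1) sample
# whose true density is 1 everywhere.
grid = [(i + 0.5) / 400 for i in range(400)]
q_hat, f_hat = ls_density_at_quantile(grid, [1] * 400, 0.5)
```

No bandwidth appears anywhere in this sketch, and only the neighbourhood of $\hat q$ is ever evaluated, which is the point of contrast with KDE made above.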

C. Power and Sample Size Derivation

The core theoretical contribution is the derivation of closed-form asymptotic power formulas:

Univariate Case: The test statistic follows a standard normal distribution under $H_0$ . The power is derived based on the non-centrality parameter involving the sample size ( $n$ ), the quantile difference ( $\Delta$ ), and the estimated variance ( $\hat{\sigma}^2$ ).
Multivariate Case: The test statistic follows a Chi-squared distribution ( $\chi^2_J$ ) under $H_0$ and a non-central Chi-squared distribution under $H_1$ .
Application: These formulas allow researchers to calculate the minimum sample size required to detect a specific difference in quantiles with a desired power ($1-\beta$) and significance level ($\alpha$).
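To make the univariate calculation concrete, here is a minimal sketch of a generic two-sided Wald-type power and sample size computation for an asymptotically normal statistic. It is not the paper's exact formula: it assumes the quantile difference has standard error $\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$, where $\sigma_k^2$ denotes the asymptotic variance of $\sqrt{n_k}$ times the quantile estimator in arm $k$. The function names and the numbers in the example are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_two_sided(delta, sigma1_sq, sigma2_sq, n1, n2, alpha=0.05):
    """Approximate power to detect a quantile difference `delta`."""
    se = math.sqrt(sigma1_sq / n1 + sigma2_sq / n2)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(delta) / se               # mean of the statistic under H1
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def n_per_arm(delta, sigma1_sq, sigma2_sq, power=0.8, alpha=0.05):
    """Smallest equal arm size reaching `power`, ignoring the negligible
    rejection probability in the opposite tail."""
    z_a = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_b = _STD_NORMAL.inv_cdf(power)
    return math.ceil((z_a + z_b) ** 2 * (sigma1_sq + sigma2_sq) / delta ** 2)


# Hypothetical design: detect a 3-month median difference when each
# arm's asymptotic variance is 100 months^2.
n = n_per_arm(delta=3.0, sigma1_sq=100.0, sigma2_sq=100.0)   # 175
achieved = power_two_sided(3.0, 100.0, 100.0, n, n)          # just above 0.80
```

In a real design the $\sigma_k^2$ inputs are the hard part, since they depend on the density at the quantile and on the censoring pattern. The multivariate analogue would replace the normal tail with a non-central $\chi^2_J$ tail, which the Python standard library does not provide.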

3. Key Contributions

Explicit Power Formulas: The paper provides the first analytical expressions for calculating power and sample size for both single and multiple quantile comparisons in the presence of right-censored data.
Improved Density Estimation: Introduction of a resampling-based (Least Squares) method for estimating the density at the quantile, which outperforms traditional KDE in terms of accuracy and computational efficiency for this specific task.
Multivariate Testing Framework: A unified approach to test the equality of a collection of quantiles simultaneously, controlling the family-wise error rate, which is essential for capturing complex survival curve shapes (e.g., delayed effects).
Trial Design Tool: A practical toolkit for designing clinical trials where the PH assumption is violated, enabling the use of "time gained" as a primary or co-primary endpoint.

4. Results

The authors validated their methods through simulations and a real-world application.

A. Simulation Studies

Scenarios: Tested under both Proportional Hazards (Exponential distributions) and Non-Proportional Hazards (Piecewise exponential with delayed effects).
Performance:
- The analytical power formulas provided accurate approximations of empirical power even with modest sample sizes ( $n=50$ to $500$).
- Type I error rates were well-controlled (close to the nominal 0.05 level).
- The resampling (LS) method for density estimation consistently yielded higher statistical power and smaller p-values than KDE, particularly in scenarios with non-proportional hazards.
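A delayed-effect scenario of the kind described above can be simulated with a two-piece exponential model. The sketch below is illustrative only: the rates, the delay, the exponential censoring mechanism, and the function names are assumptions, not the paper's simulation settings.

```python
import math
import random


def sample_delayed_effect(n, base_rate, late_rate, delay, rng):
    """Draw n failure times whose hazard is `base_rate` before `delay`
    and `late_rate` afterwards, by inverting the cumulative hazard."""
    samples = []
    for _ in range(n):
        e = rng.expovariate(1.0)            # target cumulative hazard
        if e < base_rate * delay:
            samples.append(e / base_rate)
        else:
            samples.append(delay + (e - base_rate * delay) / late_rate)
    return samples


def apply_censoring(failure_times, censor_rate, rng):
    """Independent exponential censoring: returns observed times and
    event indicators (1 = failure observed, 0 = right-censored)."""
    observed, events = [], []
    for t in failure_times:
        c = rng.expovariate(censor_rate)
        observed.append(min(t, c))
        events.append(1 if t <= c else 0)
    return observed, events


rng = random.Random(2024)
# Control arm: constant hazard. Treated arm: same hazard for the first
# 3 months, then halved (hypothetical values).
control = sample_delayed_effect(500, 0.10, 0.10, 3.0, rng)
treated = sample_delayed_effect(500, 0.10, 0.05, 3.0, rng)
obs_control, ev_control = apply_censoring(control, 0.02, rng)
obs_treated, ev_treated = apply_censoring(treated, 0.02, rng)
```

Because both arms share the same hazard before the delay, their low quantiles coincide and only the later quantiles separate, which is the pattern a multiple-quantile test is meant to pick up.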

B. Real Data Application: The OAK Trial

The method was applied to the OAK phase III randomized clinical trial (Atezolizumab vs. Docetaxel in metastatic non-small-cell lung cancer), a study known for non-proportional hazards.

Univariate Results: At the median ( $p=0.5$ ), the immunotherapy group showed a survival benefit of 4.04 months over chemotherapy. The proposed test (using LS density estimation) yielded a highly significant p-value ($5.03 \times 10^{-4}$), outperforming the Tang-Jeong test.
Multivariate Results:
- When testing pairs of quantiles, the LS method detected significant differences where KDE failed (e.g., in the pair $\{0.1, 0.5\}$ , LS rejected $H_0$ while KDE did not).
- A global multivariate test on quantiles $\{0.05, 0.1, 0.5, 0.7\}$ strongly rejected the null hypothesis ( $p = 5.87 \times 10^{-5}$ ).
- Post-hoc univariate tests (with Bonferroni correction) localized the significant differences specifically to the median ($0.5$) and the 70th percentile ($0.7$), while lower quantiles showed no difference. This highlights the "delayed effect" nature of the treatment.

5. Significance and Implications

Addressing Non-Proportional Hazards: This work provides a rigorous statistical framework for analyzing trials where treatment effects emerge late, a common scenario in modern oncology that renders standard HR-based designs suboptimal.
Clinical Interpretability: By focusing on quantiles (time gained), the results are directly communicable to patients and clinicians, bridging the gap between statistical significance and clinical relevance.
Efficient Trial Design: The derived power formulas allow sponsors to accurately calculate sample sizes for quantile-based endpoints, potentially reducing the cost and duration of trials by avoiding over- or under-powering.
Methodological Robustness: The proposed resampling density estimator offers a superior alternative to kernel methods for this specific application, improving the reliability of statistical inference in survival analysis.

In conclusion, this paper fills a critical gap in clinical trial methodology by enabling the design and analysis of studies based on survival quantiles, offering a robust solution for the increasingly common non-proportional hazards scenarios in medical research.