Benchmark of biomarker identification and prognostic modeling methods on diverse censored data

⚕️

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content. Read full disclaimer

Imagine you are a detective trying to solve a massive crime: Cancer. You have a suspect list of 20,000 potential "culprits" (genes), but you know that only a tiny handful (maybe 20 or 30) are actually responsible for the disease. Your goal is to find the real culprits, build a profile of them, and predict how long a patient might survive.

However, you face three major problems:

The Needle in a Haystack: You have way more suspects (genes) than witnesses (patients).
The Alibi Network: Many suspects are friends with each other (correlated genes), making it hard to tell who is actually guilty.
The Missing Witness: Some patients drop out of the study before the crime is fully resolved (censored data), leaving you with incomplete information.

This paper is a massive "taste test" or "car crash test" for different detective tools (statistical methods) to see which one is best at solving this specific type of mystery.

The Contestants: The Detective Tools

The authors gathered a lineup of 9 different "detectives" (statistical methods) to see who performs best. They split them into two teams:

The "Embedded" Detectives (The All-in-Ones): These tools try to find the bad guys while they are building the profile. They do the investigation and the profiling simultaneously.
- Examples: LASSO, Adaptive LASSO, Elastic Net, CoxBoost, Random Survival Forests.
- Analogy: Imagine a detective who interviews suspects and builds the case file at the same time, instantly ignoring the innocent ones.
The "Filter" Detectives (The Screeners): These tools look at each suspect individually first to see if they look suspicious, then they build the case file with the survivors.
- Examples: Benjamini-Hochberg, Q-value, CARS.
- Analogy: Imagine a security guard at the door who checks ID cards one by one. If a card looks bad, they are kicked out before the detective even gets to talk to them.

The Training Ground: Synthetic Data

Before testing on real patients, the authors created 18 different "simulated crime scenes" (synthetic datasets).

They varied the difficulty: Sometimes the bad guys were easy to spot (strong signals), sometimes they were hiding well (weak signals).
They changed the rules: Sometimes the suspects were all friends with each other (high correlation), sometimes they were strangers (independent).
They changed the crowd size: Sometimes there were very few bad guys (low sparsity), sometimes more (high sparsity).

They ran every detective through every scenario 200 times to see who made the fewest mistakes.

The Real World Test: The Bladder Cancer Case

After the simulations, they took the tools to a real crime scene: The Cancer Genome Atlas (TCGA) Bladder Cancer dataset.

They had 423 real patients and 20,000 genes.
They used a "pre-screening" step to reduce the suspect list from 20,000 down to 3,000 (because 20,000 is too many for a computer to handle efficiently).
They ran the detectives on this real data to see how they performed when the "true answers" weren't known.

The Scorecard: How Did They Do?

The authors judged the detectives on two main skills:

Finding the Culprits (Feature Selection): Did they pick the right genes? Did they avoid picking innocent people (False Discoveries)?
Predicting the Future (Prognosis): Did they accurately predict how long a patient would live?

The Winners:

The Gold Medalists: Adaptive LASSO and CoxBoost were the most consistent winners. They were like the Sherlock Holmes of the group: they found the right suspects and predicted the timeline accurately, no matter how tricky the data was.
The Strong Runners-Up: LASSO and Elastic Net were also very good, especially at ranking patients by risk.
The "Filter" Surprise: The CARS filter (a specific screening method) was surprisingly good, especially when it used a new, smarter way to decide who to keep (called the "MSR" method).
The Underperformers:
- Benjamini-Hochberg and Q-value: These were the "overzealous security guards." In some easy tests, they were great at filtering out noise. But in harder, more realistic tests, they got confused by the "friend networks" (correlations) and let too many innocent people through, or kicked out the guilty ones.
- Random Survival Forests (RSF): These are powerful but slow. They struggled a bit until the authors gave them a "pre-screening" step (sRSF) to reduce the suspect list first. Once they had fewer suspects to look at, they became much faster and more accurate.

The Big Takeaway

If you are a cancer researcher trying to find biomarkers (the "bad genes") and predict patient survival:

Don't just use the old-school "screening" methods (like Benjamini-Hochberg) alone. They get confused when genes are correlated.
Use the "All-in-One" detectives. Specifically, Adaptive LASSO and CoxBoost are the most reliable tools for the job.
If you use the "Forest" method (RSF), make sure you filter the data first to reduce the noise, or it will get bogged down.

In short: The paper provides a "User Manual" for scientists, telling them exactly which mathematical tool to grab from the toolbox to solve the complex puzzle of cancer survival data, saving them time and preventing them from chasing false leads.

1. Problem Statement

The identification of biomarkers and the development of prognostic models using genomic data face significant challenges due to specific data characteristics:

High Dimensionality: The number of predictors (genes/features, $p$ ) far exceeds the number of observations ( $n$ ), often referred to as "high- $p$ , low- $n$ ."
Right-Censoring: Survival data often includes subjects who have not experienced the event (e.g., death) by the end of the study, requiring specialized statistical handling.
Correlation: Predictors (genes) are often highly correlated with one another.
Sparsity: Only a small fraction of the available features are truly informative (regulatory genes) regarding the survival outcome.

While numerous statistical and machine learning methods have been developed to address these issues (e.g., regularization, tree-based ensembles), there is a lack of large-scale, comprehensive benchmarks comparing their performance across diverse scenarios of sparsity, correlation, and signal strength specifically for right-censored time-to-event data.

2. Methodology

Methods Evaluated

The authors benchmarked nine prominent methods, categorized into Embedded (feature selection integrated into model fitting) and Filter (feature selection performed independently before modeling) approaches:

Embedded Methods:
- LASSO: Least Absolute Shrinkage and Selection Operator (L1 penalty).
- Adaptive LASSO (ALASSO): LASSO with adaptive weights to handle correlations better.
- Elastic Net (ENET): Combination of L1 and L2 penalties.
- CoxBoost (CB): Regularized gradient boosting for Cox proportional hazards.
- Random Survival Forest (RSF): Non-parametric ensemble of decision trees.
- Screened RSF (sRSF): RSF preceded by a univariate Cox screening step.
Filter Methods:
- Benjamini-Hochberg (BH): False Discovery Rate (FDR) control via univariate Cox p-values.
- Q-value (QV): FDR control using q-values.
- Correlation-Adjusted Regression Survival (CARS): A filter method adjusting for predictor correlations, using two thresholding techniques: MED (Maximal Euclidean Distance) and MSR (Minimal Sextic Residuals).

Experimental Design

The study utilized two distinct simulation settings and a real-world application:

Simulation Setting I (Synthetic Data):
- Generated 18 unique scenarios varying by:
  - Sparsity ( $s$ ): 2%, 5%, 10% (proportion of non-zero coefficients).
  - Correlation ( $\alpha$ ): 0 (independent) and 0.5 (correlated).
  - Signal Strength ( $\gamma$ ): 0.5 (weak), 1 (moderate), 2 (strong).
- Sample size: $n=300$ , Features: $p=1000$ .
- 200 datasets generated per scenario.
Simulation Setting II (TCGA-Mimic):
- Simulated data mimicking a real Bladder Cancer (BLCA) cohort from The Cancer Genome Atlas (TCGA).
- $n=423$ , $p=3000$ .
- True signals derived from real data estimates.
Real Data Analysis:
- Applied to the TCGA-BLCA cohort ( $n=423$ , initial $p=20,240$ ).
- Preliminary Feature Selection (PFS): Used CARS to reduce features to 3,000 before analysis.
- Validation: Nested 10-fold cross-validation.
- Ground Truth: A subset of 30 known literature-supported biomarkers was used to evaluate feature selection (though exhaustive ground truth is unknown in real data).

Performance Metrics

Feature Selection:
- False Discovery Rate (FDR): $FP / (TP + FP)$.
- F1-Score: Harmonic mean of Precision and Recall.
- Dice Coefficient: Used for stability in real data analysis.
Predictive Performance:
- Concordance Index (CI): Ranking ability of predicted risks.
- Brier Score: Accuracy of predicted survival probabilities (time-specific).
- Root Mean Squared Error (RMSE): Deviation between predicted and true event times (simulation only).
Efficiency: Computation time.

3. Key Contributions

Comprehensive Benchmarking: First large-scale comparison of embedded and filter methods specifically tailored for right-censored genomic data, covering a wider variety of methods and data characteristics than previous studies.
Novel Thresholding for CARS: Introduced and evaluated the MSR (Minimal Sextic Residuals) method for determining the elbow point in CARS scores, demonstrating it is more conservative and often superior to the standard MED method.
Dual-Setting Validation: Combined rigorous synthetic simulations (with known ground truth) with a realistic TCGA-mimic setting and a real-world application to ensure robustness.
Practical Guidance: Provided specific recommendations for researchers based on data characteristics (e.g., sparsity levels, correlation structures).

4. Key Results

Feature Selection Performance

FDR Control: BH and QV procedures consistently achieved the lowest FDR in low-sparsity synthetic settings but performed poorly (high FDR) in the TCGA-mimic setting and real data, often selecting too many features due to univariate screening limitations.
F1-Score (Selection Accuracy):
- ALASSO and CoxBoost consistently achieved the highest F1-scores across most scenarios.
- LASSO excelled in high-sparsity (10%) settings.
- Filter methods (BH, QV, CARS) generally underperformed in F1-score compared to embedded methods, except for CARS (MSR) which showed competitive results.
Stability: CARS (MSR) and BH/QV showed high stability (high Dice coefficient) in real data, whereas LASSO showed instability in the presence of multicollinearity.

Predictive Performance

Concordance Index (CI): LASSO, ALASSO, and ENET generally outperformed other methods. BH and QV performed no better than random guessing (CI $\approx$ 0.5).
Brier Score & RMSE:
- ALASSO and CoxBoost were the top performers in simulation settings.
- CARS (MSR) was the best filter method for Brier score.
- In the real data analysis, sRSF and CARS (MSR) showed excellent Brier scores, particularly at longer time horizons (1,000 days), outperforming parametric methods which tended to be over-optimistic.
Computation Time:
- CARS (MED) was the fastest.
- ALASSO was notably fast among embedded methods.
- RSF (without screening) was the most computationally intensive.

Real Data Insights (TCGA-BLCA)

Screening Necessity: Applying a preliminary filter (CARS) significantly reduced dimensionality and multicollinearity, improving the performance of downstream models.
Model Calibration: Parametric methods (LASSO, ALASSO) tended to be over-optimistic in survival probability predictions at longer horizons (3-5 years), while non-parametric methods (RSF) showed better calibration but segmented risk groups more aggressively.
Top Performers: ALASSO and CoxBoost were recommended for general use due to balanced performance in selection and prediction. CARS (MSR) was recommended as the best filter method for dimensionality reduction.

5. Significance

This study provides a critical roadmap for cancer researchers and bioinformaticians working with high-dimensional survival data.

Method Selection: It clarifies that while univariate filters (BH/QV) are good for strict FDR control in simple settings, they fail in complex, correlated genomic data. Embedded regularized methods (specifically ALASSO and CoxBoost) are generally superior for simultaneous feature selection and prognostic modeling.
Hybrid Approaches: The study validates the utility of using a robust filter (like CARS with MSR) as a pre-processing step before applying complex models, especially for non-parametric methods like RSF.
Reproducibility: The authors provide open-source R code and a detailed repository, enabling other researchers to replicate the benchmark and apply the findings to their own genomic datasets.

In conclusion, the paper argues against a "one-size-fits-all" approach, recommending ALASSO and CoxBoost for general applications, while highlighting the importance of CARS (MSR) for dimensionality reduction in highly correlated, high-dimensional genomic data.