Imagine you are a detective trying to solve a mystery. You have a list of 50 potential suspects (predictors), but you know that only a few of them actually committed the crime. Your goal is to figure out who did it and how they did it, without getting distracted by the innocent bystanders.
In the world of statistics, this is called Logistic Regression. It's a tool used to predict binary outcomes (like "Will the patient get sick?" or "Will the customer buy the product?"). The problem is, just like in your detective story, you often don't know which "suspects" (variables) are the real culprits. This uncertainty is the main challenge this paper tackles.
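The setup can be sketched in a few lines. Here is a hedged, hypothetical example (fake data, scikit-learn's `LogisticRegression`; not from the paper) of predicting a binary outcome from several candidate predictors, where only one "suspect" actually matters:

```python
# Hypothetical sketch of the setup: 5 candidate predictors ("suspects"),
# but only the first one truly drives the binary outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # 200 cases, 5 predictors
true_logits = 2.0 * X[:, 0]                   # only predictor 0 matters
y = (rng.random(200) < 1 / (1 + np.exp(-true_logits))).astype(int)

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]          # P(outcome = 1) for each case
```

The challenge the paper studies is exactly this: the fitting step is easy, but deciding which of the 5 (or 50) predictors belongs in the model is not.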
The authors, a team of statisticians from the University of Amsterdam and the University of Washington, decided to run a massive taste test. They gathered 28 different "recipes" (statistical methods) that people use to solve this mystery. They cooked up 11 different scenarios using real-world data (from heart disease to voting patterns) and tested every recipe to see which one gave the most accurate, stable, and fast results.
Here is the breakdown of their findings, translated into everyday language:
The Two Big Teams
The 28 methods fell into two main camps:
The "All-Hands-On-Deck" Team (Bayesian Model Averaging):
- The Philosophy: Instead of picking just one suspect, this team says, "Let's look at every possible combination of suspects, weigh the evidence for each, and take a weighted average of all the answers."
- The Analogy: Imagine asking 100 different detectives to write down their theories. Instead of picking the loudest one, you take a vote, giving more weight to the detectives who have a better track record.
- The Star Performer: The Benchmark Prior. When the data was "clean" (no weird glitches), this method was the Sherlock Holmes of the group. It was incredibly accurate at finding the right variables and predicting outcomes.
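To make the "weighted vote" concrete, here is a rough illustration of the model-averaging idea, using a BIC approximation to each model's evidence. This is my own sketch, not the paper's Benchmark Prior implementation:

```python
# Rough sketch of Bayesian model averaging (BIC-weight approximation,
# NOT the paper's actual Benchmark Prior): fit every subset of predictors,
# weight each model by exp(-BIC/2), and average the predicted probabilities.
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 3))
y = (rng.random(150) < 1 / (1 + np.exp(-2 * X[:, 0]))).astype(int)

bics, preds = [], []
for k in range(1, 4):
    for cols in itertools.combinations(range(3), k):
        m = LogisticRegression(C=1e6).fit(X[:, cols], y)  # huge C ~ plain MLE
        p = m.predict_proba(X[:, cols])[:, 1]
        loglik = -log_loss(y, p, normalize=False)
        bics.append((len(cols) + 1) * np.log(len(y)) - 2 * loglik)
        preds.append(p)

bics = np.array(bics)
weights = np.exp(-0.5 * (bics - bics.min()))   # relative model evidence
weights /= weights.sum()
avg_prob = weights @ np.vstack(preds)          # the "weighted vote"
```

Each candidate model is one "detective," and models whose evidence (here, BIC) is better get a louder voice in the final averaged prediction.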
The "Sculptor" Team (Penalized Likelihood / LASSO):
- The Philosophy: These methods start with all the suspects and then aggressively carve away the ones that don't seem important, forcing their influence down to zero.
- The Analogy: Imagine a sculptor with a block of marble (all the data). They chip away everything that isn't the statue, leaving only the essential shape.
- The Star Performer: The LASSO and its smoother cousin, the Induced Smoothed LASSO. These were the heavy hitters when things got messy.
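The "carving" can be seen directly in code. In this hypothetical example (fake data; the `liblinear` solver, which supports an L1 penalty), the LASSO sets the coefficients of irrelevant predictors to exactly zero:

```python
# Hypothetical sketch of L1 "sculpting": 10 candidate predictors,
# only 2 real ones; the LASSO penalty zeroes out noise coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
logits = 2 * X[:, 0] - 2 * X[:, 1]             # only suspects 0 and 1 matter
y = (rng.random(300) < 1 / (1 + np.exp(-logits))).astype(int)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.3).fit(X, y)
n_carved = int(np.sum(lasso.coef_[0] == 0))    # suspects chipped away to zero
```

Unlike an L2 (ridge) penalty, which only shrinks coefficients toward zero, the L1 penalty produces exact zeros, which is what makes it a variable-selection tool and not just a stabilizer.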
The "Glitch" in the Matrix: Separation
The most important discovery in this paper involves a specific problem called Separation.
- What is it? Imagine you are trying to predict rain, and one of your variables perfectly splits the outcomes: whenever it's present, it always rains; whenever it's absent, it never does. This perfect split breaks the usual fitting math. The best-fitting coefficient is infinite, so the estimates drift off toward infinity and the algorithm never converges (it doesn't literally crash, it just can't settle on an answer).
- The Real-World Impact: This happens often in small studies or when you have too many variables compared to the number of people you surveyed.
The Results with Separation:
- The "All-Hands" Team (Bayesian): Most of them stumbled. Their math got confused by the perfect prediction, and their estimates went haywire.
- The "Sculptor" Team (Penalized): They didn't panic. Because they use "regularization" (a mathematical safety net that prevents numbers from getting too huge), methods like LASSO and Elastic Net remained stable. They kept working even when the math was screaming.
- The Surprise Hero: One Bayesian method, called EB-local, was the only one from the "All-Hands" team that didn't crumble. It was like a Swiss Army knife that worked well whether the data was clean or messy.
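A tiny, hypothetical demonstration of why the penalized team survives: on perfectly separated data, a nearly unpenalized fit chases an infinite coefficient, while an L1-penalized fit stays bounded:

```python
# Assumed illustration of separation: the single predictor perfectly
# splits the outcome, so the unpenalized likelihood has no finite maximum.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[-2.0], [-1.0], [1.0], [2.0]] * 10)
y = (X[:, 0] > 0).astype(int)                  # perfect separation

# Huge C ~ almost no penalty: the coefficient runs toward infinity
# (the solver just stops once its tolerance or iteration cap is hit).
runaway = LogisticRegression(C=1e8, max_iter=10_000).fit(X, y)

# Moderate L1 penalty: the "safety net" keeps the estimate finite.
stable = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
```

The runaway coefficient is an artifact of where the optimizer gave up, not a meaningful estimate; the penalized coefficient is a stable, reproducible answer.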
The Losers (Who to Avoid)
The paper also highlighted some methods that are popular but perform poorly:
- Stepwise Selection (Forward/Backward): These are the "old school" methods where you add or remove variables one by one based on a simple rule (like a p-value). The study found these to be slow, unstable, and prone to picking the wrong suspects. It's like a detective who only follows the first clue they see and ignores the rest.
- P-value Thresholds: Relying on arbitrary cut-offs (like "p < 0.05") to decide what to include was shown to be unreliable, especially when separation occurred.
The Final Verdict: What Should You Do?
The authors give you a simple decision tree based on their findings:
If your data is "clean" (no separation issues):
Go with the Bayesian Model Averaging methods, specifically the Benchmark Prior. It's the most accurate and gives you a nice, full picture of the uncertainty. It's like having a super-smart AI that considers every angle.
If your data is "messy" (small sample size, many variables, or separation):
Go with Penalized Likelihood methods like LASSO or Elastic Net. They are the sturdy workhorses that won't crash when the math gets weird. They might not give you a full probability distribution, but they will give you a stable answer.
If you aren't sure which situation you are in:
Use the EB-local method. It's the "Jack of all trades." It performed very well in both clean and messy scenarios, making it a safe, robust default choice.
Why This Matters
For decades, researchers have been guessing which statistical tool to use. Some used the "old school" step-by-step methods because they were easy to find in software. Others used complex Bayesian methods because they sounded fancy.
This paper is like a Consumer Reports for statistical methods. It says: "Stop guessing. If you want accuracy, use the Benchmark Prior. If you want stability in tough conditions, use LASSO. And if you want a safe bet, use EB-local."
By testing these methods on real-world data structures rather than just made-up numbers, the authors have given scientists, doctors, and data analysts a clear map to navigate the fog of model uncertainty.