Comparative Evaluation of Logistic Regression and… — Plain-Language Explanation

Imagine you are the captain of a ship, and your job is to navigate through a foggy ocean. You know that "storms" (flu outbreaks) happen every winter, but you don't know exactly when the next big wave will hit. Usually, you only realize a storm is coming when you see the first massive wave crash over the deck. By then, it's often too late to prepare the crew or secure the cargo.

This paper is about building a better radar system to see those storms coming before the first wave hits.

Here is the story of the research, broken down into simple terms:

1. The Problem: Looking in the Rearview Mirror

Right now, public health officials (like the CDC) act like drivers looking in their rearview mirror. They collect data every week about how many people are sick with "flu-like" symptoms. They can tell you, "Hey, last week, the flu was bad." But they often struggle to say, "The flu is about to get bad next week."

The researchers wanted to change this. They wanted to turn the data into a traffic light system:

Green: Everything is normal.
Red: An outbreak is happening (or about to happen).

2. The Tools: Two Different Navigators

To build this radar, the researchers tested two different "navigators" (computer models) to see which one could spot the storm first and most accurately.

Navigator A (Logistic Regression): Think of this as a veteran sailor. It's an old-school, tried-and-true method. It looks at the past few weeks of weather and uses simple math to guess if a storm is coming. It's transparent, easy to understand, and very reliable.
Navigator B (XGBoost / Gradient Boosting): Think of this as a high-tech AI robot. It's a modern machine learning tool that can spot incredibly complex patterns in the data that a human or a simple sailor might miss. It's like having a supercomputer that can read the clouds, the wind, and the water temperature all at once.

3. The Training: Learning from History

The researchers didn't just guess. They taught both navigators using about eight years of historical data (from 2010 to 2017). They defined a "storm" (outbreak) as any week where the share of doctor visits for flu-like illness rose above a specific high mark (the 90th percentile of the training years).

Once the navigators learned the rules, the researchers tested them on new, unseen data (from 2020 to 2025). This is like giving the sailors a map of a part of the ocean they had never seen before to see if they could still find the storms.

4. The Results: A Surprising Tie

Here is the twist: Both navigators were incredibly good.

The Veteran Sailor (Logistic Regression): It was almost perfect. It spotted 100% of the actual outbreak weeks, so it never missed a storm. The price was a handful of false alarms, where it sounded the alarm when there was no storm.
The AI Robot (XGBoost): It was also nearly perfect. It sounded fewer false alarms than the sailor, but it missed roughly one in ten of the actual outbreak weeks.

The Big Takeaway: The fancy, complex AI robot didn't do much better than the simple, old-school sailor. In fact, the simple sailor was slightly better at making sure they didn't miss a single outbreak.

5. Why This Matters: The "Early Warning"

The most important part of this study isn't just that the computers worked; it's how they worked.

Instead of just predicting "There will be 5,000 sick people next week" (which is hard to act on), these models predict: "Turn the Red Light on now."

This is a game-changer for hospitals and communities:

If the light turns Red early: Hospitals can call in extra nurses before the ER gets crowded.
If the light turns Red early: Schools can prepare for closures.
If the light turns Red early: Public health officials can tell people, "Get your flu shot now, don't wait."

The Bottom Line

This study shows that we don't need super-complex, expensive AI to predict flu outbreaks. We can use simple, transparent math on the data we already have (the weekly reports of sick people) to build a highly accurate early-warning system.

It's like realizing you don't need a $10,000 satellite to know it's going to rain; sometimes, a simple barometer (the old-school model) works just as well, if not better, at telling you when to grab your umbrella.

In short: We now have a reliable, easy-to-use "flu radar" that can help us prepare for the storm before it hits, saving lives and keeping hospitals from getting overwhelmed.

1. Problem Statement

Seasonal influenza poses a significant public health burden, causing millions of illnesses and substantial strain on healthcare systems annually. While the CDC's Outpatient Influenza-like Illness Surveillance Network (ILINet) provides critical data, traditional surveillance is fundamentally retrospective. Current methods often identify high-activity periods only after transmission has accelerated, limiting the time available for hospital surge planning, antiviral distribution, and public health messaging.

Existing forecasting literature often focuses on continuous prediction (e.g., predicting exact ILI percentages or peak timing) using metrics like Mean Absolute Error (MAE). However, these metrics do not directly address the operational decision-making needs of public health officials, who require a clear binary alert (Outbreak vs. No Outbreak) to trigger specific interventions. There is a gap in head-to-head quantitative evaluations comparing traditional statistical models against modern machine learning approaches specifically framed as a threshold-based early-warning problem using strict temporal validation.

2. Methodology

Data Sources and Study Design

Data: Weekly national ILINet data (percentage of outpatient visits for ILI) and FluView laboratory surveillance data (percent positivity) from 2010 to 2025.
Unit of Analysis: National level (aggregated to reduce noise from local reporting variations).
Temporal Split: To prevent data leakage and simulate real-world deployment, the data was strictly partitioned chronologically:
- Training Period: 2010–2017 (used to define the threshold and train models).
- Validation Period: Intermediate seasons (used for hyperparameter tuning).
- Test Period: 2020–2025 (held-out data for final performance evaluation).
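
The chronological partition above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the record layout, the function name `chronological_split`, and the choice to split on calendar year (rather than influenza season) are assumptions, as is treating 2018–2019 as the "intermediate" validation years, which the source does not spell out.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: {"year": ..., "week": ..., "ili_percent": ...}
Week = Dict[str, float]


def chronological_split(
    rows: List[Week],
    train_end: int = 2017,
    test_start: int = 2020,
) -> Tuple[List[Week], List[Week], List[Week]]:
    """Partition weekly rows by calendar year, with no shuffling."""
    train = [r for r in rows if r["year"] <= train_end]
    # Assumption: everything strictly between the two boundaries is validation.
    valid = [r for r in rows if train_end < r["year"] < test_start]
    test = [r for r in rows if r["year"] >= test_start]
    return train, valid, test


# Toy rows (one per year) purely to show where each year lands.
rows = [{"year": y, "week": 1, "ili_percent": 2.0} for y in range(2010, 2026)]
train, valid, test = chronological_split(rows)
```

Because no row from 2020 onward can reach the training list, anything fitted on `train` (including the outbreak threshold) is computed without seeing the test period.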

Outcome Definition

The study reframed the prediction task from continuous forecasting to binary classification:

Target Variable ( $Outbreak_t$ ): A binary indicator where $1$ represents an outbreak week and $0$ represents a non-outbreak week.
Threshold ( $T$ ): Defined as the 90th percentile of ILIPERCENT calculated only from the training period (2010–2017).
- Calculated Threshold: 3.3932%.
- Rule: $Outbreak_t = 1$ if $ILIPERCENT_t \geq 3.3932\%$ ; otherwise $0$.
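
As a rough sketch of the rule above, the threshold can be computed from training weeks only and then applied unchanged to later weeks. The source does not say which percentile convention was used; linear interpolation between order statistics is assumed here, and the helper names and toy values are hypothetical. Only the 3.3932 threshold in the last line comes from the source.

```python
from typing import List


def percentile(values: List[float], q: float) -> float:
    """q-th percentile (0-100), linear interpolation between order statistics."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def label_outbreaks(ili_percent: List[float], threshold: float) -> List[int]:
    """Outbreak_t = 1 if ILIPERCENT_t >= threshold, else 0."""
    return [1 if x >= threshold else 0 for x in ili_percent]


# Toy training values (not real ILINet data).
train_ili = [1.0, 1.2, 1.5, 1.8, 2.0, 2.4, 2.9, 3.1, 3.6, 4.5, 5.0]
threshold = percentile(train_ili, 90)

# The threshold is frozen after training and reused on later weeks.
test_ili = [1.1, 3.3, 4.5, 6.0]
labels = label_outbreaks(test_ili, threshold)

# Applying the threshold reported in the source (3.3932%) to three toy weeks.
paper_labels = label_outbreaks([3.39, 3.3932, 3.40], 3.3932)
```

Note that the comparison is inclusive (`>=`), matching the stated rule, so a week exactly at the threshold counts as an outbreak week.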

Predictor Variables

Models utilized features available at or before week $t$ :

Autoregressive Features: 1-, 2-, and 3-week lags of ILIPERCENT.
Laboratory Features: 1-, 2-, and 3-week lags of percent positive laboratory specimens.
Seasonality: Harmonic terms (sine and cosine transforms of the week of the year) to capture annual periodicity.
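
The three feature groups above might be assembled as follows. This is a sketch under stated assumptions: the function and feature names are hypothetical, a single sine/cosine pair is used, and the annual period is taken as 52 weeks (the source does not state the period or the number of harmonics).

```python
import math
from typing import Dict, List


def build_features(
    ili: List[float],
    lab_pos: List[float],
    week_of_year: List[int],
    max_lag: int = 3,
    period: float = 52.0,  # assumed annual period in weeks
) -> List[Dict[str, float]]:
    """One feature row per week t >= max_lag, using only values before week t."""
    rows = []
    for t in range(max_lag, len(ili)):
        row: Dict[str, float] = {}
        for k in range(1, max_lag + 1):
            row[f"ili_lag{k}"] = ili[t - k]      # autoregressive lags
            row[f"lab_lag{k}"] = lab_pos[t - k]  # laboratory positivity lags
        angle = 2.0 * math.pi * week_of_year[t] / period
        row["sin_week"] = math.sin(angle)        # harmonic seasonality terms
        row["cos_week"] = math.cos(angle)
        rows.append(row)
    return rows


# Toy series (not real surveillance data).
ili = [1.0, 1.5, 2.0, 2.5, 3.0]
lab = [5.0, 6.0, 7.0, 8.0, 9.0]
weeks = [1, 2, 3, 4, 5]
features = build_features(ili, lab, weeks)
```

The first `max_lag` weeks are dropped because their lags are undefined, and no row contains the ILI or laboratory value of its own week, so the target week's value cannot leak into its predictors.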

Models Evaluated

The study compared traditional statistical baselines with machine learning (ML) approaches:

Logistic Regression: A transparent, parametric baseline for binary classification.
XGBoost (Gradient Boosting): A high-performance tree-based ensemble method.
(Note: While SARIMA, Random Forest, and LSTM were mentioned in the methods as part of the broader design, the primary reported results focus on the comparison between Logistic Regression and XGBoost).

Evaluation Metrics

Primary: Area Under the Receiver Operating Characteristic Curve (AUC-ROC), Precision-Recall AUC (PR-AUC), Sensitivity, Specificity, Precision, and F1-score.
Secondary: Lead time (weeks gained before the threshold is crossed) and continuous forecasting error (MAE, RMSE).
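
The source does not spell out how lead time is computed. One possible reading, sketched below, counts how many consecutive weeks the model was already alerting immediately before each outbreak onset (the first week the actual series crosses the threshold). The function name, this definition, and the toy sequences are all assumptions made for illustration.

```python
from typing import List


def lead_times(actual: List[int], alert: List[int]) -> List[int]:
    """Weeks of continuous alerting directly before each outbreak onset."""
    leads = []
    for t in range(len(actual)):
        is_onset = actual[t] == 1 and (t == 0 or actual[t - 1] == 0)
        if not is_onset:
            continue
        lead = 0
        s = t - 1
        # Walk backwards while the model was alerting ahead of the crossing.
        while s >= 0 and alert[s] == 1 and actual[s] == 0:
            lead += 1
            s -= 1
        leads.append(lead)
    return leads


# Toy sequences: two onsets, at index 3 and index 7.
actual = [0, 0, 0, 1, 1, 0, 0, 1]
alert = [0, 1, 1, 1, 1, 0, 0, 1]
leads = lead_times(actual, alert)
```

Under this definition, the same early alerts that earn lead time are counted as false positives in the week-level metrics, which is worth keeping in mind when reading sensitivity and precision side by side with lead time.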

3. Key Results

The models were evaluated on the temporally held-out test set (2020–2025).

Metric	Logistic Regression	XGBoost
AUC-ROC	0.9964	0.9946
PR-AUC	0.9868	0.9812
Sensitivity	1.0000 (Perfect)	0.8939
Specificity	0.9516	0.9798
Precision	0.8462	0.9219
F1-Score	0.9167	0.9077
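
To make the threshold-dependent rows of the table concrete, the sketch below computes sensitivity, specificity, precision, and F1 from confusion-matrix counts. The source does not report the counts. The ones used here (66 outbreak weeks and 248 non-outbreak weeks) were back-calculated for illustration because they reproduce the reported values to four decimals; they should be read as one consistent possibility, not as the study's actual confusion matrices.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall: share of outbreak weeks caught
    specificity = tn / (tn + fp)   # share of non-outbreak weeks left quiet
    precision = tp / (tp + fp)     # share of alerts that were real outbreaks
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Illustrative counts only (inferred, not reported in the source).
logreg = classification_metrics(tp=66, fp=12, tn=236, fn=0)
xgb = classification_metrics(tp=59, fp=5, tn=243, fn=7)
```

With these counts, the trade-off in the table is easy to read: Logistic Regression would raise 12 false alarms and miss nothing, while XGBoost would raise 5 false alarms and miss 7 outbreak weeks.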

Key Observations:

Near-Perfect Discrimination: Both models achieved AUC-ROC > 0.99 and PR-AUC > 0.98, indicating they can almost perfectly distinguish between outbreak and non-outbreak weeks.
Sensitivity vs. Specificity Trade-off:
- Logistic Regression achieved 100% sensitivity, meaning it successfully identified every outbreak week in the test period (zero false negatives), though with slightly lower specificity (95.16%) and precision (84.62%).
- XGBoost achieved higher specificity (97.98%) and precision (92.19%), resulting in fewer false alarms, but missed about 10.6% of outbreak weeks (sensitivity 89.39%).
Model Complexity: Despite XGBoost's reputation for handling complex non-linear relationships, the simpler Logistic Regression model performed comparably or slightly better in overall discrimination (AUC) and recall.

4. Key Contributions

Operational Framing: The study shifts the focus from "how many cases?" to "is there an outbreak?" By defining the outcome as a binary threshold crossing, the research directly aligns predictive analytics with public health decision-making (alert vs. no alert).
Rigorous Temporal Validation: Unlike many studies that use random cross-validation (which can lead to data leakage in time-series), this study used a strict chronological split (2010–2017 train vs. 2020–2025 test), providing a more realistic assessment of model generalizability to future seasons.
Benchmarking ML vs. Traditional Stats: The results challenge the assumption that complex ML models are always necessary for surveillance. For national-level ILINet data with appropriate feature engineering (lags + seasonality), a simple Logistic Regression model matched or slightly exceeded XGBoost on AUC-ROC, PR-AUC, sensitivity, and F1-score, while XGBoost retained higher specificity and precision.
Reproducibility: The study utilizes publicly available CDC data and provides detailed hyperparameters and code availability statements, ensuring the framework can be replicated and adapted by other health departments.

5. Significance and Implications

Public Health Preparedness: The framework demonstrates that reliable early-warning systems can be built using existing, publicly available data without requiring proprietary or complex infrastructure.
Actionable Alerts: High sensitivity (Logistic Regression) ensures that healthcare systems are not caught off guard by sudden surges, while high specificity (XGBoost) minimizes resource waste from false alarms. Public health officials can choose the model based on their risk tolerance (e.g., prioritizing sensitivity during a pandemic).
Policy Relevance: By linking model output to a clear, reproducible threshold, the study addresses the "so what?" critique in forecasting research. It provides a direct mechanism for triggering surge staffing, vaccination campaigns, and public messaging.
Future Directions: The authors suggest extending this framework to regional/state levels to capture geographic heterogeneity and integrating additional predictors (mobility, climate) to further refine lead times.

Conclusion:
The study concludes that influenza outbreak detection can be implemented with near-perfect discrimination (AUC-ROC above 0.99 for both models) using standard CDC surveillance data. The findings suggest that for national-level early-warning, simpler, interpretable models (Logistic Regression) may be preferable to complex "black box" algorithms due to their superior sensitivity and transparency, provided that rigorous temporal validation and appropriate feature engineering are employed.

Comparative Evaluation of Logistic Regression and Gradient Boosting Models for Influenza Outbreak Early-Warning Using U.S. CDC ILINet Surveillance Data (2010-2025)