Imagine you are the manager of a massive chain of grocery stores. Your biggest headache? Guessing how much soap, shampoo, and toothpaste to order for next month.
If you order too much, the products sit on the shelf gathering dust, costing you money. If you order too little, customers leave empty-handed, and you lose sales. This is the art of sales forecasting.
For a long time, managers used simple, old-school math (like looking at last month's sales and guessing) to make these predictions. But the world has changed. Sales are messy: some days a product sells 100 units, the next day it sells zero, and sometimes the data is just missing because the computer system glitched.
This paper is a race between three types of "predictors" to see who can guess the future sales best in a real-world, messy grocery store environment.
The Three Contenders
- The Old Guard (Statistical Models): Think of these as the "grandparents" of forecasting. They use simple, proven rules (like Exponential Smoothing). They are reliable but often too rigid to handle the chaos of modern retail.
- The Smart Ensembles (Tree-Based Models like XGBoost & LightGBM): Imagine a team of expert detectives. Instead of one person guessing, you have hundreds of them. Each detective looks at a different clue (price, day of the week, local weather, competitor sales) and makes a small guess. Then, they vote on the final answer. They are great at spotting patterns in messy, "tabular" data.
- The Deep Learning Giants (Neural Networks like N-BEATS, N-HiTS, TFT): Think of these as super-intelligent AI students who have read every book in the library. They are designed to find incredibly complex, hidden patterns in massive amounts of data. They are the "cool new kids" in town, often used by giants like Amazon.
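To make the "Old Guard" concrete: simple exponential smoothing forecasts the next period as a weighted blend of the newest observation and the previous forecast, so recent sales count more and older sales fade out exponentially. A minimal sketch (the smoothing factor 0.3 and the sales numbers are illustrative choices, not taken from the paper):

```python
def exponential_smoothing(sales, alpha=0.3):
    """Simple exponential smoothing: each step blends the newest
    observation (weight alpha) with the running forecast (weight 1 - alpha)."""
    forecast = sales[0]  # seed with the first observation
    for y in sales[1:]:
        forecast = alpha * y + (1 - alpha) * forecast
    return forecast

# Weekly soap sales with a zero-demand week mixed in: the forecast
# leans toward recent weeks but never fully forgets older ones.
print(exponential_smoothing([100, 0, 80, 120, 0, 90]))
```

Note how a single zero-sales week drags the forecast down hard: this rigidity on intermittent demand is exactly why the paper calls these models "often too rigid" for modern retail.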
The Race Conditions: A Messy Reality
The researchers didn't test these models in a clean, perfect lab. They tested them on real data from a major retailer in Southeast Europe. The data was a nightmare:
- Intermittent Demand: Some products sell every day; others sell once a month.
- Missing Data: Sometimes the price of a competitor's item is missing.
- Product Turnover: Items appear and disappear constantly.
They also tested two different strategies:
- The "Specialist" Approach: Training a separate model for each group of products (e.g., one model just for toothpaste, another just for soap).
- The "Generalist" Approach: Training one giant model to predict everything at once.
They also tried a "fix-it" strategy: using an AI to fill in the missing data (imputation) before training the models, hoping to clean up the mess.
The Results: Who Won?
The Surprise Winner: The Detective Team (Tree-Based Models)
The "Smart Ensembles" (specifically XGBoost and LightGBM) crushed the competition.
- Why? They are like Swiss Army Knives. They handle messy data, missing values, and weird patterns without breaking a sweat. They don't need a massive library of data to work; they just need to understand the specific clues for each product.
- The Score: XGBoost achieved the lowest forecast error, meaning its predictions were closest to reality.
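"Lowest error" in forecasting is usually measured with a scale-free metric. One common choice for intermittent retail demand is WMAPE (weighted mean absolute percentage error), which, unlike plain MAPE, stays defined when some days sell zero units. The paper's exact metric isn't named here, and the numbers below are invented for illustration:

```python
def wmape(actual, forecast):
    """Weighted MAPE: total absolute error divided by total actual sales.
    Survives zero-sales days, which plain MAPE cannot handle."""
    abs_err = sum(abs(a - f) for a, f in zip(actual, forecast))
    return abs_err / sum(actual)

actual     = [100, 0, 80, 120,  0, 90]   # intermittent demand
forecast_a = [ 95, 5, 70, 110, 10, 85]   # illustrative model A
forecast_b = [ 60, 0, 60,  60,  0, 60]   # illustrative model B

print(wmape(actual, forecast_a))  # ≈ 0.115 -- lower is better
print(wmape(actual, forecast_b))  # ≈ 0.385
```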
The Runner-Up (with a caveat): The Deep Learning Giants
The "Super-Intelligent AI" models did okay, but they struggled.
- The Problem: They are like F1 race cars. They are amazing on a smooth, high-speed track (like Amazon's massive, clean e-commerce data). But on a bumpy, muddy dirt road (a physical grocery store with missing data and sporadic sales), they get stuck.
- The "Fix-It" Experiment: When the researchers used AI to fill in the missing data, the Deep Learning models actually got better at predicting, but they still couldn't beat the Detective Teams. Interestingly, the "fix-it" AI actually made the Detective Teams worse in some cases because the "fixed" data looked too smooth and artificial, confusing the detectives.
Key Takeaways (The "So What?")
- Don't Overcomplicate Things: Just because you have the fanciest, most expensive AI (Deep Learning) doesn't mean it's the best tool for the job. If your data is messy and fragmented (like a physical store), a simpler, robust model (Tree-Based) often wins.
- Specialists Beat Generalists: Training a specific model for each product group worked better than trying to force one giant model to understand everything at once. It's like having a specialist doctor for your heart rather than a general practitioner trying to fix your heart, your knee, and your eyes all at once.
- Garbage In, Garbage Out (Even for AI): Trying to "fix" missing data with complex AI didn't help much. Sometimes, it's better to let the model learn from the messy reality than to feed it a "cleaned" version that doesn't reflect the truth.
The Bottom Line
If you are a brick-and-mortar retailer trying to predict sales, don't go chasing the most complex Deep Learning models. Instead, use the "Detective Team" approach (Gradient Boosting). It's faster, cheaper to run, and in the messy real world of physical stores, it simply predicts better.
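To demystify the "Detective Team": gradient boosting builds its ensemble one member at a time, fitting each new "detective" to the errors the team has made so far. This hand-rolled sketch uses one-feature decision stumps on a hypothetical day-of-week feature — real XGBoost/LightGBM trees are far richer, so treat this only as the additive error-correcting loop in miniature:

```python
def fit_stump(x, residuals):
    """Find the single threshold that best splits the residuals,
    predicting the mean residual on each side (least squares)."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, rounds=20, lr=0.3):
    """Each round, fit a stump to the current residuals and add a
    damped copy of it to the ensemble."""
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * s(xi) for s in stumps)

# Hypothetical data: day-of-week (0-6) vs. units sold, with a weekend spike.
x = [0, 1, 2, 3, 4, 5, 6]
y = [10, 12, 11, 13, 12, 40, 45]
model = boost(x, y)  # weekend predictions end up far above weekdays
```

The first stump captures the big weekday/weekend split; later stumps mop up smaller leftover errors. That "hundreds of small corrections" structure is what makes these models so forgiving of messy tabular data.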
In short: In the world of retail forecasting, a sharp, adaptable detective often beats a super-intelligent robot that's never seen a muddy road.