Beyond Accuracy: Evaluating Forecasting Models by Multi-Echelon Inventory Cost

Imagine you are running a chain of lemonade stands. Your biggest headache? Guessing how many cups of lemonade you need to make tomorrow.

If you make too little, customers leave thirsty (you lose money and reputation). If you make too much, the lemonade goes sour and you have to throw it away (you lose money on waste). This is the classic "Newsvendor Problem."

For decades, business owners have used simple rules of thumb or basic math to make these guesses. But today, we have powerful computers and Artificial Intelligence (AI) that can look at patterns we can't see.

This paper asks a simple but crucial question: "Does using fancy AI to predict demand actually save us money, or is it just a cool trick that looks good on a report card?"

Here is what the researchers found, explained simply.

1. The Setup: The "Lemonade" Experiment

The researchers didn't just guess; they used a massive, real-world dataset from Walmart (the M5 dataset). They focused on one specific section: Food items in California.

They set up a digital simulation (a video game version of a supply chain) with two levels:

Level 1 (The Store): The lemonade stand itself.
Level 2 (The Warehouse): A big central kitchen that supplies all the stands.

They tested seven different "guessing machines" to see which one made the best predictions:

The Old Schoolers: Simple rules like "Today's sales = yesterday's sales" (Naive) and smoothing methods that blend recent sales with the usual trend and weekly rhythm (Holt-Winters).
The Smart Statisticians: Complex math models like ARIMA.
The Machine Learning Pros: Advanced algorithms like Gradient Boosting and XGBoost (which combine many small decision trees, each one correcting the mistakes of the ones before it).
The Deep Learning Giants: AI models loosely inspired by the human brain, specifically LSTM (which carries a memory of earlier days forward) and Temporal CNN (which scans the sales history for patterns, like a camera scanning a video).

2. The Game: How They Measured Success

Usually, scientists measure success by "Accuracy" (how close the guess was to the actual number). But the researchers said, "That's not enough!"

They cared about Money.

The Cost of Being Wrong:
- Overage Cost: Making too much lemonade (waste).
- Underage Cost: Not having enough lemonade (angry customers).
The Goal: Find the model that keeps the total bill (waste + lost sales) the lowest.

3. The Results: The AI Wins the Race

The results were clear. The "Deep Learning" models (the AI giants) were the champions.

The Winner: The Temporal CNN model. It was like a super-athlete that could see patterns in the data that the others missed. It reduced inventory costs by nearly 19% compared to the simple "guess yesterday's sales" method.
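The Runner-Up: The LSTM model. Interestingly, its raw accuracy was slightly better than the winner's, yet its cost was slightly higher, which is exactly why the researchers measured money and not just accuracy.
The Losers: The old-school statistical models (ARIMA, Holt-Winters) fell behind. Using them was like navigating a stormy ocean with a paper map; they could not keep up with sudden changes in customer behavior.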

The Analogy:
Imagine the old models are like a weatherman who only looks out the window. If it's sunny today, he says it will be sunny tomorrow.
The AI models are like a weatherman with a satellite, radar, and a supercomputer. They can see a storm coming three days away, even if the sky is currently blue. Because they see the storm coming, they can prepare (stock up on umbrellas) and avoid getting soaked.

4. The Twist: The "Bullwhip Effect"

The researchers also tested what happens when you add a Warehouse (Level 2) into the mix.

The Problem: The Warehouse has to guess the combined needs of all its stands. If that guess is too low, the Warehouse runs short and every stand receives less lemonade than it asked for.
The Finding: Mistakes at the top (the Warehouse) spread to the bottom (the Stores) and push up costs and missed sales across the whole chain. Getting each stand's guess right is necessary but not enough; the Warehouse total has to be right too.

5. Why This Matters to You

This paper provides evidence that better predictions can mean more money in the pocket.

For Businesses: Don't just buy expensive software because it sounds "smart." Buy it because it actually lowers your costs and keeps customers happy. The AI models didn't just predict better; they saved real dollars by reducing waste and stockouts.
For the Future: As supply chains get more complex (think global shipping, sudden pandemics, or viral trends), relying on old math alone may not be enough. AI that can pick up the rhythm of the market can help keep shelves stocked and costs low.

The Bottom Line

Think of this study as a taste test for business strategies.
They took a bunch of different "recipes" for predicting the future. The result? The fancy, high-tech AI recipes tasted the best and saved the most money. It turns out, in the chaotic world of retail, a little bit of artificial intelligence goes a long way toward keeping your business from going sour.

1. Problem Statement

Modern supply chains face increasing volatility, making demand forecasting a critical component of inventory optimization. However, a significant gap exists in current literature:

Metric Misalignment: Most forecasting studies evaluate models solely on statistical error metrics (e.g., RMSE, MAE), which do not translate directly into operational business outcomes like inventory costs or service levels.
Single-Echelon Limitation: Many studies focus on single-echelon (store-level) forecasting, ignoring the "bullwhip effect," in which forecast errors propagate through multi-echelon networks (Distribution Centers and Stores) and amplify costs.
Lack of Unified Comparison: There is a scarcity of empirical studies that compare classical statistical methods, machine learning (ML) ensembles, and deep learning (DL) architectures under a single protocol that links predictive accuracy to downstream multi-tier inventory performance.
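
The metric-misalignment point can be made concrete with a toy case. The sketch below is illustrative only: the demand and forecast numbers are made up, and the function names are hypothetical. It shows two forecasts with identical RMSE whose inventory costs differ by a factor of five once shortages cost more than leftovers ( $h=1$, $b=5$ ).

```python
import math

def rmse(actual, forecast):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))

def avg_newsvendor_cost(actual, forecast, h=1.0, b=5.0):
    """Average daily cost when each forecast is used directly as the order quantity."""
    total = 0.0
    for demand, order in zip(actual, forecast):
        total += h * max(0.0, order - demand)   # leftover units
        total += b * max(0.0, demand - order)   # unmet demand
    return total / len(actual)

# Made-up data: one forecast is always 2 units high, the other 2 units low.
demand = [10, 10, 10, 10]
over_forecast = [12, 12, 12, 12]
under_forecast = [8, 8, 8, 8]

rmse_over = rmse(demand, over_forecast)
rmse_under = rmse(demand, under_forecast)
cost_over = avg_newsvendor_cost(demand, over_forecast)
cost_under = avg_newsvendor_cost(demand, under_forecast)

print(rmse_over, rmse_under)   # same error
print(cost_over, cost_under)   # very different cost
```

Both forecasts miss by the same amount, so a symmetric metric cannot tell them apart; the cost function can, because it prices the direction of the error.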

2. Methodology

The authors propose a digitalized forecasting–inventory optimization pipeline that integrates diverse modeling approaches into a unified simulation framework.

A. Dataset and Preprocessing

Source: The M5 Walmart Forecasting dataset (daily sales).
Subset: Focused on CA FOODS 1 (California, Food Department 1) for controlled benchmarking.
Features: Engineered predictors include lags ( $t-1, t-7, t-14, t-28$ ), rolling means (7/14/28 days), and calendar/event indicators (including SNAP benefits).
Validation Strategy: A rolling holdout approach was used: 28 days for testing, 28 days for validation, and the remainder for training.
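
As a rough illustration of the feature and split steps above, the sketch below builds the listed lags and rolling means with pandas and sets aside the last 56 days. The column names (`id`, `date`, `sales`), the one-day shift applied before each rolling mean, and the choice of the final 28 days as the test window are assumptions made for the example, not details confirmed by the paper. Calendar, event, and SNAP indicators are omitted.

```python
import numpy as np
import pandas as pd

def add_features(df):
    """Add per-series lag and rolling-mean predictors.

    Expects columns 'id', 'date', 'sales' (hypothetical names).
    """
    df = df.sort_values(["id", "date"]).copy()
    grouped = df.groupby("id")["sales"]
    for lag in (1, 7, 14, 28):
        df[f"lag_{lag}"] = grouped.shift(lag)
    for window in (7, 14, 28):
        # Assumption: shift by one day so the window never sees the target day.
        df[f"roll_mean_{window}"] = grouped.transform(
            lambda s, w=window: s.shift(1).rolling(w).mean()
        )
    return df

def split_by_date(df, test_days=28, val_days=28):
    """Chronological split: last block for test, the block before it for validation."""
    dates = sorted(df["date"].unique())
    test_start = dates[-test_days]
    val_start = dates[-(test_days + val_days)]
    train = df[df["date"] < val_start]
    val = df[(df["date"] >= val_start) & (df["date"] < test_start)]
    test = df[df["date"] >= test_start]
    return train, val, test

# Synthetic demo data: two series, 100 days each.
days = pd.date_range("2020-01-01", periods=100, freq="D")
demo = pd.DataFrame({
    "id": ["item_a"] * 100 + ["item_b"] * 100,
    "date": list(days) * 2,
    "sales": np.arange(200) % 7,
})
features = add_features(demo)
train, val, test = split_by_date(features)
print(train["date"].nunique(), val["date"].nunique(), test["date"].nunique())
```

Grouping by series before shifting keeps one item's history from leaking into another item's lags.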

B. Forecasting Models Evaluated

Seven distinct model classes were implemented and compared:

Naive (Lag-1): Persistence baseline.
Holt–Winters Exponential Smoothing: Captures level, trend, and weekly seasonality.
ARIMA(1,1,1): Linear time-series model for short-range autocorrelation.
Gradient Boosting Regressor (GBR): Ensemble of regression trees.
XGBoost: Regularized gradient-boosted trees optimized for scalability.
LSTM (Global): Recurrent neural network learning temporal representations across series.
Temporal CNN: Causal dilated-convolution model for efficient long-context modeling.
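
To make "causal dilated convolution" less abstract, here is a minimal NumPy sketch of a single such layer. It is a generic illustration with hand-set weights, not the paper's architecture; the function names and the example dilation schedule are assumptions.

```python
import numpy as np

def causal_dilated_conv1d(x, weights, dilation=1):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Values before the start of the series are treated as zero, so the
    output at time t never depends on inputs after time t.
    """
    x = np.asarray(x, dtype=float)
    pad = (len(weights) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k, w in enumerate(weights):
        start = pad - k * dilation          # tap k looks k * dilation steps back
        y += w * padded[start:start + len(x)]
    return y

def receptive_field(kernel_size, dilations):
    """Number of past time steps visible after stacking the given layers."""
    return 1 + (kernel_size - 1) * sum(dilations)

series = [1, 2, 3, 4, 5]
y = causal_dilated_conv1d(series, weights=[1.0, 1.0], dilation=2)
print(y)  # each output is x[t] + x[t-2]

# Changing the last input must not change any earlier output (causality).
y_changed = causal_dilated_conv1d([1, 2, 3, 4, 99], weights=[1.0, 1.0], dilation=2)

# Doubling the dilation per layer grows the context quickly.
rf = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
print(rf)
```

Stacking layers with growing dilation is what lets this family of models cover a long history with few parameters, which is the "efficient long-context modeling" noted above.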

C. Operational Evaluation Framework

Instead of stopping at error metrics, the study embeds forecasts into two simulation environments:

Single-Echelon Newsvendor Simulator:
- Maps point forecasts directly to order quantities ( $Q = \max(0, \hat{D})$ ).
- Calculates costs based on overage (holding cost $h$ ) and underage (shortage cost $b$ ).
- KPIs: Average daily cost and demand-weighted Fill Rate (FR).
Two-Echelon Extension (DC–Store):
- Simulates a Distribution Center (DC) supplying multiple stores.
- DC demand is the aggregate of store forecasts.
- Includes a proportional allocation mechanism for fulfillment when DC inventory is insufficient.
- Measures network-wide cost and fill rate to assess error propagation.
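
The two simulators can be sketched in a few lines of Python. This is a simplified reading of the description above, under assumptions the summary does not confirm: no inventory is carried between days, there are no lead times, the DC's available stock is passed in as a number, and DC-level holding costs are ignored. All function names are hypothetical.

```python
def single_echelon(forecasts, demands, h=1.0, b=5.0):
    """Return (average daily cost, demand-weighted fill rate) for one store."""
    total_cost = 0.0
    total_sold = 0.0
    for forecast, demand in zip(forecasts, demands):
        order = max(0.0, forecast)                 # Q = max(0, forecast)
        total_cost += h * max(0.0, order - demand) # overage
        total_cost += b * max(0.0, demand - order) # underage
        total_sold += min(order, demand)
    total_demand = sum(demands)
    fill_rate = total_sold / total_demand if total_demand > 0 else 1.0
    return total_cost / len(demands), fill_rate

def allocate_proportionally(requests, dc_inventory):
    """Scale every store's request by the same factor when the DC runs short."""
    total = sum(requests)
    if total == 0 or total <= dc_inventory:
        return list(requests)
    scale = dc_inventory / total
    return [r * scale for r in requests]

def two_echelon_day(store_forecasts, store_demands, dc_inventory, h=1.0, b=5.0):
    """One day of a DC supplying several stores; returns (network cost, fill rate)."""
    requests = [max(0.0, f) for f in store_forecasts]
    shipped = allocate_proportionally(requests, dc_inventory)
    cost = 0.0
    sold = 0.0
    for received, demand in zip(shipped, store_demands):
        cost += h * max(0.0, received - demand) + b * max(0.0, demand - received)
        sold += min(received, demand)
    total_demand = sum(store_demands)
    return cost, (sold / total_demand if total_demand > 0 else 1.0)

# Made-up numbers for illustration.
avg_cost, fr = single_echelon([3, 5, 0], [4, 4, 2])
short_cost, short_fr = two_echelon_day([6, 4], [5, 5], dc_inventory=5)
full_cost, full_fr = two_echelon_day([6, 4], [5, 5], dc_inventory=10)
print(avg_cost, fr)
print(short_cost, short_fr, full_cost, full_fr)
```

In the toy two-store case, the same store forecasts cost far more when the DC holds only half of what the stores request, which is the error-propagation effect the network-wide KPIs are meant to capture.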

D. Sensitivity Analysis

The study tests model robustness across varying shortage penalties ( $b \in \{2, 5, 10\}$ ) with the holding cost fixed at $h=1$, simulating different risk profiles.
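
A sweep of this kind might look like the sketch below, which re-scores one fixed set of forecasts under each penalty. The forecast and demand values are made up and the function names are hypothetical. The `critical_ratio` helper reports the textbook newsvendor service target $b/(b+h)$ for context; the summary does not say the paper uses it, since orders are set from point forecasts.

```python
def avg_cost(forecasts, demands, h, b):
    """Average daily newsvendor cost with the forecast used as the order quantity."""
    total = 0.0
    for forecast, demand in zip(forecasts, demands):
        order = max(0.0, forecast)
        total += h * max(0.0, order - demand) + b * max(0.0, demand - order)
    return total / len(demands)

def critical_ratio(b, h):
    """Textbook newsvendor target quantile for the given cost pair."""
    return b / (b + h)

# Made-up forecasts and demand for illustration.
forecasts = [3, 5, 2, 4]
demands = [4, 4, 3, 4]

costs = {}
for b in (2, 5, 10):
    costs[b] = avg_cost(forecasts, demands, h=1.0, b=b)
    print(b, costs[b], round(critical_ratio(b, 1.0), 3))
```

Because the orders do not change with $b$, the cost grows linearly in the penalty whenever the forecast runs short, so under-forecasting models are hit hardest as $b$ rises.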

3. Key Contributions

Unified Pipeline: Developed a standardized framework integrating statistical, ML, and DL models to evaluate them against identical operational constraints.
Operational Translation: Quantified how predictive accuracy translates into tangible business metrics (cost reduction and fill rate improvement) rather than just statistical error.
Multi-Echelon Insight: Extended the evaluation to a two-tier system, demonstrating how forecast errors at the DC level disproportionately impact downstream store performance.
Robustness Analysis: Provided sensitivity analysis showing how different models perform as the shortage penalty rises from 2 to 10 times the holding cost.

4. Key Results

Using the M5 CA FOODS 1 dataset with $h=1$ and $b=5$ :

Forecast Accuracy: Deep learning models (LSTM and Temporal CNN) achieved the lowest RMSE (2.207 and 2.260, respectively), outperforming classical baselines (ARIMA: 2.636; Naive: 2.909).
Inventory Cost Reduction:
- Temporal CNN achieved the lowest average daily cost (3.674), representing an 18.7% reduction compared to the Naive baseline.
- LSTM followed closely with a cost of 3.704 (18.1% reduction).
- Tree ensembles (XGBoost/GBR) also outperformed classical methods but lagged behind DL models.
Fill Rate Improvement:
- Temporal CNN achieved the highest fill rate (0.632), a 9.8 percentage point (pp) improvement over the Naive baseline.
Sensitivity to Shortage Penalties:
- As the shortage penalty ( $b$ ) increased, absolute costs rose for all models.
- However, the ranking remained stable: Deep learning models consistently maintained the lowest costs across all penalty levels, indicating superior robustness to asymmetric service penalties.
Multi-Echelon Findings:
- Errors in DC-level demand aggregation propagated downstream, amplifying costs and service degradation across the network. This highlights that improving store-level forecasting alone is insufficient without accurate upstream aggregation.
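
The reported cost and fill-rate figures can be sanity-checked with simple arithmetic. The sketch below back-calculates the Naive baseline implied by each model's cost and percentage reduction; the implied values are derived here for illustration and are not figures quoted from the paper. The function name is hypothetical.

```python
def implied_baseline(model_value, relative_reduction):
    """Baseline value implied by a model's value and its relative reduction."""
    return model_value / (1.0 - relative_reduction)

# Reported: Temporal CNN cost 3.674 (18.7% reduction), LSTM cost 3.704 (18.1% reduction).
naive_cost_from_tcn = implied_baseline(3.674, 0.187)
naive_cost_from_lstm = implied_baseline(3.704, 0.181)

# Reported: Temporal CNN fill rate 0.632, a 9.8 pp improvement over Naive.
naive_fill_rate = 0.632 - 0.098

print(round(naive_cost_from_tcn, 2), round(naive_cost_from_lstm, 2))
print(round(naive_fill_rate, 3))
```

Both cost figures point to roughly the same Naive baseline, so the two reported reductions are mutually consistent to rounding.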

5. Significance and Conclusion

Managerial Impact: The study provides a direct economic argument for adopting deep learning in supply chains. It demonstrates that investing in advanced forecasting pipelines yields quantifiable reductions in inventory costs and improvements in service levels, moving beyond "accuracy for accuracy's sake."
Model Selection: Temporal CNN is identified as the most robust model, particularly effective in handling long-range patterns and varying cost ratios.
Future Directions: The authors suggest future work should incorporate probabilistic (quantile) forecasting to better align with service targets, integrate price elasticity and promotion lifts, and explore reinforcement learning for dynamic inventory control in complex multi-echelon networks.

In summary, the paper bridges the gap between data science and operations management, showing that on the M5 CA FOODS 1 benchmark the Deep Learning models (specifically Temporal CNN and LSTM) outperform the traditional statistical and tree-ensemble methods when evaluated by simulated inventory costs and service levels.