A Benchmark of Classical and Deep Learning Models for Agricultural Commodity Price Forecasting on A Novel Bangladeshi Market Price Dataset

Imagine you are a farmer in Bangladesh trying to decide when to sell your garlic or green chilies. If you sell too early, you might lose money; if you wait too long, the price might crash. To make the best decision, you need a crystal ball that can predict the future price of these crops.

This paper is essentially a report card for different "crystal balls" (computer models) trying to predict the prices of five common Bangladeshi crops: garlic, chickpeas, green chilies, cucumbers, and sweet pumpkins.

Here is the story of what they found, explained simply:

1. The Missing Map (The New Dataset)

Before this study, trying to predict these prices was like trying to navigate a city without a map. There was no single, clean list of daily prices for these specific crops available to researchers.

The Fix: The authors built a new map called AgriPriceBD. They used a smart AI assistant (an LLM) to read thousands of old, messy government PDF reports and turn them into a clean, digital list of prices from 2020 to 2025. Now, anyone can use this map to test their own prediction tools.

2. The Race: Old School vs. High-Tech

The researchers pitted seven different prediction methods against each other. Think of them as runners in a race:

The "Lazy" Runner (Naïve Persistence): This model assumes the price tomorrow will be exactly the same as today. It's simple and doesn't try to be smart.
The "Statistical" Runners (SARIMA & Prophet): These are classic tools that look for patterns, seasons, and holidays.
The "Deep Learning" Runners (BiLSTM, two Transformer variants, Informer): These are modern neural-network models that try to learn complex patterns directly from the data.

3. The Shocking Results

🏆 The Winner: Sometimes, "Lazy" is Best

The biggest surprise was that for some crops (most clearly green chilies), the "Lazy" runner was actually the best or tied for the best.

The Analogy: Imagine trying to predict the weather in a place where the weather changes randomly every hour. No matter how complex your supercomputer is, the best guess for "what happens next" is just "it will probably be like it is right now."
The Lesson: For some crops, the price moves so randomly (like a drunk person walking home) that complex AI models can't find a pattern to exploit. They just get confused.

❌ The "Smooth" Tool Failed (Prophet)

The Prophet model is famous for being easy to use and great at predicting things like sales or website traffic.

The Failure: It failed miserably on these crops.
The Analogy: Prophet is like a smoothie blender. It assumes prices change smoothly, like a gentle river. But in Bangladesh's markets, prices are more like a staircase. They stay flat for a week, then suddenly jump up or down because of a policy change or a storm. Prophet tried to draw a smooth curve through these sharp stairs, resulting in a terrible prediction.

⚠️ The "Over-Engineered" Tool Broke (Informer)

The Informer is a powerful, high-tech AI designed for long time series and large datasets (tens of thousands of data points or more).

The Failure: Its forecasts swung wildly. Instead of tracking prices, it produced erratic numbers that amplified the noise in the data.
The Analogy: Imagine giving a Formula 1 race car to a child to drive to the grocery store. The car is too powerful and complex for the small task. The Informer was trying to find deep, hidden patterns in a tiny dataset, and instead of learning, it started "hallucinating" and amplifying noise. It was too big for the job.

🧪 The "Learnable" Time Trick Didn't Help

The researchers tried a special trick called Time2Vec, which lets the AI "learn" how time works (like knowing that December is always cold).

The Result: It didn't help. In fact, for the most volatile crop (green chilies), it increased the forecast error by about 146%.
The Analogy: It's like giving a student a textbook that is too advanced. Instead of helping them learn, the extra complexity just confused them. The simple, fixed way of telling time worked better than the fancy "learning" way.

4. The Green Chili Mystery

Green chilies were the hardest to predict. Their prices jump around wildly due to rain, border closures, and storage issues.

The Finding: Even the smartest AI couldn't predict them well. The best strategy was just to guess the price would stay the same as today.
The Takeaway: To predict green chilies, you don't need a better price-prediction AI; you need outside information (like rainfall data or import numbers). The price history alone isn't enough.

Summary: What Should We Do?

This paper teaches us three main things for developing economies like Bangladesh:

Don't assume bigger is better: The most complex AI models (like Informer) often fail when you don't have enough data.
Know your market: If prices jump like stairs (discrete steps), don't use tools designed for smooth rivers (like Prophet).
Simplicity wins: Sometimes, the simplest guess (today's price = tomorrow's price) is the most accurate because the market is just too noisy to predict.

The authors have shared their data and code for free, hoping that farmers, policymakers, and other researchers can use this "map" to make better decisions and keep food prices stable for everyone.

1. Problem Statement

Accurate short-term forecasting of agricultural commodity prices is vital for food security, policy-making, and income stabilization in developing economies like Bangladesh. However, the field faces two critical gaps:

Data Scarcity: There is no publicly available, daily, multi-commodity retail price benchmark for Bangladesh. Existing research is limited to single commodities (mostly rice) or wholesale data.
Model Applicability: It is unclear whether advanced forecasting models (specifically those designed for smooth time series, like Prophet and large-scale Transformers) can handle the discrete step-function dynamics characteristic of developing-economy retail markets, where prices remain stable for long periods before jumping suddenly due to supply shocks or policy changes.

2. Methodology

A. Dataset Construction: AgriPriceBD

The authors introduced AgriPriceBD, a novel benchmark dataset with the following specifications:

Scope: Daily retail mid-prices for five key commodities: Garlic, Chickpea, Green Chilli, Cucumber, and Sweet Pumpkin.
Duration: July 2020 to June 2025 (1,779 daily observations per commodity).
Extraction Pipeline: Since no structured API exists, the team developed an LLM-assisted pipeline using the Gemini API to parse government PDF reports.
- Process: Systematic PDF download $\rightarrow$ Bilingual (English/Bangla) prompt engineering for JSON extraction $\rightarrow$ Validation against price constraints (0.1–500 BDT/kg) $\rightarrow$ Mid-price computation ( $p_t = \frac{p_t^{\min} + p_t^{\max}}{2}$ ), where $p_t^{\min}$ and $p_t^{\max}$ are the reported minimum and maximum prices on day $t$.
Data Characteristics: The dataset exhibits heterogeneous stationarity. Garlic and Chickpea are non-stationary (trending), while Green Chilli, Cucumber, and Sweet Pumpkin are stationary. Cross-commodity correlations are generally low, supporting univariate modeling.
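
The validation and mid-price steps of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the sample records, and the extra check that the minimum does not exceed the maximum are assumptions; only the 0.1–500 BDT/kg bounds and the mid-price formula come from the description above.

```python
from typing import Optional

# Bounds quoted in the text (BDT/kg); the constant names are illustrative.
MIN_PRICE_BDT = 0.1
MAX_PRICE_BDT = 500.0


def mid_price(min_price: float, max_price: float) -> Optional[float]:
    """Return (min + max) / 2, or None if the record fails validation."""
    for value in (min_price, max_price):
        if not (MIN_PRICE_BDT <= value <= MAX_PRICE_BDT):
            return None  # outside the plausible price range
    if min_price > max_price:
        return None  # assumed extra check: the range must be ordered
    return (min_price + max_price) / 2.0


# Hypothetical extracted records: (commodity, min, max) in BDT/kg.
records = [
    ("Garlic", 120.0, 140.0),
    ("Green Chilli", 80.0, 900.0),  # max exceeds 500, so it is rejected
    ("Cucumber", 40.0, 50.0),
]
mid_prices = {name: mid_price(low, high) for name, low, high in records}
```

Rejected records come back as `None` rather than being silently clipped, so a downstream step can decide whether to drop or re-extract them.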

B. Experimental Design

Split: 80% Training (1,423 days), 10% Validation, 10% Test (May–June 2025). Strict temporal ordering was maintained to prevent look-ahead bias.
Input/Output: 90-day sliding window inputs predicting a 14-day horizon.
Models Evaluated:
- Classical: Naïve Persistence, SARIMA, Prophet.
- Deep Learning: BiLSTM, Vanilla Transformer, Time2Vec-enhanced Transformer (T2V-Transformer), and Informer (tested but excluded from main comparison due to failure).
Statistical Testing: The Diebold-Mariano (DM) test with Harvey-Leybourne-Newbold (HLN) correction was used to determine if performance differences between models were statistically significant.
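
A minimal sketch of the chronological split and the 90-day-in, 14-day-out windowing described above. The helper names and the placeholder series are assumptions for illustration, and the exact boundary handling in the authors' code may differ.

```python
import numpy as np


def temporal_split(series, train_frac=0.8, val_frac=0.1):
    """Split a 1-D series into train/val/test in time order (no shuffling)."""
    n = len(series)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return series[:train_end], series[train_end:val_end], series[val_end:]


def make_windows(series, input_len=90, horizon=14):
    """Return (X, y): each X row holds input_len days, each y row the next horizon days."""
    series = np.asarray(series, dtype=float)
    n_windows = len(series) - input_len - horizon + 1
    if n_windows <= 0:
        return np.empty((0, input_len)), np.empty((0, horizon))
    X = np.stack([series[i:i + input_len] for i in range(n_windows)])
    y = np.stack(
        [series[i + input_len:i + input_len + horizon] for i in range(n_windows)]
    )
    return X, y


# Placeholder series with the same length as one commodity (1,779 days).
prices = np.arange(1779, dtype=float)
train, val, test = temporal_split(prices)
X_train, y_train = make_windows(train)
```

Because the split happens before windowing, no training window can contain a day from the validation or test period, which is what keeps look-ahead bias out.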

3. Key Contributions

AgriPriceBD Dataset: The first publicly available daily multi-commodity retail price dataset for Bangladesh, released with full code for reproducibility.
Systematic Benchmarking: A rigorous comparison of seven forecasting approaches, explicitly documenting failure modes of models not previously tested in this context (Prophet and Informer).
Temporal Encoding Ablation: A controlled study comparing fixed sinusoidal positional encoding against learnable Time2Vec embeddings using statistical significance testing.
Negative Results as Insights: Explicitly identifying scenarios where complex models fail (e.g., Prophet's inability to handle step functions, Informer's instability on small data) provides practical guidance for practitioners.
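
The two temporal encodings compared in the ablation can be sketched side by side. This follows the standard published definitions of sinusoidal positional encoding and Time2Vec; the dimensions, the random parameter values, and the function names are assumptions, and in a real model `omega` and `phi` would be trained rather than sampled.

```python
import numpy as np


def sinusoidal_encoding(positions, d_model):
    """Fixed encoding: sin on even dims, cos on odd dims (d_model assumed even)."""
    positions = np.asarray(positions, dtype=float)[:, None]   # (T, 1)
    i = np.arange(0, d_model, 2, dtype=float)[None, :]        # (1, d_model / 2)
    angles = positions / np.power(10000.0, i / d_model)
    enc = np.zeros((positions.shape[0], d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc


def time2vec(tau, omega, phi):
    """Time2Vec: one linear component plus sinusoids with learnable omega, phi."""
    tau = np.asarray(tau, dtype=float)[:, None]               # (T, 1)
    z = tau * omega[None, :] + phi[None, :]                   # (T, k + 1)
    out = np.sin(z)
    out[:, 0] = z[:, 0]  # index 0 stays linear, the rest are periodic
    return out


rng = np.random.default_rng(0)
t = np.arange(90)                      # one 90-day input window
fixed = sinusoidal_encoding(t, d_model=8)
omega = rng.normal(size=8)             # stand-ins for trained frequencies
phi = rng.normal(size=8)               # stand-ins for trained phase shifts
learned = time2vec(t, omega, phi)
```

The fixed encoding has no trainable parameters, while Time2Vec adds a frequency and a phase per dimension. Those extra degrees of freedom are what can latch onto spurious periodicities when training data is scarce.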

4. Key Results

A. Forecasting Performance

Heterogeneous Forecastability: No single model dominates all commodities. Forecastability depends on the signal-to-noise structure (Residual-to-Seasonal ratio) of the specific commodity.
Naïve Persistence: Dominates on commodities with near-random-walk behavior (e.g., Green Chilli), outperforming complex models.
BiLSTM: Achieved the best deep learning performance overall, particularly on non-stationary commodities (Garlic, Chickpea). It was the only DL model to show a statistically significant improvement over Naïve Persistence for Garlic and Cucumber.
Prophet Failure: Prophet failed systematically across all commodities (high MAPE, e.g., 74.6% for Sweet Pumpkin). The authors attribute this to the model's assumption of smooth, continuous trends, which contradicts the discrete "step-function" price jumps common in Bangladeshi markets.
Informer Failure: The Informer architecture produced erratic, noise-amplifying predictions (prediction variance up to 4,987% of the ground-truth variance for Chickpea). The sparse-attention mechanism, designed for large datasets (10,000+ observations), failed to learn coherent patterns on the small agricultural dataset (approximately 1,400 windows).
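
For reference, the Naïve Persistence baseline and the two error metrics quoted in this summary (MAE and MAPE) can be written in a few lines. The step-shaped toy series is a hypothetical example, not data from AgriPriceBD.

```python
import numpy as np


def naive_persistence(last_price, horizon=14):
    """Forecast every future day as the last observed price."""
    return np.full(horizon, float(last_price))


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def mape(y_true, y_pred):
    """Mean absolute percentage error (assumes no zero prices)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


# Hypothetical step-like prices: flat for a week, then a sudden jump.
history = np.array([100.0] * 10)
future = np.array([100.0] * 7 + [120.0] * 7)
forecast = naive_persistence(history[-1], horizon=len(future))
baseline_mae = mae(future, forecast)
baseline_mape = mape(future, forecast)
```

On a step-shaped series the baseline is exactly right until the jump and wrong by the jump size afterwards, so any model that fails to beat it is adding error rather than information.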

B. Temporal Encoding Ablation (Vanilla vs. T2V-Transformer)

Learnable vs. Fixed: Contrary to expectations, learnable Time2Vec embeddings provided no statistically significant advantage over fixed sinusoidal encoding.
Catastrophic Degradation: The T2V-Transformer significantly degraded performance on four of five commodities.
- Green Chilli: MAE increased by 146.1% ( $p < 0.001$ ).
- Chickpea: MAE increased by 253.3%.
Conclusion: At this training scale, learnable temporal parameters overfit to noise, discovering spurious periodicities that do not generalize.
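
The significance claims above rest on the Diebold-Mariano test with the Harvey-Leybourne-Newbold small-sample correction. The sketch below computes the corrected statistic under absolute-error loss, which is an assumption chosen to match the MAE figures; the function name and synthetic errors are illustrative, and the authors' implementation may use a different loss or variance estimator.

```python
import numpy as np


def dm_hln_statistic(errors_a, errors_b, horizon=1):
    """Return (HLN-corrected DM statistic, degrees of freedom) for absolute-error loss."""
    e_a = np.asarray(errors_a, dtype=float)
    e_b = np.asarray(errors_b, dtype=float)
    d = np.abs(e_a) - np.abs(e_b)        # loss differential per forecast
    T = len(d)
    d_mean = d.mean()
    centered = d - d_mean
    # Long-run variance: autocovariances up to lag (horizon - 1).
    long_run_var = np.sum(centered ** 2) / T
    for k in range(1, horizon):
        long_run_var += 2.0 * np.sum(centered[k:] * centered[:-k]) / T
    dm = d_mean / np.sqrt(long_run_var / T)
    # Harvey-Leybourne-Newbold small-sample correction factor.
    hln = np.sqrt((T + 1 - 2 * horizon + horizon * (horizon - 1) / T) / T)
    return float(dm * hln), T - 1


# Synthetic forecast errors: model A is clearly more accurate than model B.
rng = np.random.default_rng(0)
errors_a = rng.normal(0.0, 1.0, size=200)
errors_b = rng.normal(0.0, 3.0, size=200)
stat, dof = dm_hln_statistic(errors_a, errors_b, horizon=1)
```

A negative statistic means model A has the lower average loss. The p-value comes from a Student t distribution with `dof` degrees of freedom. For multi-step horizons, the unweighted autocovariance sum can turn negative in small samples, in which case the statistic is undefined and a different variance estimator is needed.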

5. Significance and Implications

Practitioner Guidance: The study warns against blindly applying large-scale Transformer architectures (like Informer) or complex temporal encodings (Time2Vec) to small-sample agricultural datasets. Simpler recurrent models (BiLSTM) or even Naïve baselines are often superior.
Model Selection: The Residual-to-Seasonal (R/S) ratio from STL decomposition is proposed as a practical prior for model selection. High R/S ratios indicate noise-dominated series where complex models are unlikely to succeed.
Policy Impact: The findings highlight that standard decomposition tools (like Prophet) are unsuitable for developing-economy markets with administered or infrequently updated prices. Policymakers and researchers must account for discrete price jumps and consider exogenous features (weather, import volumes) rather than relying solely on historical price data.
Reproducibility: By releasing the dataset and code, the authors establish a reproducible baseline for future research in South Asian agricultural economics and time-series forecasting.
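
To illustrate the idea behind the R/S ratio, the sketch below computes a residual-to-seasonal variance ratio. Two assumptions are worth stating plainly: it uses a simple moving-average additive decomposition as a stand-in for STL, and it defines the ratio as Var(residual) / Var(seasonal), which the summary does not specify. The weekly period and the synthetic series are also illustrative.

```python
import numpy as np


def residual_to_seasonal_ratio(series, period=7):
    """Var(residual) / Var(seasonal) from a simple additive decomposition (not STL)."""
    y = np.asarray(series, dtype=float)
    # Trend: centered moving average over one period (odd period assumed).
    kernel = np.ones(period) / period
    trend = np.convolve(y, kernel, mode="valid")
    offset = (period - 1) // 2
    detrended = y[offset:offset + len(trend)] - trend
    # Seasonal: mean of the detrended values at each phase of the cycle.
    phase = (np.arange(len(detrended)) + offset) % period
    seasonal_means = np.array(
        [detrended[phase == p].mean() for p in range(period)]
    )
    seasonal = seasonal_means[phase]
    resid = detrended - seasonal
    return float(np.var(resid) / np.var(seasonal))


# Synthetic series: same weekly pattern, different noise levels.
rng = np.random.default_rng(0)
t = np.arange(700)
weekly = 5.0 * np.sin(2.0 * np.pi * t / 7.0)
ratio_clean = residual_to_seasonal_ratio(100.0 + weekly + rng.normal(0.0, 0.5, 700))
ratio_noisy = residual_to_seasonal_ratio(100.0 + weekly + rng.normal(0.0, 10.0, 700))
```

A ratio well above 1 indicates that unexplained variation swamps the repeating pattern, which is the regime where the summary reports that complex models are unlikely to beat a naive baseline.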

Summary of Failure Modes Identified

| Model | Failure Mode | Root Cause |
| --- | --- | --- |
| Prophet | Systematic directional bias | Assumes smooth trends; cannot model discrete step-function jumps. |
| Informer | Erratic, high-variance oscillation | Sparse attention and distilling layers require large datasets (more than 10,000 observations); fails on small samples. |
| T2V-Transformer | Catastrophic performance drop | Learnable temporal embeddings overfit to noise in small datasets; fixed encoding is more robust. |