RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments

Imagine you hire a super-smart, highly educated robot manager to run a small supermarket. You give it a computer brain powered by the latest Artificial Intelligence (AI). Your goal? To see if this robot can run the store successfully for months, making thousands of decisions about what to buy, how much to charge, and when to restock, all while dealing with unpredictable customers and changing news.

This paper, RetailBench, is essentially a "stress test" for these AI managers. Here is the breakdown in simple terms:

1. The Problem: The "Short Attention Span" Robot

Current AI models are like brilliant students who can ace a 10-minute pop quiz. They are great at short tasks, like writing a poem or solving a math problem. But when you ask them to run a business for a whole year, they tend to lose their minds. They forget their original plan, make up facts that aren't true, or panic and make terrible decisions.

The researchers wanted to see: Can an AI keep a coherent strategy over a long time in a messy, real-world situation?

2. The Test: The "Supermarket Simulator"

To test this, they built RetailBench. Think of this as a video game simulation of a grocery store, but with real-world rules:

The Goal: Keep the store open as long as possible without going bankrupt.
The Chaos: Customers come and go randomly. Products expire (like milk). Suppliers change prices. Sometimes there's bad news in the paper that makes people buy less.
The Trap: If the store can't pay its daily rent for 5 days in a row, the game ends (the store goes bankrupt).

They tested 8 different top-tier AI models in this simulator, at difficulty levels ranging from "Easy" (5 types of products, no news) to "Hard" (20 types of products, constant news, and tricky supply chains).

3. The Solution: The "General vs. Soldier"

The researchers noticed that when AI tries to think and act at the same time, it gets confused. So, they invented a new way to organize the AI, called Evolving Strategy & Execution.

They split the AI's brain into two distinct roles:

The General (Strategy Phase): Once a day, the "General" sits in a quiet room, looks at all the data, reviews the past, and writes a Master Plan. It decides the big picture: "Today, we focus on selling soup and lowering prices on bread." Once the plan is written, the General goes to sleep.
The Soldier (Execution Phase): The "Soldier" wakes up and follows the General's orders strictly. It doesn't stop to rethink the plan every time it sees a customer. It just executes the orders: "Buy soup, lower bread price."

The Analogy: Imagine a ship captain (General) who charts the course for the day, and a crew (Soldier) that steers the ship. Without this split, the crew would keep changing the course every time they saw a wave, and the ship would never reach its destination.

4. The Results: Good News, Bad News

The Good News:
The new "General vs. Soldier" method worked! The AI stores lasted longer, made more money, and wasted less food compared to other methods where the AI tried to think and act simultaneously. It proved that separating planning from doing helps AI stay stable.

The Bad News:
Even with the best new method, the AI still failed when the game got too hard.

The "Hallucination" Problem: The AI started making things up. It would try to order "Product #999," which didn't exist, or set the price of milk to $999.
The "Memory" Problem: As the store got bigger (more products), the AI couldn't keep track of everything. It would ignore important data, like customer reviews, and just guess.
The "Drift" Problem: Even with a plan, the AI's daily actions would slowly drift away from the plan, causing chaos.

5. The Conclusion

The paper concludes that while AI is getting smarter, it is still not ready to run a real business on its own.

Current State: AI is like a very smart intern who needs constant supervision. It can handle a simple task, but if you leave it alone in a complex, changing environment for a long time, it will eventually crash the company.
Future Work: We need better ways to stop AI from making up facts (hallucinations) and better systems to help it remember long-term goals without getting overwhelmed by too much information.

In a nutshell: We built a tough test to see if AI can run a store. We found a better way to organize the AI's brain (Plan first, act later), which helped a lot. But the AI still gets confused, makes up facts, and gives up when the job gets too hard. We aren't quite ready to replace human store managers with robots yet.

1. Problem Statement

While Large Language Model (LLM) agents have shown success in short-horizon, highly structured tasks (e.g., code generation, web browsing), they struggle with long-horizon autonomous decision-making in realistic, dynamic environments. Existing benchmarks often fail to capture the complexities of real-world economic systems, which require:

Persistent Objective Alignment: Maintaining goals over extended timeframes (days/weeks).
Stochastic Dynamics: Adapting to random demand, supply chain delays, and external shocks (e.g., news events).
Strategy Stability: Avoiding "goal drift" or oscillating behaviors where the agent's strategy changes erratically between steps.
Multi-Factor Integration: Simultaneously managing pricing, inventory, financial liquidity, and information acquisition.

The paper posits that current LLM agents lack the robustness to operate autonomously in such environments, and that this often leads to economic irrationality, hallucinations, and premature episode termination.

2. Methodology

A. RetailBench: The Benchmark

The authors introduce RetailBench, a high-fidelity simulation of supermarket operations modeled as a Markov Decision Process (MDP).

Environment: A single-store simulation running over a finite horizon (up to 1,000+ days). Episodes terminate if the store fails to pay rent for 5 consecutive days.
State Space ($S$): Composed of interdependent components, including:
1. Product & Inventory: SKU attributes, shelf life, on-hand stock, and historical sales (grounded in the Dominick's dataset).
2. Supply Chain: Supplier prices, quality levels, and lead times.
3. Demand Signals: Customer traffic, reviews, and returns.
4. External Context: Dynamic news events affecting specific categories or the macro market.
5. Financial State: Cash flow, net worth, and inventory depreciation.
Action Space ($A$): Agents can perform:
- Pricing: Adjust prices for individual SKUs.
- Replenishment: Select suppliers and order quantities.
- Information Query: Access sales history, reviews, or news.
- Memory: Write/read persistent notes across days.
- Termination: End the day to trigger state transitions.
Difficulty Levels:
- Easy: 5 categories, static supply/demand, no news.
- Middle: 20 categories, static supply/demand, no news.
- Hard: 20 categories, dynamic news events, time-varying supplier price-quality relationships.
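
The difficulty levels and the termination rule above can be sketched in a few lines of Python. This is only an illustration: the names Difficulty, LEVELS, MAX_UNPAID_DAYS, and days_survived are hypothetical, and the assumption that rent counts as "paid" whenever the day's available cash covers it is ours, not the paper's. Only the category counts, the news and supplier settings, and the 5-day limit come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Difficulty:
    """Hypothetical container for the three settings described above."""
    name: str
    num_categories: int
    dynamic_news: bool
    time_varying_suppliers: bool


LEVELS = {
    "easy": Difficulty("easy", 5, False, False),
    "middle": Difficulty("middle", 20, False, False),
    "hard": Difficulty("hard", 20, True, True),
}

# The episode ends after this many consecutive days of unpaid rent.
MAX_UNPAID_DAYS = 5


def days_survived(daily_cash, rent):
    """Return the day on which the episode terminates.

    daily_cash[i] is the cash available on day i + 1. Assumption: rent is
    paid when cash >= rent, and any paid day resets the unpaid streak.
    If the streak never reaches the limit, the full horizon is returned.
    """
    unpaid_streak = 0
    for day, cash in enumerate(daily_cash, start=1):
        unpaid_streak = 0 if cash >= rent else unpaid_streak + 1
        if unpaid_streak >= MAX_UNPAID_DAYS:
            return day
    return len(daily_cash)
```

Note that four unpaid days followed by one paid day do not end the episode under this reading, because the streak must be consecutive.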

B. Proposed Framework: Evolving Strategy & Execution

To address the instability of standard frameworks (like ReAct or Reflection), the authors propose a two-stage hierarchical framework:

Evolving Strategy Stage (Day-Level):
- The agent analyzes environmental feedback and historical data.
- It explicitly updates a Global Strategy (Macro Strategy + Execution Strategy).
- Constraint: No direct environment modification occurs here; the agent only plans.
Execution Stage (Step-Level):
- The agent executes concrete actions strictly adhering to the fixed strategy defined in the previous stage.
- Constraint: The strategy is immutable during execution to prevent oscillation and ensure temporal consistency.

This design separates strategic deliberation (long-term planning) from operational execution (short-term actions), aiming to reduce error accumulation and strategy drift.
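
A minimal sketch of this two-stage loop follows, assuming the strategy can be represented as two text fields and that the LLM calls are wrapped in two callables. The names Strategy, run_day, plan, act, and the "end_day" sentinel are all hypothetical and are not taken from the paper; the sketch only illustrates the separation of a planning stage that never touches the environment from an execution stage that cannot modify the strategy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """Global strategy; frozen so the execution stage cannot alter it."""
    macro: str
    execution: str


def run_day(observation, history, plan, act, max_steps=20):
    """Run one simulated day in two stages.

    plan(observation, history) -> Strategy   (strategy stage, no actions)
    act(strategy, observation, actions_so_far) -> str   (execution stage)
    """
    # Stage 1: the strategy evolves once per day; nothing is sent to the
    # environment here.
    strategy = plan(observation, history)

    # Stage 2: step-level execution under the fixed strategy.
    actions = []
    for _ in range(max_steps):
        action = act(strategy, observation, actions)
        actions.append(action)
        if action == "end_day":  # hypothetical termination action
            break
    return strategy, actions


# Stand-ins for the LLM calls, used only to show the control flow.
def stub_plan(observation, history):
    return Strategy(macro="protect cash", execution="reorder below 10 units")


def stub_act(strategy, observation, actions_so_far):
    return "end_day" if len(actions_so_far) >= 2 else f"step {len(actions_so_far)}"
```

Freezing the dataclass is one simple way to make the "immutable during execution" constraint explicit in code: any attempt to reassign a field inside the execution loop raises an error instead of silently changing the plan.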

3. Key Contributions

RetailBench Benchmark: A realistic, stochastic retail environment designed specifically to stress-test long-horizon autonomy, featuring complex state transitions and economic constraints.
Evolving Strategy & Execution Framework: A novel agent architecture that decouples strategy evolution from action execution, significantly improving operational stability compared to step-level or day-level reflection baselines.
Comprehensive Evaluation: An extensive study of 8 state-of-the-art LLMs (including GPT-5.2, Kimi-K2, GLM-4.6, and DeepSeek-V3.2) across three difficulty levels.
Failure Mode Analysis: Identification of systematic failure patterns, including:
- Non-scalable Decision Making: Agents fail to expand their decision scope as the environment grows (ignoring many SKUs).
- Incomplete Information Coverage: Agents ignore critical signals like recent reviews or return rates.
- Temporal Instability: High variance in execution strategies between consecutive days.
- Hallucinations & Invalid Actions: Fabricating non-existent SKUs, dates, or issuing economically irrational orders (e.g., negative quantities, extreme prices).
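
To make the last failure mode concrete, the sketch below shows one way such invalid actions could be detected before they reach the environment. It is not the paper's mechanism: the function validate_action, the action dictionary layout, the catalog structure, and the max_price_multiple threshold are all assumptions chosen for illustration.

```python
def validate_action(action, catalog, max_price_multiple=10.0):
    """Return a list of problems; an empty list means the action passes.

    action:  dict with "type" ("order" or "price"), "sku", and either
             "quantity" or "price" (hypothetical layout).
    catalog: dict mapping each existing SKU to {"unit_cost": float}.
    """
    sku = action.get("sku")
    if sku not in catalog:
        # Fabricated SKU: nothing else about the action can be checked.
        return [f"unknown SKU: {sku!r}"]

    problems = []
    kind = action.get("type")
    if kind == "order":
        if action.get("quantity", 0) <= 0:
            problems.append("order quantity must be positive")
    elif kind == "price":
        price = action.get("price", 0.0)
        if price <= 0:
            problems.append("price must be positive")
        elif price > max_price_multiple * catalog[sku]["unit_cost"]:
            problems.append("price is implausibly high relative to unit cost")
    else:
        problems.append(f"unknown action type: {kind!r}")
    return problems
```

For example, with a catalog containing only "milk" at a unit cost of 2.0, a price action of 999.0 for milk is flagged as implausibly high, and any action on SKU "999" is rejected as unknown.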

4. Results

Performance Comparison

Framework Efficacy: The Evolving Strategy & Execution framework consistently outperformed baselines (Reflection, Plan-and-Act) across all models.
- Example (Easy Environment): GPT-5.2 with the proposed framework achieved 81 days of operation vs. 64 days with Day-Level Reflection.
- Metrics: The proposed framework yielded higher average daily sales/income and significantly lower product expiry and return ratios.
Model Capacity: Larger, closed-source models (e.g., GPT-5.2, Grok-4.1) generally outperformed smaller or open-source models, particularly in stability and handling complex contexts.
Gap to Optimality: Despite improvements, all LLM agents performed significantly worse than a hand-crafted heuristic policy (the upper bound), highlighting fundamental limitations in current LLMs for long-horizon tasks.

Impact of Difficulty

As environment complexity increased from Easy to Hard, all models exhibited performance degradation:
- Reduced operational duration (fewer days survived).
- Increased product expiry and return ratios.
- Decline in sales and profit per category, indicating poor resource allocation in high-dimensional spaces.

Analysis of Failure Modes

Scalability: Agents did not proportionally increase the number of SKUs they considered as the environment expanded.
Information Gaps: Agents heavily relied on supplier prices and inventory but consistently underutilized customer reviews and return data, despite these being strong predictors of sales performance.
Instability: Execution strategies showed high temporal variability (low similarity between Day $t$ and Day $t+1$), leading to inconsistent operational behaviors.
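
One simple way to quantify this kind of day-to-day instability is to compare the wording of consecutive strategies. The summary does not state which similarity measure the authors used, so the sketch below substitutes token-level Jaccard similarity as a stand-in; the function names jaccard and consecutive_similarity are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two texts."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strategies are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consecutive_similarity(strategies):
    """Similarity between the strategy on Day t and Day t+1, for every t."""
    return [jaccard(x, y) for x, y in zip(strategies, strategies[1:])]


daily_strategies = ["discount bread", "discount bread", "raise milk price"]
scores = consecutive_similarity(daily_strategies)
```

Here scores is [1.0, 0.0]: the strategy is unchanged between the first two days and shares no words with the third. Persistently low values across an episode would correspond to the instability described above, although a word-overlap measure cannot tell a genuine change of plan from a mere rephrasing.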

5. Significance and Conclusion

The paper demonstrates that while structured agent frameworks can mitigate some instability, current LLMs are not yet robust enough for autonomous, long-horizon economic decision-making in dynamic, multi-factor environments.

Theoretical Insight: The study reveals that the bottleneck is not just model size but also the architecture of decision-making. The separation of strategy and execution is crucial, yet even with this separation, agents struggle with information integration and hallucination control.
Future Directions: The authors suggest that future research should focus on:
- Multi-agent coordination and competitive markets.
- Learning-based adaptation (RL, fine-tuning) rather than just prompting.
- Mechanisms to enforce economic constraints and factual grounding to prevent hallucinations.

RetailBench serves as a principled testbed for advancing research in agentic reasoning, moving beyond short-horizon tasks toward the complex, sustained autonomy required for real-world economic participation.