Sequential Service Region Design with Capacity-Constrained Investment and Spillover Effect

This paper proposes a novel solution framework that combines Real Options Analysis with a Transformer-based Proximal Policy Optimization (TPPO) algorithm to optimize sequential service region expansion under capacity constraints and stochastic spillover effects. The framework demonstrates superior convergence and option value compared to existing deep reinforcement learning methods.

Tingting Chen, Feng Chu, Jiantong Zhang

Published 2026-03-10

Imagine you are the CEO of a massive food delivery company. You want to expand your service to cover an entire country, but you have a limited budget and a small team. You can't open restaurants in every city tomorrow. You have to choose where to open first, when to open them, and how many to open at once.

This paper is about solving that exact puzzle, but with a high-tech twist. It tackles a problem called Sequential Service Region Design. Here is the breakdown in simple terms:

1. The Big Problem: The "Too Many Choices" Trap

Imagine you have 10 cities to conquer. You can open 2 or 3 at a time.

  • If you open City A first, does that make City B more popular? (Maybe people in City A start ordering from City B).
  • If you wait too long to open City C, do you lose customers to a competitor?
  • If you open too many at once, do you run out of money?

The number of possible ways to order these cities is astronomical. It's like trying to find the perfect path through a maze that has billions of branches. If you try to calculate every single path to find the "best" one, your computer would take years to finish.
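To get a feel for how fast this blows up, here is a toy Python count of the possible expansion plans. The "10 cities, 1 to 3 per year" setting is our illustration, not the paper's exact parameters:

```python
from functools import lru_cache
from math import comb

def count_plans(n_cities: int, k_max: int) -> int:
    """Count distinct expansion plans: ordered sequences of yearly
    batches, each batch opening between 1 and k_max cities,
    until all n_cities are open."""
    @lru_cache(maxsize=None)
    def ways(remaining: int) -> int:
        if remaining == 0:
            return 1
        total = 0
        for batch_size in range(1, min(k_max, remaining) + 1):
            # choose which cities go in this year's batch,
            # then plan the rest
            total += comb(remaining, batch_size) * ways(remaining - batch_size)
        return total
    return ways(n_cities)

print(count_plans(10, 3))  # → 96117000
```

Just 10 cities with at most 3 openings per year already gives nearly 100 million distinct plans, and that is before any demand uncertainty enters the picture.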

2. The Two Hidden Rules

The authors added two real-world rules that make the puzzle harder but more realistic:

  • The "K-Region" Limit: You can't just pick one city at a time. You have a rule that says, "You can open at most K cities in a single year." This changes the game from picking a single city to picking a team (or portfolio) of cities every year.
  • The "Spillover" Effect: This is the magic ingredient. When you open a restaurant in one city, it doesn't just help that city. It creates a "ripple effect." Maybe people in a neighboring city start ordering more because they see your brand is growing nearby. The paper treats this ripple effect as a random, exciting surprise that changes the future demand.
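A minimal sketch of what a stochastic spillover update could look like. The regions, lift values, and noise model here are purely illustrative, not the paper's actual demand dynamics:

```python
import random

def update_demand(demand, open_regions, spillover, noise=0.1):
    """One-step demand update: each newly opened region lifts its
    neighbors' demand by a random spillover amount.
    spillover[i][j] is the expected lift region i gives region j;
    all numbers here are illustrative, not from the paper."""
    new_demand = dict(demand)
    for i in open_regions:
        for j, lift in spillover.get(i, {}).items():
            # stochastic spillover: expected lift plus a random shock
            new_demand[j] += lift * (1 + random.uniform(-noise, noise))
    return new_demand

demand = {"A": 100.0, "B": 80.0, "C": 60.0}
spillover = {"A": {"B": 12.0, "C": 5.0}}  # opening A boosts B and C
print(update_demand(demand, open_regions={"A"}, spillover=spillover))
```

The key point: opening region A changes the future demand of B and C by a random amount, so the value of opening B or C next year is uncertain until you actually open A and observe the ripple.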

3. The Solution: A "Time-Traveling" AI Coach

To solve this, the authors built a smart system that combines two powerful ideas:

A. Real Options Analysis (The "Time-Traveler")
In finance, a "Real Option" is like having a coupon that lets you buy something later if the price is right.

  • Imagine you have a coupon to open a restaurant in 2026. You don't have to use it today. You wait and see if the city becomes popular.
  • The paper uses a math method called LSMC (Least Squares Monte Carlo) to calculate the value of waiting. It asks: "Is it worth opening now, or should I wait to see if the 'ripple effect' makes the city more valuable later?"
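Here is a toy version of the LSMC idea for a single region: simulate many futures, regress the discounted future payoff on today's state, and open now only where that beats the estimated value of waiting. The state process, cost, and discount rate are stand-in numbers, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 10,000 paths of a region's "attractiveness" x over one step.
# x could stand for demand inflated by spillover; numbers illustrative.
n_paths = 10_000
x_now = rng.uniform(0.5, 1.5, n_paths)             # today's state
x_next = x_now * rng.lognormal(0.0, 0.2, n_paths)  # next year's state

cost, discount = 1.0, 0.95
payoff_now = np.maximum(x_now - cost, 0.0)               # open today
payoff_next = discount * np.maximum(x_next - cost, 0.0)  # open next year

# Least Squares Monte Carlo: regress the discounted future payoff on a
# polynomial in today's state to estimate the value of waiting.
coeffs = np.polyfit(x_now, payoff_next, deg=2)
continuation = np.polyval(coeffs, x_now)

# Open now only where today's payoff beats the estimated value of waiting.
open_now = payoff_now > continuation
option_value = np.where(open_now, payoff_now, payoff_next).mean()
print(f"opened now: {open_now.mean():.2f}, value: {option_value:.3f}")
```

The regression step is what makes LSMC tractable: instead of exploring every future, it learns a cheap formula for "how much is waiting worth, given where I am today."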

B. The AI Coach (TPPO)
Since there are too many paths to check, they trained an AI using Deep Reinforcement Learning. Think of this AI as a chess grandmaster who learns by playing millions of games.

  • The Secret Sauce: They didn't just use a standard AI. They used a Transformer (the same technology behind tools like ChatGPT).
  • Why a Transformer? A standard AI might look at cities one by one. A Transformer looks at the whole map at once, understanding how City A relates to City B, City C, and the whole network. It understands the "ripple effects" much better.
  • The Training: The AI plays the game over and over. Every time it picks a group of cities, the "Time-Traveler" math (Real Options) tells it: "Good job! That sequence gave you a high value because you waited for the right moment." The AI learns to repeat those good moves.
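The "look at the whole map at once" ability comes from self-attention. Here is a single attention head over a handful of cities, in plain NumPy; the feature vectors and weights are random placeholders, not the paper's trained network:

```python
import numpy as np

rng = np.random.default_rng(1)

# 5 candidate cities, each described by a small feature vector
# (e.g. demand, opening cost, distance to open regions); illustrative.
n_cities, d = 5, 4
features = rng.normal(size=(n_cities, d))

# Single-head self-attention: every city looks at every other city,
# so the representation of city i reflects the whole network at once.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = features @ Wq, features @ Wk, features @ Wv

scores = Q @ K.T / np.sqrt(d)                  # pairwise city affinities
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)  # softmax over cities
context = weights @ V                          # each row mixes all cities

print(weights.round(2))  # row i: how much city i attends to each city
```

Each row of `weights` sums to 1 and spreads attention across every city, which is exactly why a Transformer can capture "opening A helps B" relationships that a city-by-city model would miss.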

4. What Did They Discover? (The "Aha!" Moments)

After running thousands of simulations on real data from Shanghai, Beijing, and New York, they found some surprising things:

  • Don't Rush the Big Fish: It seems logical to open in the biggest, busiest cities first. But the AI found the opposite! It's often better to start in smaller, quieter cities. Why? Because the "big" cities are so valuable that you want to keep the option to open them later, when you know more. Opening them too early locks you in.
  • The "Goldilocks" Speed: You shouldn't open too few cities (too slow) or too many (too risky). There is a "sweet spot" (usually opening 4 or 5 regions at a time) where you get the most value.
  • Teamwork Matters: The AI learned that certain cities should be opened together because they boost each other's demand. It's like opening a gym and a smoothie shop next to each other; they work better as a pair.
  • The AI Wins: When compared to simple strategies (like "always pick the biggest city" or "always pick the cheapest"), the AI found solutions that were 30% to 50% more profitable.

The Bottom Line

This paper teaches us that expanding a service network isn't just about picking the best locations. It's about timing and flexibility.

By using a smart AI that understands how cities influence each other (the spillover) and values the ability to wait (real options), companies can grow faster, spend less, and make more money than if they just guessed or followed old rules. It's the difference between blindly running a race and having a coach who knows exactly when to sprint and when to hold back.