Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests

Imagine you are the manager of a fleet of small, shared taxis (like a modern, high-tech version of a school bus) that pick people up and drop them off all over a city. This is called Microtransit.

Your job is tricky because you have to deal with two conflicting needs:

The Passenger: "Hey, I need a ride in 2 hours. Can you promise me right now that you'll pick me up?" They need an answer immediately.
The Manager: "Wait, if I promise that ride, will I have enough time to fit it in with everyone else's rides? Maybe if I wait 10 seconds to look at the whole map, I can fit in two more people instead of just one."

The Problem: The "Yes/No" Dilemma

In the past, computer systems for these services had to choose one of two bad options:

Option A (The Fast Promise): They answer the passenger instantly. "Yes, you're in!" But once they say yes, they are stuck with that plan. They can't rearrange the bus seats later to make room for better trips. This leads to a lot of "No" answers later because the bus gets full too fast.
Option B (The Perfect Planner): They wait a long time, looking at all the requests to build the perfect bus route. This is great for efficiency, but the passenger has to wait forever for an answer. In the real world, people get impatient and leave if you don't say "Yes" or "No" quickly.

The Gap: Existing systems could either say "Yes" instantly or keep rearranging the bus to make it better later, but not both.

The Solution: The "Instant Promise, Continuous Rearrangement" System

The authors of this paper built a new system that acts like a super-smart, two-brained conductor.

1. Brain One: The "Quick-Thinker" (Prompt Confirmation)

When a passenger asks for a ride, this brain acts like a fast-food cashier. It looks at the current bus schedule and asks, "Can I squeeze this new order in without breaking the rules?"

It doesn't try to solve the whole day's puzzle. It just checks: "If I put this person here, does it fit?"
It gives an answer in a fraction of a second (about 0.2 seconds on average in the microtransit tests).
The Magic: It uses a special "gut feeling" (trained by AI) to estimate whether saying "Yes" to this person now will hurt the chance of serving other people later.

2. Brain Two: The "Master Planner" (Continual Optimization)

Once the passenger gets their "Yes," the Master Planner wakes up. Imagine a chess player who has just made a move. While the opponent is thinking, the chess player is already looking 10 moves ahead.

Between the time one passenger asks for a ride and the next one arrives, this brain is constantly shuffling the bus routes.
It tries to swap passengers, change pickup orders, and move buses around to make the whole system more efficient.
It uses a technique called "Simulated Annealing" (think of it like shaking a box of puzzle pieces to see if they fit better). It keeps shaking the puzzle until a new request comes in, at which point it stops and locks in the best arrangement it found so far.

The Secret Sauce: The "Crystal Ball" (Reinforcement Learning)

How does the "Quick-Thinker" know that saying "Yes" now is a good idea? It doesn't just look at the current bus; it looks into the future.

The authors trained the system using Reinforcement Learning.

The Analogy: Imagine training a dog. If the dog sits, you give it a treat. If it jumps on the couch, you say "No."
In this paper: The computer played a simulation game over and over. Every time it accepted a request, it got a "digital treat," so strategies that served more people in the long run collected the most treats.
Over time, the computer learned a non-myopic (long-sighted) strategy. It learned that sometimes, taking a slightly "messy" route now is actually better because it saves space for a huge rush of requests coming in an hour.

The Results: Why It Matters

The team tested this on real microtransit data from a U.S. public transit agency and on New York City taxi data.

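Speed: It answers passengers almost instantly (about 0.2 seconds on average on the microtransit data and about 1 second on the New York City data).
Efficiency: It rejected far fewer requests than the systems it was compared against. While other systems might say "No" to 10% of people, this new system said "No" to only about 1% on the microtransit data.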

The Big Picture

Think of this system as a traffic controller for a busy airport.

Old systems were like controllers who either gave a landing slot immediately and never moved planes again (causing delays later), or controllers who waited 20 minutes to calculate the perfect landing sequence (making pilots circle while they waited).
This new system says, "You have a landing slot! Go!" (Instantly). Then, while that plane is on approach, the controller keeps reshuffling the rest of the landing queue to make sure everyone lands smoothly and on time.

In short: This paper gives us a way to promise rides instantly without sacrificing the efficiency of the whole fleet, making on-demand public transport actually viable for everyday use.

1. Problem Definition

The paper addresses a specific gap in Dynamic Vehicle Routing Problems (DVRP) within the context of on-demand microtransit and public transportation services.

The Core Challenge: Transit agencies must handle trip requests that arrive sequentially (stochastically) while vehicles are already in operation.
The Specific Gap: Existing approaches fall into two categories, neither of which fully satisfies real-world operational needs:
1. Prompt Confirmation: Algorithms that immediately accept/reject requests and assign them to vehicles but lack the ability to continually optimize routes, leading to suboptimal service rates.
2. Continual Optimization: Algorithms that continuously re-optimize routes to maximize service rates but delay confirmation, leaving passengers uncertain about their trip status.
The Objective: The authors propose a unified framework that achieves both prompt confirmation (deciding within seconds) and continual optimization (improving routes between requests) to maximize the long-term service rate (the fraction of all requests that are accepted and served).
Constraints: The system must satisfy strict time windows (earliest pickup, latest drop-off), vehicle capacity limits, and travel-time constraints. Crucially, once a request is accepted, the agency must guarantee it can be served; therefore, a feasible route plan must exist at the moment of acceptance.
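The constraints above can be made concrete with a small feasibility check. This is a minimal sketch, not the paper's implementation: the `Stop` record, the `is_feasible` function, the travel-time matrix, and the assumption that the vehicle starts empty are all introduced here for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stop:
    # Hypothetical stop record; field names are assumptions for this sketch.
    request_id: int
    is_pickup: bool
    location: int              # index into the travel-time matrix
    earliest: float = 0.0      # earliest pickup time (used for pickups)
    latest: float = math.inf   # latest drop-off time (used for drop-offs)
    passengers: int = 1


def is_feasible(route, travel_time, start_location, start_time, capacity):
    """Return True if visiting `route` in order respects time windows and capacity.

    Assumes the vehicle starts empty at `start_location` at `start_time`.
    """
    now, location, load = start_time, start_location, 0
    picked_up = set()
    for stop in route:
        now += travel_time[location][stop.location]
        now = max(now, stop.earliest)      # wait if the vehicle arrives early
        if now > stop.latest:
            return False                   # time window violated
        if stop.is_pickup:
            load += stop.passengers
            if load > capacity:
                return False               # vehicle is over capacity
            picked_up.add(stop.request_id)
        else:
            if stop.request_id not in picked_up:
                return False               # drop-off scheduled before its pickup
            load -= stop.passengers
        location = stop.location
    return True


# Toy example: three locations with symmetric travel times in minutes.
travel_time = [[0, 10, 20], [10, 0, 10], [20, 10, 0]]
pickup = Stop(1, True, location=1, earliest=5)
dropoff = Stop(1, False, location=2, latest=30)
on_time = is_feasible([pickup, dropoff], travel_time, 0, 0.0, capacity=4)
```

Under this reading, accepting a request is only allowed when at least one vehicle has a route containing the new pickup and drop-off for which such a check passes.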

2. Methodology

The authors propose a novel computational approach that integrates a Quick Insertion Search for immediate decisions with an Anytime Algorithm for continuous improvement, guided by a Reinforcement Learning (RL) objective function.

A. Formal Model (MDP)

The problem is formulated as a Markov Decision Process (MDP):

State ($s_t$): Includes vehicle locations, current route manifests, set of accepted requests, and the newly arrived request.
Action ($a_t$): A two-part decision:
1. Accept/Reject: Decide whether to accept the new request.
2. Route Update: Generate a new set of feasible route plans ($R_{post}$) for all vehicles.
Reward: $1$ if a request is accepted, $0$ otherwise. The goal is to maximize the cumulative reward (long-term service rate).
Objective Function: Instead of a myopic heuristic (e.g., "maximize current capacity usage"), the system learns a non-myopic Action-Value function ($Q(s, a)$). This function predicts the long-term value of a specific acceptance/routing decision, accounting for future stochastic requests.
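The quantities above can be written out as follows. This is a sketch in standard MDP notation rather than the paper's own formulas: the symbols $r_t$ (reward for the $t$-th request) and $N$ (number of requests in an episode) are introduced here, and the undiscounted finite-horizon form is an assumption.

$$
r_t =
\begin{cases}
1 & \text{if request } t \text{ is accepted} \\
0 & \text{otherwise}
\end{cases}
\qquad
\text{service rate} = \frac{1}{N} \sum_{t=1}^{N} r_t
$$

$$
Q(s_t, a_t) = \mathbb{E}\left[ \sum_{k=t}^{N} r_k \,\middle|\, s_t, a_t \right]
\qquad
a_t \in \arg\max_{a} Q(s_t, a)
$$

Because every accepted request must be served, maximizing the expected sum of rewards is the same as maximizing the expected service rate. A myopic rule looks only at $r_t$, while $Q$ also accounts for the requests $t+1, \dots, N$ that have not arrived yet.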

B. Two-Stage Computational Approach

The solution operates in two distinct phases triggered by request arrivals:

Prompt Confirmation (Quick Insertion):
- Trigger: Occurs immediately when a new request arrives.
- Mechanism: A "Quick Insertion" algorithm searches for a feasible insertion of the new request into existing routes without reordering existing stops (to ensure speed).
- Optimization: It selects the insertion that maximizes the learned $Q$-value.
- Performance: Runs in about a second or less on average (0.2 s on the microtransit data, 1 s on the NYC data), providing immediate feedback to the passenger.
Continual Optimization (Anytime Algorithm):
- Trigger: Occurs in the idle time between consecutive request arrivals.
- Mechanism: A Simulated Annealing metaheuristic continuously refines the route manifests.
- Operations: It uses mutation operators (Swap, Move, Shift, Reverse) to explore the solution space.
- Termination: The algorithm is "anytime," meaning it can be interrupted at any moment (when the next request arrives) and return the best feasible solution found so far.
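The two phases can be sketched for a single vehicle as follows. This is an illustration under stated assumptions, not the authors' code: `quick_insertion`, `anytime_anneal`, `pickup_first`, and `toy_score` are hypothetical names, `toy_score` stands in for the learned $Q$-value, only a swap move is implemented (the paper also lists Move, Shift, and Reverse), and a wall-clock deadline stands in for the arrival of the next request.

```python
import math
import random
import time


def quick_insertion(route, pickup, dropoff, feasible, score):
    """Try every pickup/drop-off position while keeping existing stops in order.

    Returns the best-scoring feasible route, or None (meaning: reject).
    """
    best, best_score = None, -math.inf
    for i in range(len(route) + 1):
        for j in range(i + 1, len(route) + 2):
            candidate = route[:i] + [pickup] + route[i:]
            candidate.insert(j, dropoff)        # drop-off always after pickup
            if feasible(candidate) and score(candidate) > best_score:
                best, best_score = candidate, score(candidate)
    return best


def anytime_anneal(route, feasible, score, deadline,
                   temperature=1.0, cooling=0.995, rng=random):
    """Simulated annealing with swap moves until `deadline` (time.monotonic())."""
    current, current_score = list(route), score(route)
    best, best_score = list(current), current_score
    while len(current) >= 2 and time.monotonic() < deadline:
        i, j = rng.sample(range(len(current)), 2)
        neighbor = list(current)
        neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
        if feasible(neighbor):                  # never leave the feasible set
            delta = score(neighbor) - current_score
            # Always accept improvements; accept worse moves with some probability.
            if delta >= 0 or rng.random() < math.exp(delta / temperature):
                current, current_score = neighbor, current_score + delta
                if current_score > best_score:
                    best, best_score = list(current), current_score
        temperature = max(temperature * cooling, 1e-9)
    return best                                 # best feasible route seen so far


def pickup_first(route):
    """Toy feasibility rule: each drop-off "Dk" comes after its pickup "Pk"."""
    return all(route.index("P" + s[1:]) < k
               for k, s in enumerate(route) if s[0] == "D")


def toy_score(route):
    """Toy stand-in for the learned Q-value: prefer early drop-offs."""
    return -sum(k for k, s in enumerate(route) if s[0] == "D")


plan = quick_insertion(["P1", "D1"], "P2", "D2", pickup_first, toy_score)
improved = anytime_anneal(plan, pickup_first, toy_score,
                          deadline=time.monotonic() + 0.05)
```

Because `anytime_anneal` only ever stores feasible routes in `best`, it can be stopped at any point and still return a plan that honors every accepted request, which is the property the "anytime" design relies on.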

C. Reinforcement Learning & Feature Engineering

Learning Algorithm: The authors use Q-learning to approximate the optimal policy.
Pre-training: To mitigate the high computational cost of pure RL, they employ Supervised Pre-training. A simple policy (always accept if feasible, maximize idle time) generates 1 million experiences to pre-train the neural network.
Neural Architectures: They tested Multi-Layer Perceptrons (MLP), Kolmogorov-Arnold Networks (KAN), and Convolutional Neural Networks (CNN).
Feature Vectors: To handle the unstructured state space, they designed fixed-length feature vectors representing:
- Total idle time.
- Temporal availability (vehicles idle per time interval).
- Spatio-temporal availability (idle vehicles in specific grid cells over time).
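One way such fixed-length features could be computed is sketched below. The exact definitions used in the paper are not given here, so everything in this block is an assumption: the function name `idle_features`, the representation of idle periods as `(start, end, cell)` tuples, and the rule that a vehicle counts as idle in a time bin if any of its idle periods overlaps that bin.

```python
def idle_features(idle_intervals, horizon, num_bins, num_cells):
    """Build a fixed-length vector from per-vehicle idle periods.

    idle_intervals: one list per vehicle of (start, end, cell) tuples, where
    `cell` is the index of the grid cell the vehicle idles in (assumed layout).
    Returns [total idle time] + idle vehicles per time bin
            + idle vehicles per (cell, time bin).
    """
    width = horizon / num_bins
    # Total idle time across the fleet, clipped to the planning horizon.
    total_idle = sum(
        max(0.0, min(end, horizon) - max(start, 0.0))
        for vehicle in idle_intervals
        for start, end, _cell in vehicle
    )
    temporal = [0] * num_bins
    spatiotemporal = [0] * (num_cells * num_bins)
    for vehicle in idle_intervals:
        for b in range(num_bins):
            lo, hi = b * width, (b + 1) * width
            # Cells in which this vehicle is idle at some point during bin b.
            cells = {cell for start, end, cell in vehicle
                     if start < hi and end > lo}
            if cells:
                temporal[b] += 1
            for cell in cells:
                spatiotemporal[cell * num_bins + b] += 1
    return [total_idle] + temporal + spatiotemporal


# Toy fleet: two vehicles, a 60-minute horizon, three 20-minute bins, two cells.
fleet = [
    [(0, 30, 0)],                 # vehicle A idles in cell 0
    [(20, 40, 1), (50, 60, 0)],   # vehicle B idles in cell 1, then cell 0
]
features = idle_features(fleet, horizon=60, num_bins=3, num_cells=2)
```

The vector length depends only on `num_bins` and `num_cells`, not on the number of vehicles or requests, which is what lets a fixed-size network consume an otherwise unstructured state.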

3. Key Contributions

Novel Problem Formulation: Defined the "Dynamic VRP with Prompt Confirmation and Continual Optimization," bridging the gap between immediate user feedback and long-term system efficiency.
Hybrid Algorithmic Framework: Successfully integrated a fast, deterministic insertion search for real-time decisions with a stochastic, anytime metaheuristic for continuous improvement.
Non-Myopic Objective via RL: Demonstrated that using a learned $Q$-function (trained via RL) as the objective for both the insertion and optimization phases significantly outperforms traditional myopic heuristics.
Real-World Validation: Validated the approach using a real-world microtransit dataset from a U.S. public transit agency, alongside the standard NYC taxi dataset.

4. Experimental Results

The approach was evaluated against three baselines: Google OR-Tools, Rolling Horizon (RH), and Monte Carlo VRP (MC VRP).

Service Rate (Rejection Rate):
- The proposed approach ($\pi^*$) achieved a rejection rate of ~1% on the microtransit dataset.
- It significantly outperformed all baselines. For instance, while OR-Tools had low confirmation times, it suffered from higher rejection rates due to its lack of continual optimization.
Confirmation Time:
- The prompt confirmation step took an average of 0.2 seconds (microtransit) and 1 second (NYC data).
- This is comparable to OR-Tools (0.1 s) and significantly faster than Rolling Horizon (over 50 s), making it viable for real-time passenger interaction.
Ablation Studies:
- Continual Optimization: Results showed that increasing the runtime of the anytime algorithm between requests substantially reduced rejection rates, supporting the value of the "idle time" optimization.
- Learned Q-Function: Replacing the learned $Q$-function with simple heuristics resulted in higher rejection rates, indicating that the non-myopic, RL-guided objective contributes to the result.

5. Significance

This work is significant for the deployment of on-demand microtransit services for several reasons:

User Trust: It solves the "uncertainty" problem for passengers by providing immediate confirmation while guaranteeing that accepted trips will actually be served.
Operational Efficiency: It maximizes fleet utilization and service coverage by continuously re-optimizing routes as new data arrives, rather than locking in suboptimal routes immediately.
Scalability: The combination of a fast insertion heuristic and an anytime optimizer allows the system to scale to real-world request volumes without sacrificing decision quality or speed.
Practical Applicability: Unlike many theoretical RL papers that use simplified simulations, this approach is validated on real-world road networks and request patterns, which suggests it is a realistic candidate for deployment in public transit agencies.