Learning to Solve Orienteering Problem with Time Windows and Variable Profits

Imagine you are the manager of a fleet of delivery robots in a busy factory. Your goal sounds simple: collect as much value as possible before time runs out.

But here's the catch:

The Route: You can't visit every single machine. You have to pick the best ones to visit.
The Time Windows: Some machines are only available for a few minutes (like a meeting room that's booked). If you miss the window, you can't go in.
The "Variable Profit": This is the tricky part. The more time you spend fixing a machine, the more value you get. But spending too much time on one machine means you might miss out on visiting three other machines entirely.

This complex puzzle is called the Orienteering Problem with Time Windows and Variable Profits (OPTWVP). It's a nightmare for computers because they have to make two types of decisions at once:

Discrete decisions: "Which machines do I visit?" (Yes or No).
Continuous decisions: "How long do I spend at each machine?" (1.2 minutes? 4.5 minutes?).

Most computer programs struggle with this because they try to solve the "which" and the "how long" at the same time, getting tangled in a mess of possibilities.

Enter DeCoST: The "Two-Step Dance"

The authors of this paper propose a new method called DeCoST (DEcoupled discrete-Continuous optimization with Service-time-guided Trajectory). Think of DeCoST not as a single brain trying to do everything, but as a two-step dance between a "Strategist" and a "Tactician."

Step 1: The Strategist (The "Rough Draft")

First, the AI acts like a Strategist. It looks at the map and quickly sketches out a route.

It picks the machines to visit.
It guesses a rough amount of time to spend at each one.

Crucially, this Strategist is trained to be smart. It doesn't just guess randomly; it learns from experience (using a technique called Reinforcement Learning) to understand that spending too much time here might mean missing a huge opportunity there. It creates a "feasible" plan that respects the time windows.

Step 2: The Tactician (The "Fine-Tuner")

Once the route is drawn, the AI hands the plan to the Tactician.

The Tactician doesn't change the route. The "which machines" decision is locked in.
Instead, the Tactician solves a pure math problem (Linear Programming) to figure out the perfect amount of time to spend at each machine to maximize the total reward.

Why is this brilliant?
It's like writing a novel.

Old way: You try to write the plot, the dialogue, the character arcs, and the grammar all in one sentence. It's chaotic.
DeCoST way: First, you write the plot outline (Step 1). Then, you go back and polish the sentences and dialogue to perfection (Step 2). By separating the "big picture" from the "details," the computer solves the problem much faster and better.

The Secret Sauce: "The Repulsive Coach"

There's a clever trick in Step 1. Usually, if you teach a computer to solve a math problem, it gets lazy and just copies the answer from the math solver (Step 2). This is bad because it stops learning how to make good guesses.

The authors added a "Repulsive Coach" (called pTAR).

Imagine a coach who yells, "Don't just copy the final answer! Try to guess the spirit of the answer!"
This coach pushes the Strategist to make guesses that are different from the perfect math solution, forcing it to learn the underlying patterns of the problem. This ensures the Strategist gets really good at making the initial route, not just copying the math.

The Results: Fast and Accurate

The paper tested DeCoST against the best existing methods (both human-designed algorithms and other AI models).

Speed: DeCoST is incredibly fast. For a problem with 500 machines, it found a great solution in 1.3 seconds. The best competing method took 8.8 seconds (and that was with a time limit!). That's more than a 6x speedup.
Quality: It found better routes than the others, getting closer to the theoretical "perfect" score.

The Bottom Line

DeCoST is like a master chef who knows that you can't chop vegetables and bake a cake at the exact same time.

First, you prep the ingredients and decide the menu (The Route).
Then, you focus entirely on the perfect cooking time for each dish (The Service Time).

By separating these tasks but keeping them connected, DeCoST solves a very difficult logistics problem faster and more accurately than the methods it was tested against, which makes it a promising fit for real-world applications like robot factories, delivery drones, and emergency response teams.

1. Problem Definition: OPTWVP

The paper addresses the Orienteering Problem with Time Windows and Variable Profits (OPTWVP), a complex variant of the Vehicle Routing Problem (VRP).

Core Challenge: Unlike standard routing problems, OPTWVP involves hybrid decision variables:
1. Discrete: Selecting a subset of nodes to visit and determining the order (routing).
2. Continuous: Allocating service time at each visited node.
Constraints:
- Time Windows: Nodes are only accessible within specific time intervals $[s_i^-, s_i^+]$ .
- Variable Profits: The reward collected at a node is not fixed; it is a linear function of the service time allocated ( $\text{profit}_i = p_i d_i$ , where $p_i$ is the profit rate of node $i$ and $d_i$ is the service time spent there).
- Time Budget: The total travel and service time must not exceed a global limit.
Difficulty: The discrete routing and continuous service time allocation are tightly coupled. A change in the route affects the feasible service time windows for subsequent nodes, and the allocated service time affects the total reward and feasibility of the route. This interdependency causes an exponential expansion of the search space, making traditional solvers inefficient and existing Neural Combinatorial Optimization (NCO) methods (which often focus only on routing) suboptimal.
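
To make the coupling concrete, here is a minimal Python sketch that scores one candidate solution. It is an illustration, not the paper's code: the function name `evaluate`, the rule that service must finish before a window closes, the permission to wait for a window to open, and the toy numbers are all assumptions.

```python
def evaluate(route, service, travel, windows, rate, budget, depot=0):
    """Return (feasible, profit) for one candidate solution.

    Assumptions (not taken from the paper): the vehicle may wait for a
    window to open, service must finish before the window closes, and
    the vehicle must be back at the depot by `budget`.
    """
    clock, profit, prev = 0.0, 0.0, depot
    for node in route:
        open_t, close_t = windows[node]
        # Travel to the node, then wait if we arrive before it opens.
        clock = max(clock + travel[prev][node], open_t)
        clock += service[node]
        if clock > close_t:
            return False, 0.0
        profit += rate[node] * service[node]  # linear profit p_i * d_i
        prev = node
    clock += travel[prev][depot]  # return leg to the depot
    if clock > budget:
        return False, 0.0
    return True, profit


# Toy instance: node 0 is the depot, nodes 1 and 2 are customers.
travel = [[0, 2, 2], [2, 0, 2], [2, 2, 0]]
windows = [(0, 12), (2, 6), (5, 12)]
rate = [0, 3, 1]

print(evaluate([1, 2], [0, 4, 2], travel, windows, rate, budget=12))  # (True, 14.0)
print(evaluate([1, 2], [0, 5, 2], travel, windows, rate, budget=12))  # (False, 0.0)
```

Staying one extra time unit at node 1 pushes the service past its window and invalidates the whole route, which is the interdependency described above.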

2. Methodology: DeCoST Framework

The authors propose DeCoST (DEcoupled discrete-Continuous optimization with Service-time-guided Trajectory), a learning-based two-stage framework designed to decouple and coordinate these variables.

Stage 1: Parallel Decoding (Discrete & Initial Continuous)

Architecture: Uses a Transformer-based encoder-decoder structure with Spatial Encoding (incorporating edge features like distances as attention biases) to capture graph connectivity.
Parallel Decoders:
1. Routing Decoder: Predicts the next node to visit.
2. Service Time Decoder (STD): Simultaneously predicts the initial service time allocation ratio ( $\delta \in [0,1]$ ) for the selected node.
Feasibility Masking: Dynamically masks nodes that would violate time window constraints or exceed the total time budget, ensuring the generated trajectory is feasible.
Output: A feasible trajectory $\tau$ and an initial service time allocation $\hat{d}$ .
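
The masking step can be sketched in a few lines of plain Python. This is a hypothetical rule written for illustration, since the summary does not spell out the exact masking logic: the function name `feasible_mask`, the zero-service reachability test, and the toy data are assumptions.

```python
def feasible_mask(current, clock, visited, travel, windows, budget, depot=0):
    """Return one boolean per node: True if the node may be chosen next.

    Assumed rule (hypothetical): a node is allowed if it is unvisited,
    can be reached before its window closes, and still leaves enough
    time to travel back to the depot within `budget`.
    """
    mask = []
    for node, (open_t, close_t) in enumerate(windows):
        if node in visited:
            mask.append(False)
            continue
        # Arrival time, waiting if the window has not opened yet.
        arrival = max(clock + travel[current][node], open_t)
        in_window = arrival <= close_t
        can_return = arrival + travel[node][depot] <= budget
        mask.append(in_window and can_return)
    return mask


# Toy instance: node 0 is the depot; every leg takes 2 time units.
travel = [[0 if i == j else 2 for j in range(4)] for i in range(4)]
windows = [(0, 12), (2, 6), (5, 12), (0, 1)]

print(feasible_mask(0, 0, {0}, travel, windows, budget=12))
# [False, True, True, False]
```

In a neural decoder the `False` entries would typically be turned into very negative logits before the softmax, so masked nodes receive zero selection probability.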

Stage 2: Service Time Optimization (STO)

Decoupling: Once the discrete path $\tau$ is fixed in Stage 1, the problem simplifies to a Linear Programming (LP) problem: maximizing total profit by optimizing service times $d$ subject to time window and budget constraints.
Algorithm: The authors introduce a custom Service Time Optimization (STO) algorithm (Algorithm 1) that solves this LP efficiently in parallel.
- It iteratively assigns maximum possible service time to nodes with the highest profit density, respecting the "bottleneck" constraints imposed by downstream time windows.
Theoretical Guarantee: The paper provides a rigorous proof (Theorem 4.1) that the STO algorithm yields the global optimum for the service time allocation given a fixed trajectory.
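
The idea of filling the most profitable stops first can be sketched as follows. This is not the paper's Algorithm 1, only a simple sequential greedy under stated assumptions: stops are listed in route order with the depot at both ends, service must fit inside each window, waiting is allowed, and the closing time of the final depot entry plays the role of the time budget. The name `greedy_service_times` and the toy numbers are hypothetical.

```python
def greedy_service_times(open_t, close_t, rate, travel):
    """Assign service times along a fixed route, highest profit rate first.

    open_t, close_t, rate: one entry per stop, in route order.
    travel: travel[k] is the travel time from stop k to stop k + 1.
    """
    n = len(rate)
    d = [0.0] * n
    for i in sorted(range(n), key=lambda k: -rate[k]):
        if rate[i] <= 0:
            break  # remaining stops earn nothing, leave them at zero
        # Forward pass: earliest time service can start at stop i.
        start = open_t[0]
        for k in range(1, i + 1):
            start = max(open_t[k], start + d[k - 1] + travel[k - 1])
        # Backward pass: latest time service must end at stop i.
        end = close_t[n - 1]
        for k in range(n - 2, i - 1, -1):
            end = min(close_t[k], end - d[k + 1] - travel[k])
        if end < start:
            raise ValueError("route is infeasible even with zero service")
        d[i] = end - start
    return d


# Route: depot -> A -> B -> depot, every leg takes 2 time units.
d = greedy_service_times(
    open_t=[0, 2, 5, 0],
    close_t=[12, 6, 12, 12],
    rate=[0, 3, 1, 0],
    travel=[2, 2, 2],
)
print(d)  # [0.0, 4, 2, 0.0]
```

On this toy route the constraints reduce to $d_A \le 4$, $d_B \le 5$ and $d_A + d_B \le 6$, so the result (4, 2) with profit 14 matches the LP optimum worked out by hand. The sketch recomputes both passes for every stop, which costs quadratic time; it is meant to show the bottleneck logic, not the parallel implementation the authors describe.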

Supervised Learning Mechanism (pTAR)

Challenge: Training the Stage 1 model to predict good service times is difficult because the "optimal" service time depends on the global route structure, which the model doesn't know yet.
Solution: The authors introduce a Profit-weighted Time Allocation Ratio (pTAR) metric:
$pTAR(d) = \sum_{i \in \tau} \frac{p_i d_i}{t_i}$
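Here $p_i d_i$ is the profit collected at node $i$ and $t_i$ is the travel cost attributed to that node, so pTAR measures profit efficiency per unit of travel cost.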
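Repulsive Supervisory Loss: A loss term $L_{pTAR} = -(pTAR(\hat{d}) - pTAR(d^*))^2$ is added to the training objective, where $\hat{d}$ is the Stage 1 prediction and $d^*$ is the optimal allocation found in Stage 2. Because of the leading minus sign, minimizing this term enlarges the difference between the two pTAR values, which is why the signal is called repulsive: it discourages the STD from simply reproducing the Stage 2 output. The stated purpose is to prevent the model from converging to a local, deterministic optimum and to improve long-horizon structure estimation.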
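
A small numeric sketch of the metric and the loss term, written in plain Python for readability (a training pipeline would use a tensor library instead). The function names `ptar` and `ptar_loss` and the toy values are assumptions, and $t_i$ is taken here to be a positive travel time for reaching node $i$.

```python
def ptar(rate, service, travel_cost):
    """Sum over visited nodes of p_i * d_i / t_i."""
    return sum(p * d / t for p, d, t in zip(rate, service, travel_cost))


def ptar_loss(rate, predicted, optimal, travel_cost):
    """Negative squared pTAR difference, following the formula above."""
    diff = ptar(rate, predicted, travel_cost) - ptar(rate, optimal, travel_cost)
    return -(diff ** 2)


rate = [3, 1]          # p_i for the two visited nodes
travel_cost = [2, 2]   # t_i, assumed travel time to reach each node
d_hat = [3, 3]         # Stage 1 prediction
d_star = [4, 2]        # Stage 2 optimum

print(ptar(rate, d_hat, travel_cost))               # 6.0
print(ptar(rate, d_star, travel_cost))              # 7.0
print(ptar_loss(rate, d_hat, d_star, travel_cost))  # -1.0
```

The loss is zero when the two pTAR values coincide and becomes more negative as they move apart, so a minimizer is rewarded for separation, consistent with the sign in the formula.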

3. Key Contributions

Novel Framework (DeCoST): A two-stage approach that effectively decouples discrete routing and continuous service time allocation while maintaining learnable coordination between them.
Global Optimality Proof: Theoretical proof that the proposed STO algorithm guarantees the global optimum for service time allocation on a fixed path.
Repulsive Supervisory Signal: The introduction of the pTAR metric and associated loss function to guide the neural network in learning the complex trade-off between travel time and service time allocation.
State-of-the-Art Performance: Demonstrated superior performance over both heuristic/meta-heuristic algorithms and existing NCO methods in terms of solution quality and inference speed.

4. Experimental Results

The method was evaluated on OPTWVP benchmarks with varying node counts ( $n=50, 100, 500$ ) and time window sizes.

Solution Quality:
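- DeCoST consistently outperformed the heuristic and learning-based baselines (Greedy-PRS, ILS, POMO, GFACS). Gurobi serves as the reference solution against which the gaps below are measured.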
- On $n=100, TW=100$ , DeCoST achieved a Gap of 1.97% compared to the optimal Gurobi solution, whereas the best meta-heuristic (ILS) had a gap of 4.2%.
- On large-scale instances ( $n=500$ ), DeCoST maintained a gap of 3.31% compared to Gurobi's 0.00%, while other methods degraded significantly (e.g., POMO at 28.8%).
Computational Efficiency:
- DeCoST is significantly faster than iterative search methods.
- On $n=500$ , DeCoST solved instances in 1.3 seconds, compared to 8.8 seconds for ILS (roughly a 6.8x speedup based on these rounded figures).
- It achieved up to 45x speedup over ILS on smaller instances while maintaining higher solution quality.
Ablation Studies:
- Removing the STO module raised the optimality gap from ~1% to ~2.3% (TW=100) and to ~1.9% (TW=500), which points to the importance of the second-stage optimization.
- Adding the pTAR loss further reduced the gap, confirming the value of the supervised guidance mechanism.
Robustness: The method showed high stability across different problem instances, with a lower standard deviation in optimality gaps compared to baselines.

5. Significance

Bridging Discrete and Continuous: This work addresses a critical gap in Neural Combinatorial Optimization where most methods struggle with mixed discrete-continuous variables. DeCoST provides a scalable, learnable framework for such problems.
Real-World Applicability: The OPTWVP models real-world scenarios like robotic assembly, logistics with variable service requirements, and team scheduling. The ability to solve these problems efficiently and near-optimally has immediate practical value.
Efficiency vs. Quality Trade-off: The paper demonstrates that learning-based methods can surpass traditional meta-heuristics not just in speed, but also in solution quality, challenging the notion that fast inference must come at the cost of optimality.
Generalizability: The framework is shown to be compatible with various constructive solvers (tested on POMO and GFACS) and extends to Team OPTWVP (multiple vehicles), suggesting broad applicability to other VRP variants.