Imagine you are a delivery driver with a very specific job: you have a list of customers to visit, and each customer has a strict "open and close" time for their door. You can't arrive too early (you'd have to wait), and you can't arrive too late (you'd be kicked out). Your goal is to figure out the perfect order to visit everyone so you finish your route as quickly as possible.
This is the Traveling Salesman Problem with Time Windows (TSPTW). It's a classic puzzle that computer scientists have been trying to solve for decades.
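The rules above can be sketched in a few lines of Python (a made-up helper for illustration, not code from the paper): given a visiting order, simulate the clock, wait when you arrive early, and fail when you arrive late.

```python
def route_finish_time(order, windows, travel, depot=0):
    """Simulate a route; return the finish time, or None if any window is missed.
    windows[c] = (open, close); travel[i][j] is the drive time from i to j."""
    t, prev = 0, depot
    for c in order:
        t += travel[prev][c]          # drive to the next customer
        open_t, close_t = windows[c]
        if t > close_t:               # arrived too late: route infeasible
            return None
        t = max(t, open_t)            # arrived too early: wait at the door
        prev = c
    return t

# A tiny 3-customer instance (invented for this sketch; index 0 is the depot).
windows = [(0, float("inf")), (2, 4), (5, 8), (9, 12)]
travel = [[0, 2, 5, 2],
          [2, 0, 3, 6],
          [5, 3, 0, 3],
          [2, 6, 3, 0]]

print(route_finish_time([1, 2, 3], windows, travel))  # 9: feasible (waits before 3)
print(route_finish_time([3, 2, 1], windows, travel))  # None: customer 2's door has closed
```

Finding the order that minimizes this finish time over all possible orders is what makes the problem hard.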
For years, researchers have used a specific set of "practice exams" (called benchmark instances) to test how good their new computer algorithms are. If an algorithm solves these exams quickly, everyone cheers, and the algorithm is considered a genius.
Here is the twist: This paper argues that those "practice exams" are actually broken. They are like a math test where the questions are so predictable that a student who just memorized the answer key can get 100% in seconds, even if they don't actually understand math.
The "Magic" Shortcut
The author, Francisco Soulignac, built a very simple, almost "dumb" computer program. It doesn't use fancy tricks or super-complex math. Instead, it uses a simple strategy: it works backward.
Imagine you are trying to get home. Instead of planning your route from your house to the store, you start at your front door and ask, "Who could I have just visited to get here on time?" Then you ask that person, "Who did you visit before me?" You keep going backward until you hit the start of the day.
Because the "practice exams" use very tight, rigidly scheduled time windows, this backward-looking method finds the answer almost instantly: at almost every step, only one customer could plausibly have come just before.
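Here is a minimal sketch of that backward idea (my own illustration under the setup above, not the paper's implementation): build the tour last-customer-first, track the latest time you could still arrive at the front of the partial tour, and prune any candidate whose window closes too early to make that deadline.

```python
def backward_tour(windows, travel, depot=0):
    """Find a feasible visiting order by extending the tour backward.
    `latest` is the latest feasible arrival time at suffix[0]."""
    customers = frozenset(range(len(windows))) - {depot}

    def extend(suffix, remaining, latest):
        if not remaining:
            # the route starts at the depot at time 0
            return [depot] + suffix if travel[depot][suffix[0]] <= latest else None
        for i in remaining:
            open_t, close_t = windows[i]
            # latest time we may be ready at i and still reach suffix[0] on time
            li = min(close_t, latest - travel[i][suffix[0]])
            if li < open_t:
                continue  # i's window closes too early to fit here: prune
            tour = extend([i] + suffix, remaining - {i}, li)
            if tour:
                return tour
        return None

    for last in customers:
        tour = extend([last], customers - {last}, windows[last][1])
        if tour:
            return tour
    return None

# Same toy instance as before (invented): depot 0, customers 1-3.
windows = [(0, float("inf")), (2, 4), (5, 8), (9, 12)]
travel = [[0, 2, 5, 2],
          [2, 0, 3, 6],
          [5, 3, 0, 3],
          [2, 6, 3, 0]]

print(backward_tour(windows, travel))  # [0, 1, 2, 3]
```

When the windows are tight, the `li < open_t` test prunes almost every candidate, so the search barely branches; that rigid structure is exactly what the paper says the classic benchmarks reward.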
- The Result: The author's simple program solved the hardest "practice exams" (with 50+ customers) in less than 10 seconds.
- The Shock: Previous "super-complex" algorithms took minutes or even hours to solve the same problems.
The "Fake Hard" Problem
The paper reveals that these classic benchmarks have a hidden flaw: they are too structured.
Think of it like a maze.
- Real-world mazes are messy, with dead ends, loops, and confusing paths.
- The "Classic" mazes used in these tests are like a hallway with doors that only open at specific times. Because the doors are so strictly timed, there is only one logical path, and a simple algorithm can just follow the "open" signs backward.
The author found that if you make the time windows "looser" (like giving customers a wider window of time to receive a package), the simple backward algorithm suddenly fails. It gets lost in the messiness. But the old, complex algorithms that were praised for being "smart" also struggle with these looser, more realistic scenarios.
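The effect of loosening the windows can be seen directly on a toy instance (again an invented example, not data from the paper): brute-force counting shows that tight windows leave exactly one feasible visiting order, while wide-open windows make every order feasible, so the "only one logical path" structure vanishes and the search space explodes.

```python
from itertools import permutations

def feasible(order, windows, travel, depot=0):
    """True if the order respects every time window (waiting when early)."""
    t, prev = 0, depot
    for c in order:
        t += travel[prev][c]
        open_t, close_t = windows[c]
        if t > close_t:
            return False
        t = max(t, open_t)
        prev = c
    return True

def count_feasible_orders(windows, travel):
    custs = range(1, len(windows))
    return sum(feasible(p, windows, travel) for p in permutations(custs))

travel = [[0, 2, 5, 2],
          [2, 0, 3, 6],
          [5, 3, 0, 3],
          [2, 6, 3, 0]]
tight = [(0, float("inf")), (2, 4), (5, 8), (9, 12)]
loose = [(0, float("inf"))] * 4  # every door is always open

print(count_feasible_orders(tight, travel))  # 1: a single forced order
print(count_feasible_orders(loose, travel))  # 6: all 3! orders are feasible
```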
Why This Matters for AI and Machine Learning
This is a big deal for the world of Artificial Intelligence.
- The Trap: Many AI researchers train their "smart" delivery algorithms using these same "broken" practice exams. They generate thousands of fake problems that look just like the classic benchmarks.
- The Consequence: The AI learns to cheat. It learns to exploit the specific patterns of these fake problems. When you show the AI a real problem with messy, real-world timing, it might fail miserably because it was trained on a "fake" version of reality.
The author warns: "Don't let your AI study for a test that doesn't exist."
The Takeaway
- The Old Tests are Broken: The classic benchmarks used for the last 30 years are no longer good at testing whether an algorithm is actually smart. They are too easy for simple tricks.
- Beware of "Outstanding" Results: If a new algorithm claims to solve these classic problems in record time, it might just be exploiting the flaws in the test, not actually being a better solver.
- We Need Harder Tests: To truly test a delivery algorithm (or an AI), we need to use problems with "looser" time windows that mimic the chaos of the real world.
- Simple is Sometimes Better: Sometimes, a simple, backward-looking approach works great on structured data, but we need to know its limits so we don't get fooled.
In short: The paper is a wake-up call. It tells the scientific community to stop using the same old, easy practice exams and start testing their algorithms on messy, realistic problems, or else we might be building "smart" systems that are actually just very good at cheating.