Construct, Merge, Solve & Adapt with Reinforcement Learning for the min-max Multiple Traveling Salesman Problem

This paper proposes RL-CMSA, a hybrid algorithm that integrates reinforcement learning-guided probabilistic clustering with exact MILP optimization and local search to effectively solve the min-max Multiple Traveling Salesman Problem, demonstrating superior performance over state-of-the-art methods on large-scale instances.

Guillem Rodríguez-Corominas, Maria J. Blesa, Christian Blum

Published 2026-03-02

Imagine you are the manager of a fleet of delivery trucks. You have a central warehouse (the depot) and hundreds of customers scattered across a city. Your goal isn't just to get everything delivered; it's to make sure no single driver is overworked. You want to minimize the time the longest route takes, ensuring a fair balance for everyone. This is the Min-Max Multiple Traveling Salesman Problem.

The paper introduces a new, smart way to solve this puzzle called RL-CMSA. It's a hybrid method that mixes old-school math with modern "learning" AI. Here is how it works, explained through a simple story.

The Problem: The "Fairness" Puzzle

Imagine you have 10 drivers and 500 houses to visit. If you just let them drive randomly, one driver might end up with a 10-hour shift while another finishes in 2 hours. That's bad. You need to split the city into 10 perfect loops so that the longest loop is as short as possible.

Doing this by hand is impossible. Even computers struggle, because the number of ways to partition the houses and order each route grows astronomically with the number of cities (like trying to find a specific grain of sand on all the beaches on Earth).
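The "fairness" objective itself is simple to state in code, even though optimizing it is hard. The sketch below (illustrative only, not from the paper) computes each driver's closed loop length and takes the maximum, which is exactly the quantity the min-max mTSP asks you to minimize:

```python
import math

def tour_length(depot, stops):
    """Length of one closed tour: depot -> stops -> depot."""
    path = [depot] + stops + [depot]
    return sum(math.dist(path[i], path[i + 1]) for i in range(len(path) - 1))

def makespan(depot, routes):
    """The min-max objective: the length of the longest single route."""
    return max(tour_length(depot, r) for r in routes)

depot = (0.0, 0.0)
# Two drivers: a balanced split vs. dumping everything on one driver.
balanced = [[(1.0, 0.0)], [(-1.0, 0.0)]]
unbalanced = [[(1.0, 0.0), (-1.0, 0.0)], []]
print(makespan(depot, balanced))    # → 2.0 (each loop has length 2)
print(makespan(depot, unbalanced))  # → 4.0 (one idle driver, one long loop)
```

Note that total driving distance is the same in both cases; only the min-max objective distinguishes the fair split from the unfair one.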

The Solution: The "Smart Workshop" (RL-CMSA)

The authors built a digital workshop that solves this problem in four repeating steps. Think of it as a team of chefs trying to create the perfect menu, but they keep learning from their mistakes.

1. Construct: The "Clustering Chef"

First, the computer tries to guess how to group the houses.

  • The Old Way: Just throw dice to decide which house goes to which driver.
  • The New Way (RL-CMSA): The computer uses a "memory bank" (Reinforcement Learning). It remembers which pairs of houses usually end up in the same route in good solutions.
    • Analogy: Imagine a chef who knows that "Pizza" and "Garlic Bread" always go well together. If the chef sees a new order, they automatically put those two items in the same box. The computer does this with cities, using "Q-values" (a score of how good a pair is) to guide the grouping.
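The "Pizza and Garlic Bread" idea can be sketched as a roulette-wheel choice: when extending a group, candidates that share high Q-values with cities already in the group are proportionally more likely to be picked. This is a toy illustration; the pair-score representation (`frozenset({a, b})`) and the fallback score of 1.0 are assumptions, not the paper's exact scheme:

```python
import random

def pick_next_city(current_group, candidates, q):
    """Sample the next city for a partially built group.
    q maps frozenset({a, b}) -> learned pair score (assumed representation);
    unseen pairs default to a neutral score of 1.0."""
    scores = []
    for c in candidates:
        s = sum(q.get(frozenset({c, g}), 1.0) for g in current_group)
        scores.append(s or 1.0)  # never let a candidate have zero weight
    r = random.uniform(0, sum(scores))
    acc = 0.0
    for c, s in zip(candidates, scores):
        acc += s
        if acc >= r:
            return c
    return candidates[-1]
```

Because the choice is probabilistic rather than greedy, the construction step still explores: high-scoring pairs are favored, but low-scoring groupings are occasionally tried too.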

2. Merge: The "Recipe Book"

The computer generates many different groupings (like trying 20 different ways to split the city). It takes the best routes from all these attempts and puts them into a Pool (a recipe book).

  • It's picky: If a route is too long or visits the same houses as a better route already in the book, it gets thrown out. The pool stays compact and high-quality.
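The "picky recipe book" can be sketched as a dictionary keyed by the set of houses a route visits, keeping only the shortest route seen for each set. The `max_len` filter and the `(length, route)` representation are illustrative assumptions, not the paper's exact data structure:

```python
def merge_into_pool(pool, new_routes, max_len=None):
    """Merge newly constructed routes into the pool, keeping for each
    set of visited cities only the shortest route seen so far.
    Routes are (length, tuple_of_city_ids) pairs (assumed format)."""
    best = {frozenset(r): (length, r) for length, r in pool}
    for length, r in new_routes:
        if max_len is not None and length > max_len:
            continue  # too long: discard outright
        key = frozenset(r)
        if key not in best or length < best[key][0]:
            best[key] = (length, r)  # shorter route for same houses wins
    return sorted(best.values())
```
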

3. Solve: The "Mathematical Architect"

Now, the computer stops guessing and starts using a powerful, exact math solver (like a super-precise architect).

  • It looks at the "Recipe Book" (the pool of good routes) and asks: "If I pick exactly 10 routes from this book, which combination covers every house exactly once and keeps the longest route the shortest?"
  • This is a strict math problem, formulated as a MILP (Mixed-Integer Linear Program), which guarantees the best possible combination from the options available.
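To make the selection logic concrete, here is a toy version of the Solve step. The paper uses a MILP solver; this sketch substitutes brute-force enumeration (fine for a handful of routes, hopeless at real scale) to show what "pick exactly k routes covering every house once, minimizing the longest" means:

```python
from itertools import combinations

def solve_subinstance(routes, cities, k):
    """Brute-force stand-in for the MILP: choose exactly k routes from the
    pool so every city is covered exactly once, minimizing the longest route.
    Routes are (length, frozenset_of_cities) pairs (assumed format)."""
    best = None
    for combo in combinations(routes, k):
        covered = [c for _, cs in combo for c in cs]
        # Valid only if cities are covered with no overlaps and none missing.
        if len(covered) == len(cities) and set(covered) == cities:
            span = max(length for length, _ in combo)
            if best is None or span < best[0]:
                best = (span, combo)
    return best
```

A quick check: with routes covering {a,b}, {c}, {a,b,c}, {a}, and {b,c}, picking k = 2 routes, the valid partitions are {a,b}+{c} and {a}+{b,c}, and the solver returns whichever has the smaller longest route.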

4. Adapt & Learn: The "Coach"

This is the magic part. After the math solver finds a great solution, the computer acts like a coach:

  • Adapt: It keeps the good routes in the pool and deletes the old, bad ones (like pruning a garden).
  • Learn: It looks at the winning solution and says, "Hey, these two cities were neighbors in the winner! Let's give them a high score so we group them together next time."
  • If a pair of cities was never together in a good solution, the computer lowers their score.
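The reward-and-decay loop above can be sketched in a few lines. The additive reward and multiplicative decay constants here are illustrative placeholders, not the paper's tuned update rule:

```python
from itertools import combinations

def update_q(q, winning_routes, all_pairs, reward=0.1, decay=0.95):
    """After the solver picks a winning solution, raise the Q-value of
    every city pair that shared a route in it, and decay all other pairs.
    Pairs are frozensets; unseen pairs start at a neutral score of 1.0."""
    winners = set()
    for route in winning_routes:
        for a, b in combinations(sorted(route), 2):
            winners.add(frozenset({a, b}))
    for pair in all_pairs:
        if pair in winners:
            q[pair] = q.get(pair, 1.0) + reward   # "these two belong together"
        else:
            q[pair] = q.get(pair, 1.0) * decay    # slowly forget unused pairs
    return q
```

The decay is what lets the method escape bad early habits: a pair that stops appearing in winning solutions gradually loses its score instead of keeping it forever.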

Why is this better than the competition?

The paper compares this new method (RL-CMSA) against the current "champion" algorithm (a Genetic Algorithm, which is like evolution: it breeds solutions and keeps the fittest).

  • The Genetic Algorithm (HGA): It's like a wild explorer. It tries many different paths, but it can get lost or wander aimlessly. It finds good solutions, but sometimes it takes a long time and isn't consistent.
  • RL-CMSA: It's like a guided tour. Because it "learns" from its own successes, it knows exactly where to look.
    • Result: On big, difficult problems (lots of cities and many drivers), RL-CMSA finds better, fairer routes faster. It is more consistent, meaning if you run it 40 times, it finds the best answer almost every time, whereas the old method might only find it a few times.

The One Weakness

The new method shines when there are many drivers (many short routes). When there are very few drivers (meaning each driver has a huge, long route), the "Mathematical Architect" step gets a bit stuck because there are fewer ways to mix and match the pieces. But for most real-world scenarios (like delivery fleets with many trucks), the new method is the clear winner.

In a Nutshell

The paper presents a system that builds potential solutions, merges the best parts, solves the math perfectly, and learns from the results to get smarter every time. It's a team of a creative builder, a strict mathematician, and a smart coach working together to ensure every delivery driver gets a fair, efficient shift.
