Imagine you are trying to solve a massive puzzle: the Traveling Salesman Problem. You need to find the shortest possible route for a delivery driver to visit 100 (or even 1,000) cities and return home. Doing this by hand is impossible, and even computers struggle, because the number of possible routes explodes as you add cities.
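To see just how fast the problem explodes, you can count the routes. For a symmetric tour (same distance in both directions), the number of distinct round trips through n cities is (n-1)!/2. A quick sketch (the function name is just illustrative):

```python
import math

def tour_count(n_cities: int) -> int:
    """Distinct round-trip routes in a symmetric TSP.

    Fix the starting city, then divide by 2 because each loop
    can be driven in either direction: (n - 1)! / 2.
    """
    return math.factorial(n_cities - 1) // 2

print(tour_count(5))    # 12 routes: checkable by hand
print(tour_count(10))   # 181,440 routes: a computer shrugs
print(tour_count(20))   # ~6.1e16 routes: brute force is hopeless
```

By 100 cities the count has more digits than there are atoms in the observable universe, which is why nobody enumerates routes and everybody looks for clever shortcuts.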
Recently, scientists started using AI (specifically "Neural Routing Solvers") to learn how to solve these puzzles automatically. Think of these AI solvers as a team of two:
- The Encoder (The Map Reader): Looks at the map and understands where the cities are.
- The Decoder (The Driver): Decides which city to visit next, step-by-step.
The Big Question
For a long time, researchers thought the "Map Reader" (Encoder) needed to be huge and powerful, while the "Driver" (Decoder) could be small. But recent studies suggested the opposite: maybe the Driver needs more brainpower.
However, everyone kept the Driver small (about the size of a small smartphone app). The big question was: What happens if we make the Driver really big? Does it just get better and better, or is there a catch?
The Experiment: Building Bigger Drivers
The authors of this paper built 12 different versions of this AI "Driver," ranging from tiny (1 million parameters) to massive (150 million parameters). They tested two ways to make the driver bigger:
- Wider: Give the driver a bigger brain (more neurons per layer) but keep the number of thinking steps the same.
- Deeper: Give the driver more layers of thinking (more steps to process information) but keep the brain size per step smaller.
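The key constraint is that both knobs spend the same currency: parameters. A rough back-of-the-envelope count for a stack of transformer-style layers (ignoring biases and norms, and using an assumed 4x feed-forward expansion) shows how a wide-shallow and a deep-narrow decoder can land on the exact same budget:

```python
def decoder_params(d_model: int, n_layers: int, ff_mult: int = 4) -> int:
    """Rough parameter count for a stack of transformer-style layers.

    Per layer: attention (~4 * d^2) + feed-forward (~2 * ff_mult * d^2).
    Biases and norms are ignored; this is only for comparing budgets.
    """
    per_layer = 4 * d_model**2 + 2 * ff_mult * d_model**2
    return n_layers * per_layer

wide_shallow = decoder_params(d_model=1024, n_layers=6)   # 6 fat layers
deep_narrow  = decoder_params(d_model=256,  n_layers=96)  # 96 thin layers

print(f"{wide_shallow:,} vs {deep_narrow:,}")  # identical budgets
```

Both configurations come out to about 75 million parameters, so the experiment really is asking: given a fixed amount of brain, is it better spent on width or on depth?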
The Surprising Discovery: Depth > Width
The results were like a plot twist in a movie.
The "Wide" Approach (More Neurons):
Imagine trying to solve a maze by giving a person a giant, wide-open room to think in. They have lots of space, but they only get to take one step before making a decision.
- Result: It helps a little, but you hit a wall quickly. Adding more "width" is like adding more furniture to a room; eventually, it just gets cluttered without helping you solve the maze faster.
The "Deep" Approach (More Layers):
Now, imagine giving that person a long hallway with many mirrors. They can look at the problem, think, look again, think again, and refine their answer step-by-step.
- Result: This worked amazingly well. The deeper models solved the puzzles much faster and with higher accuracy.
The Analogy:
Think of it like studying for a test.
- Scaling Width is like reading a textbook printed on giant, extra-wide pages. You can see more words at once, but you might not understand the deep connections.
- Scaling Depth is like reading the same book, but then reading a summary, then reading a critique, then teaching it to a friend. You are processing the same amount of information, but you are thinking about it more times.
The Three Golden Rules
Based on this "Depth is King" discovery, the authors gave us three simple rules for building better AI:
- Go Deep, Not Wide: If you have a limited parameter budget for building an AI, don't make it wide and shallow. Make it deep and narrow. A 100-layer AI with narrow layers will generally beat a 6-layer AI with giant layers at the same total size.
- Deep Models Learn Faster: If you don't have a lot of training data (like a student with only one textbook), a deep model is better at squeezing every drop of knowledge out of that single book. A wide model needs a library to learn the same amount.
- Match Depth to Your Time:
- If you need an answer fast (like a delivery driver in traffic), use a medium-depth model. It's a good balance.
- If you have plenty of time (like planning a route for next week), use a super-deep model. It will find the absolute perfect route, even if it takes a bit longer to think.
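Rule 3 amounts to a simple deployment policy: keep a family of decoders at different depths and pick the deepest one that fits your time budget. A toy sketch (the layer counts and per-instance timings below are illustrative, not measured results from the paper):

```python
def pick_depth(time_budget_s: float) -> int:
    """Toy policy for Rule 3: deeper decoders cost more time per
    solution, so choose the deepest model the budget allows.
    Depths and timings here are hypothetical placeholders."""
    # (layers, rough seconds per instance) for three model sizes
    models = [(6, 0.05), (24, 0.4), (96, 3.0)]
    feasible = [layers for layers, cost in models if cost <= time_budget_s]
    return max(feasible) if feasible else models[0][0]

print(pick_depth(0.1))    # tight budget (driver in traffic) -> 6 layers
print(pick_depth(10.0))   # generous budget (next week's route) -> 96 layers
```

The point isn't the specific numbers; it's that depth becomes a dial you turn based on how long you can afford to let the model think.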
Why This Matters
This paper changes how we build AI for logistics, chip manufacturing, and delivery routes. Instead of just throwing more money at bigger, wider models, we should build taller, deeper models.
The Bottom Line:
If you want your AI to be a genius at solving complex routing puzzles, don't just give it a bigger brain; give it more time to think by stacking more layers of intelligence on top of each other. Depth beats width.