Learning-guided Prioritized Planning for Lifelong Multi-Agent Path Finding in Warehouse Automation

This paper introduces RL-RH-PP, a novel framework that integrates Reinforcement Learning with Rolling Horizon Prioritized Planning to dynamically assign agent priorities in lifelong Multi-Agent Path Finding, thereby significantly improving warehouse throughput and generalization across diverse operational conditions compared to existing methods.

Han Zheng, Yining Ma, Brandon Araki, Jingkai Chen, Cathy Wu

Published 2026-03-26

Imagine a massive, high-tech warehouse filled with hundreds of autonomous robots. Their job is to zip around, pick up packages, and deliver them to shipping docks. This is the world of Warehouse Automation.

The problem? When hundreds of robots move at once, they get in each other's way. It's like rush hour traffic on a highway, but with robots. If they aren't coordinated perfectly, they get stuck in gridlock, slowing down the whole operation and costing the company money.

This paper introduces a new, smarter way to manage this traffic using a mix of old-school rules and modern AI.

The Problem: The "Traffic Jam" Dilemma

In the past, scientists tried to solve this in two ways:

  1. The "Perfect Planner" (Search-Based): This tries to calculate the perfect path for every single robot at once. It's like a super-genius traffic controller trying to direct every car on Earth simultaneously. It works great for small groups, but with 100+ robots the search space explodes, and the planner can no longer keep up in real time.
  2. The "Random Order" (Prioritized Planning): This is simpler. It says, "Okay, Robot A goes first, then Robot B, then Robot C." It's fast, but if you pick the wrong order (e.g., sending a robot into a crowded hallway first), you create a jam that ruins the whole system.
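The "pick an order, plan one at a time" idea is easy to sketch in code. Below is a minimal, illustrative prioritized planner on a grid (all names are ours, not the paper's): each agent plans in turn, and every finished path becomes a moving obstacle for the agents after it. Notice how the outcome depends entirely on the `order` argument — that is exactly the knob the paper's AI learns to turn.

```python
from collections import deque

def bfs_path(grid, start, goal, reserved, max_t=200):
    """Time-expanded BFS: find one agent's path that avoids cells
    already reserved at each timestep by higher-priority agents.
    (Illustrative sketch, not the paper's planner; it checks vertex
    conflicts only and ignores swap conflicts for brevity.)"""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0, [start])])
    seen = {(start, 0)}
    while queue:
        (r, c), t, path = queue.popleft()
        if (r, c) == goal:
            return path
        if t >= max_t:
            continue
        # wait in place, or move in one of four directions
        for dr, dc in ((0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)):
            nr, nc = r + dr, c + dc
            nxt = ((nr, nc), t + 1)
            if (0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0
                    and (nr, nc) not in reserved.get(t + 1, set())
                    and nxt not in seen):
                seen.add(nxt)
                queue.append(((nr, nc), t + 1, path + [(nr, nc)]))
    return None  # blocked: this priority order jams this agent

def prioritized_plan(grid, tasks, order):
    """Plan agents one by one in the given priority order, reserving
    each finished path so lower-priority agents treat it as a moving
    obstacle. A bad order can fail even when a good order would
    succeed -- that is the gap learned priorities aim to close."""
    reserved, paths = {}, {}
    for agent in order:
        start, goal = tasks[agent]
        path = bfs_path(grid, start, goal, reserved)
        if path is None:
            return None
        paths[agent] = path
        for t, cell in enumerate(path):
            reserved.setdefault(t, set()).add(cell)
    return paths
```

Swapping the order in which two head-on agents plan can turn a quick solution into a jam — which is why the choice of order is worth learning.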

The Solution: The "Smart Traffic Cop" (RL-RH-PP)

The authors created a hybrid system called RL-RH-PP. Think of it as a team consisting of a Fast Runner and a Smart Coach.

1. The Fast Runner (The Backbone)

They kept the "Prioritized Planning" method because it's fast and simple. This is the Runner. It just needs a list of who goes first, second, and third, and it can quickly draw the paths. But the Runner is blind; it doesn't know which order is best.

2. The Smart Coach (The AI)

This is where the magic happens. They added a Reinforcement Learning (RL) AI, which acts as the Coach.

  • How it learns: The Coach watches the warehouse. It sees where the robots are, where they are going, and where the traffic is getting tight.
  • The "Rolling Horizon" Trick: Instead of planning the whole day at once (which is impossible), the Coach plans in short chunks, like looking 20 seconds into the future. As time moves, the Coach updates its plan, just like a driver adjusting their route when they see a new traffic jam ahead.
  • The Decision: The Coach doesn't just pick a random order. It uses a neural network (a brain-like computer model) to figure out: "If I let Robot #42 go first, it will block the aisle. But if I let Robot #15 go first, it clears the path for everyone else."
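Put together, the Coach-plus-Runner loop looks roughly like this. It's a skeleton under our own names — `policy`, `planner`, and the window sizes are illustrative placeholders, not the paper's API:

```python
def rolling_horizon_step(state, policy, planner, window=20, replan_every=5):
    """One iteration of a rolling-horizon prioritized planner (sketch).

    `policy` maps the current warehouse state to one priority score
    per agent (the Coach); `planner` is any prioritized planner that
    accepts an agent ordering (the Runner). Only the first few steps
    of each plan are executed before everything is recomputed with
    fresh traffic information.
    """
    scores = policy(state)                                # agent -> score
    order = sorted(scores, key=scores.get, reverse=True)  # the Coach's call
    paths = planner(state, order, horizon=window)         # plan a short window
    # commit only the near-term prefix; replan before the rest goes stale
    return {a: p[:replan_every + 1] for a, p in paths.items()}
```

The key design choice is that the neural network only outputs scores: all the hard geometric work of drawing collision-free paths stays with the fast, classical planner.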

The Secret Sauce: The "Backtracking" Move

The most fascinating part of this paper is what the AI learned to do that humans wouldn't naturally think of.

Imagine a narrow hallway where two robots are stuck facing each other.

  • A human planner might say, "Robot A is closer to the exit, so let Robot A go."
  • The AI Coach realized that sometimes, the robot closest to the exit should actually back up.

By letting the robot near the exit step backward (even though it seems counter-intuitive), it clears a "parking spot" for the robot stuck in the middle to squeeze past. Once the middle robot passes, the first robot can move forward again. The AI learned that short-term backward steps create long-term forward speed.
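We can check this maneuver mechanically. The sketch below is our own toy example (not from the paper): it validates hand-written robot traces in a hallway with one side pocket. The head-on "just keep driving" trace fails the collision check, while the trace where the robot nearest the exit first steps backward into the pocket passes and gets both robots home.

```python
def conflict_free(traces):
    """Return True if equal-length agent traces have no vertex
    conflicts (two robots on one cell at the same time) and no
    swap conflicts (two robots exchanging cells in one step)."""
    agents = list(traces)
    T = len(traces[agents[0]])
    for t in range(T):
        cells = [traces[a][t] for a in agents]
        if len(set(cells)) != len(cells):
            return False  # vertex conflict
        for i, a in enumerate(agents):
            for b in agents[i + 1:]:
                if (t > 0 and traces[a][t] == traces[b][t - 1]
                        and traces[b][t] == traces[a][t - 1]):
                    return False  # swap conflict
    return True

# Hallway is row 0, cells (0,0)..(0,5); the "parking spot" is (1,4).
# A starts at (0,3) heading LEFT to (0,0); B starts at (0,2) heading
# RIGHT to (0,5). They are face to face in the corridor.

# Naive plan: both drive straight at each other -> swap conflict.
head_on = {
    "A": [(0, 3), (0, 2), (0, 1), (0, 0)],
    "B": [(0, 2), (0, 3), (0, 4), (0, 5)],
}

# Learned maneuver: A (nearest the exit) first steps BACKWARD, away
# from its goal, parks in (1,4), lets B squeeze past, then resumes.
detour = {
    "A": [(0, 3), (0, 4), (1, 4), (1, 4), (0, 4), (0, 3), (0, 2), (0, 1), (0, 0)],
    "B": [(0, 2), (0, 2), (0, 3), (0, 4), (0, 5), (0, 5), (0, 5), (0, 5), (0, 5)],
}
```

Counting steps makes the trade explicit: A spends two moves going the "wrong" way, but the pair finishes, whereas the naive plan never completes at all.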

Why This Matters

The researchers tested this in two types of warehouses:

  1. Amazon-style: Lots of open space, but many robots.
  2. Symbotic-style: Very crowded, with narrow aisles and lots of obstacles (like a maze).

The Results:

  • 25% More Throughput: The AI-guided system moved 25% more packages than the standard methods.
  • It Got Smarter Over Time: As the AI saw more traffic jams, it got better at predicting them before they happened.
  • Zero-Shot Generalization: The AI was trained on one specific warehouse layout. When they dropped it into a completely different layout (different aisle sizes, different robot counts) without any retraining, it still beat the old methods. It was like teaching a driver to drive in New York, and then having them immediately drive well in Tokyo without a map.

The Big Picture

This paper proves that we don't have to choose between "fast but dumb" algorithms and "smart but slow" ones. By using AI to make the decisions about who goes first, and letting a fast, simple algorithm do the heavy lifting of drawing the paths, we get the best of both worlds.

It's the difference between a chaotic crowd of people trying to leave a stadium and a well-organized crowd where a smart usher directs the flow, knowing exactly when to let a group step back so the whole line can move forward faster.