Multi-Objective Reinforcement Learning for Large-Scale Tote Allocation in Human-Robot Collaborative Fulfillment Centers

This paper proposes a Multi-Objective Reinforcement Learning framework, grounded in minimax policy learning and no-regret dynamics, for tote allocation in human-robot collaborative fulfillment centers. The approach balances competing objectives such as speed and space utilization while satisfying operational constraints in large-scale industrial settings.

Sikata Sengupta, Guangyi Liu, Omer Gottesman, Joseph W Durham, Michael Kearns, Aaron Roth, Michael Caldara

Published 2026-03-02

Imagine a massive, high-tech warehouse where robots and humans work side by side, like a dance team trying to pack millions of boxes (called "totes") onto shelves. This is the world of Amazon's "Sequoia" system.

This paper tackles a very specific, tricky problem in this dance: the Consolidation Shuffle.

The Problem: The "Too Full, Too Empty" Dance

Imagine your closet is a mess. You have some hangers with just one shirt, and others stuffed with ten. To make space for new clothes, you need to move shirts from the crowded hangers to the empty ones.

In the warehouse, this is called consolidation.

  • The Goal: Move items from partially full totes to other totes to free up space for new inventory.
  • The Catch: You have two types of dancers: Humans and Robots.
    • Humans are flexible. They can pick up weird, heavy, or fragile items. But they get tired, and they are expensive (they take up space in the workflow).
    • Robots are fast and consistent. But they can only handle specific types of items. If you give them a weird item, they drop it or get stuck.

The warehouse manager has to decide: Who moves this specific item? Which tote goes where?

If they make the wrong choice, the robots get stuck, the humans get overwhelmed, or the warehouse runs out of space. They need to balance three competing goals:

  1. Speed: Get things done fast.
  2. Space: Don't let the shelves get too full or too empty.
  3. Balance: Don't let the robot queue get too long while the humans sit idle (or vice versa).
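
In learning terms, each goal stays its own reward signal instead of being blended into one number. Here is a minimal sketch of what that vector of rewards could look like; all the names, targets, and scalings are illustrative, not taken from the paper:

```python
import numpy as np

def reward_vector(items_moved, shelf_fill, robot_queue, human_queue):
    """One reward per objective, kept separate rather than blended.
    Hypothetical signals: the paper's actual reward design may differ."""
    speed = float(items_moved)                 # throughput this step
    space = -abs(shelf_fill - 0.75)            # stay near a target fill level
    balance = -abs(robot_queue - human_queue)  # keep both queues comparable
    return np.array([speed, space, balance])
```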

The Old Way: The "Weighted Score" Trap

Traditionally, managers tried to solve this by creating a single "score." They would say, "Speed is worth 50 points, Space is worth 30 points, and Balance is worth 20."

The problem? It's rigid.

  • If the warehouse is suddenly flooded with new orders, they need to prioritize speed, but the score is still stuck on the old weights.
  • If they change the weights manually, they might accidentally break the system (e.g., prioritizing speed so much that the robots crash).
  • It's like trying to drive a car by only looking at the speedometer, ignoring the gas gauge and the engine temperature.
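
In code, the weighted-score approach collapses that reward vector with weights frozen at design time. A sketch, continuing the illustrative example above:

```python
import numpy as np

WEIGHTS = np.array([0.5, 0.3, 0.2])  # speed, space, balance: fixed forever

def weighted_score(r_vec):
    # One blended number. If demand spikes and speed suddenly matters more,
    # these weights don't know that; a human has to re-tune them by hand
    # and hope the new numbers don't break something else.
    return float(WEIGHTS @ r_vec)
```

Changing WEIGHTS means hand-tuning a live system, which is exactly the rigidity described above.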

The New Way: The "Tug-of-War" Game

The authors propose a smarter way using Multi-Objective Reinforcement Learning (MORL). Instead of one boss giving orders, they set up a Tug-of-War game between two characters:

  1. The Learner (The Agent): This is the AI trying to move the totes. Its goal is to be as fast and efficient as possible.
  2. The Regulator (The Referee): This is a smart overseer watching the Learner. Its job is to make sure the Learner doesn't break the rules (like overloading the robots or running out of space).

How the game works:

  • Round 1: The Learner tries to move totes as fast as it can. It ignores the rules a bit to get a high score.
  • The Referee's Move: The Referee sees the Learner is breaking the rules (e.g., "Hey, you're overloading the robot station!"). The Referee slaps a "penalty" on the Learner.
  • Round 2: The Learner tries again. Now, it has to balance speed with the penalty. It learns to slow down just enough to keep the robots happy.
  • The Dance: They repeat this thousands of times. The Learner gets faster, and the Referee gets smarter about how much to penalize.

Eventually, they reach a Perfect Balance (Minimax Equilibrium). The Learner finds a strategy that is fast and respects all the rules, without anyone needing to manually tell it "Speed is 50% important." The system figures out the perfect mix on its own.
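
To make the tug-of-war concrete, here is a toy sketch in Python. It is not the authors' implementation: the two actions, the numbers, and the specific updates (multiplicative weights for the Learner, a projected gradient step for the Regulator's penalty, both standard no-regret-style dynamics) are invented for illustration:

```python
import numpy as np

# Toy tug-of-war: a "fast but rule-breaking" action vs. a "slow but safe"
# one, with a Regulator enforcing a budget on rule-breaking.
reward = np.array([1.0, 0.3])   # the fast action pays more...
cost   = np.array([1.0, 0.0])   # ...but each use spends violation budget
budget = 0.4                    # allowed expected violation per step

weights = np.ones(2)            # Learner: multiplicative-weights player
lam = 0.0                       # Regulator: the penalty dial (multiplier)
eta = 0.05                      # step size for both players
history, avg_policy = [], np.zeros(2)

for _ in range(5_000):
    policy = weights / weights.sum()   # Learner's current mix of actions
    history.append(policy)
    avg_policy += policy / 5_000
    # Learner's move: chase reward minus the Regulator's current penalty.
    weights = weights * np.exp(eta * (reward - lam * cost))
    # Regulator's move: raise the penalty whenever the budget is exceeded,
    # relax it (down to zero) when the Learner behaves.
    lam = max(0.0, lam + eta * (policy @ cost - budget))

print(avg_policy)  # hovers near [0.4, 0.6]: fast 40% of the time, on budget
```

The averaged play settles near the equilibrium: the Learner uses the fast action just often enough to exhaust, but not exceed, the violation budget, and nobody ever typed in "Speed is 50% important."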

The Magic Trick: The "Oscillating" Solution

Here is the clever part the paper discovered.

In theory, the "perfect" solution is a mixture. Imagine the AI says: "For the next 10 minutes, I'll be super fast but risky. For the 10 minutes after that, I'll be super slow but safe." If you average those two strategies, you get a perfect result.

But in real life, you can't switch back and forth every 10 minutes; it's chaotic. You need one single strategy that works all the time.

The paper proves that even though the AI is technically "oscillating" between different strategies during training, if you look closely, it naturally stumbles upon a single, steady strategy that satisfies all the rules perfectly. It's like a tightrope walker who wobbles left and right while learning, but eventually finds a steady path that keeps them from falling.
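
Continuing the toy sketch above (same caveats): the classical guarantee is about avg_policy, the average of everything the Learner tried. The paper's point is that you do not have to deploy a shifting mixture; under its conditions, some single iterate along the way already respects the rules on its own. In the toy, you can simply scan for it:

```python
def pick_single_policy(history, reward, cost, budget):
    # Keep only the iterates that satisfy the constraint outright...
    feasible = [p for p in history if p @ cost <= budget]
    # ...then return the fastest of those. In this toy there are plenty;
    # the paper gives conditions under which such an iterate must exist.
    return max(feasible, key=lambda p: p @ reward)

single = pick_single_policy(history, reward, cost, budget)  # one steady policy
```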

The Results: Why It Matters

The team tested this in a super-accurate computer simulation of a real Amazon warehouse.

  • Old methods (single-objective) either crashed the robots or moved too slowly.
  • The new method learned a single policy that:
    • Kept the robots from getting overloaded.
    • Kept the humans from getting bored.
    • Moved items faster than the random baseline.
    • Crucially: It did all this without a human ever having to tweak the settings.

The Takeaway

This paper shows that we can teach AI to manage complex, real-world systems (like warehouses) by letting it play a game of "push and pull" against a virtual referee. Instead of hard-coding rules, the AI learns to find the "sweet spot" where speed, safety, and efficiency all coexist. It's a major step toward making our future warehouses run smoother, faster, and with less human guesswork.
