Scale-Plan: Scalable Language-Enabled Task Planning for Heterogeneous Multi-Robot Teams

Imagine you are the manager of a busy kitchen with a team of robots. Your job is to give them a complex order: "Put the apple in the fridge, turn off the light, and clean up the counter."

Now, imagine the kitchen is cluttered with 50 different items: a toaster, a tomato, a pot, a dusty bin, a loaf of bread, and a knife.

The Problem: The "Too Much Info" Bottleneck

If you ask a standard robot team (or a basic AI) to solve this, they might get overwhelmed. They see everything in the kitchen. They might think, "Wait, do I need the tomato? The bread? The toaster?"

This is like trying to find a specific needle in a haystack while the hay is on fire. The AI gets confused, wastes time thinking about irrelevant objects, and might even hallucinate (make things up), like saying, "I'll put the tomato in the fridge" (even though you asked for an apple) or "I'll open a cabinet that doesn't exist."

This is the problem the paper Scale-Plan solves.

The Solution: The "Smart Filter"

The authors created a system called Scale-Plan. Think of it as a super-smart sous-chef who acts as a filter before the robots even start moving.

Here is how it works, using a simple analogy:

1. The "Action Map" (The Blueprint)

Before the robots ever see the kitchen, the system builds a giant map of connections (called an Action Graph).

Imagine a flowchart that says: "To Slice a tomato, you must first Pick up a knife."
It knows that "Turning off a light" has nothing to do with "Washing a dish."
This map is built from the rules of the world (the PDDL domain), not from the messy kitchen itself. It's the rulebook.

2. The "Shallow Reasoning" (The Quick Glance)

When you give the command ("Put apple in fridge"), the system doesn't look at the whole kitchen. It looks at its Action Map.

It asks the Large Language Model (LLM) a simple question: "What steps do we need for an apple and a fridge?"
The LLM says: "Go to apple, pick it up, go to fridge, open fridge, put apple in, close fridge."
Crucially: It ignores the tomato, the toaster, and the bread. It filters out most of the noise.

3. The "Team Huddle" (Task Allocation)

Now that the system knows only the relevant steps, it assigns them to the robots.

"Robot A, you handle the apple."
"Robot B, you go turn off the light."
Because the list is short and clean, the robots are far less likely to get confused, and they execute the plan much more reliably.

Why is this better than what we had before?

Old Way (Pure LLM): Like asking a genius but distracted chef to plan the whole meal while staring at a messy counter. They might forget to open the fridge or grab the wrong vegetable.
Middle Way (LLM + Symbolic): Like asking the chef to write a formal recipe, then handing that recipe to a strict robot. But if the chef wrote the recipe wrong (hallucinated a step), the robot fails.
Scale-Plan: Like giving the chef a filtered checklist. The chef only looks at the items needed for this specific order. The checklist is short, accurate, and much harder to get wrong because it's based on the logical rules of the kitchen, not just a guess.

The "MAT2-THOR" Benchmark

The authors also realized that the existing tests for robot planning were messy (like a kitchen with broken instructions). They cleaned it up and created a new, fair test called MAT2-THOR.

It's like taking a messy, confusing exam and rewriting it so the questions make sense and the answers are clear.
On this new test, Scale-Plan crushed the competition, solving complex tasks much more often than the other methods.

The Bottom Line

Scale-Plan is about focus. In a world full of data, the smartest thing a robot can do is ignore the irrelevant stuff. By using a logical map to filter out the noise before planning, it allows teams of robots to work together efficiently, without getting tripped up by the clutter of the real world.

It turns a chaotic, overwhelming task into a simple, step-by-step checklist that a robot team can follow with far fewer mistakes.

Here is a detailed technical summary of the paper "Scale-Plan: Scalable Language-Enabled Task Planning for Heterogeneous Multi-Robot Teams."

1. Problem Statement

The paper addresses the challenge of long-horizon task planning for heterogeneous multi-robot systems in complex, object-rich environments (e.g., household settings).

The Bottleneck: Real-world environments contain vast amounts of perceptual data, much of which is irrelevant to the specific task. Including all objects and capabilities in the planning process creates a massive combinatorial search space, leading to inefficiency and planning failures.
Limitations of Existing Approaches:
- Traditional Symbolic Planners (PDDL): Rely on manually constructed problem specifications, which lack scalability and adaptability in dynamic environments.
- Pure LLM Planners: Suffer from hallucinations, weak grounding (poor alignment with actual objects), and context length limitations, often generating infeasible plans in cluttered scenes.
- Hybrid LLM-PDDL: While promising, they often fail because LLM-generated intermediate PDDL files contain errors (missing constraints, hallucinated entities) due to the difficulty of grounding natural language into precise symbolic formats without filtering irrelevant data first.

2. Methodology: Scale-Plan

Scale-Plan is a scalable framework that combines offline domain structure analysis with online shallow LLM reasoning to filter information before planning. It avoids generating intermediate PDDL problem files, instead synthesizing executable plans directly from a filtered representation.

The framework consists of two main stages:

A. Offline: Action Graph Construction

Input: A PDDL domain specification (defining predicates, actions, and types).
Process: Constructs a directed Action Graph where:
- Nodes: Parameterized action schemas.
- Edges: Logical dependencies between actions.
- Edge Rules:
  1. Strict Edge: Added if the effects of action $a_1$ fully satisfy the preconditions of $a_2$ ( $PRE(a_2) \subseteq EFF(a_1)$ ).
  2. Relaxed Edge: Added if there is a partial overlap ( $PRE(a_2) \cap EFF(a_1) \neq \emptyset$ ) to ensure graph connectivity without over-densification.
Goal: This graph captures the domain's logical structure independent of specific instances.
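
The two edge rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ActionSchema` class, the toy kitchen actions, and the choice to compare predicates by name only (ignoring parameters and negative effects) are all hypothetical. The sketch also skips actions with empty preconditions, since the subset test would otherwise hold trivially for them.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class ActionSchema:
    """Hypothetical, simplified action schema: predicates are plain names."""
    name: str
    pre: frozenset  # predicates that must hold before the action
    eff: frozenset  # predicates the action makes true


def build_action_graph(actions):
    """Map each action name to {successor name: "strict" | "relaxed"}."""
    graph = {a.name: {} for a in actions}
    for a1, a2 in permutations(actions, 2):
        if a2.pre and a2.pre <= a1.eff:
            # Strict edge: PRE(a2) is a subset of EFF(a1).
            graph[a1.name][a2.name] = "strict"
        elif a2.pre & a1.eff:
            # Relaxed edge: PRE(a2) and EFF(a1) overlap only partially.
            graph[a1.name][a2.name] = "relaxed"
    return graph


# Toy domain (assumed for illustration, not taken from the paper).
ACTIONS = [
    ActionSchema("goto", frozenset(), frozenset({"at"})),
    ActionSchema("pickup", frozenset({"at"}), frozenset({"holding"})),
    ActionSchema("open", frozenset({"at"}), frozenset({"is_open"})),
    ActionSchema("put_in", frozenset({"holding", "is_open"}), frozenset({"inside"})),
    ActionSchema("switch_off", frozenset({"at"}), frozenset({"is_off"})),
]

graph = build_action_graph(ACTIONS)
for source, targets in graph.items():
    print(source, "->", targets)
```

In this toy domain, `goto` gets strict edges to `pickup`, `open`, and `switch_off`, while `pickup` and `open` each get only a relaxed edge to `put_in`, because neither alone satisfies both of its preconditions.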

B. Online: Task-Relevant Filtering & Planning

Given a natural language task instruction and a specific environment (e.g., AI2-THOR):

Shallow LLM Reasoning: The LLM proposes a small set of candidate terminal actions and relevant object parameters based on the task description.
Graph Search (Backward DFS): The system performs a backward depth-first search on the Action Graph starting from the proposed terminal nodes. It identifies the minimal subset of predecessor actions required to satisfy preconditions.
Environment Filtering: This process extracts only the task-relevant objects and actions, discarding irrelevant entities (e.g., ignoring a "tomato" when the task is to move an "apple").
Structured Planning Pipeline: Using this filtered representation, the system executes:
- Task Decomposition: Breaking the high-level instruction into subtasks.
- Task Allocation: Assigning subtasks to specific robots based on their heterogeneous capabilities.
- Plan Integration: Merging sub-plans into a coherent, parallelizable execution strategy.
Plan-to-Code: The final plan is translated into executable code for the simulator (AI2-THOR) without ever generating an intermediate PDDL problem file.
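
The backward search and environment filtering steps can be sketched as follows. This is a simplified illustration, assuming a hand-written action graph, a hard-coded stand-in for the LLM proposal, and made-up object names. It collects every ancestor of the terminal actions, which is a coarser approximation of the "minimal subset" selection described above.

```python
# Hypothetical action graph: each key enables the actions in its list.
ACTION_GRAPH = {
    "goto": ["pickup", "open", "switch_off", "slice"],
    "pickup": ["put_in", "slice"],
    "open": ["put_in"],
    "put_in": [],
    "switch_off": [],
    "slice": [],
}


def invert(graph):
    """Build a predecessor map so the search can walk edges backward."""
    preds = {node: [] for node in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            preds[dst].append(src)
    return preds


def backward_dfs(graph, terminals):
    """Return the terminal actions plus every action that can lead to them."""
    preds = invert(graph)
    visited = set()
    stack = list(terminals)
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        stack.extend(preds[node])
    return visited


def filter_environment(scene_objects, relevant_objects):
    """Keep only the scene objects named as relevant, preserving scene order."""
    return [obj for obj in scene_objects if obj in relevant_objects]


# Stand-in for the shallow LLM step on "Put the apple in the fridge" (assumed output).
proposal = {"terminal_actions": ["put_in"], "objects": {"Apple", "Fridge"}}
scene = ["Apple", "Tomato", "Toaster", "Fridge", "Bread", "Knife"]

relevant_actions = backward_dfs(ACTION_GRAPH, proposal["terminal_actions"])
kept_objects = filter_environment(scene, proposal["objects"])

print(sorted(relevant_actions))
print(kept_objects)
```

Here `slice` and `switch_off` never appear in the result because no backward path connects them to `put_in`, and the tomato, toaster, bread, and knife are dropped from the object list before any planning takes place.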

3. Key Contributions

Scale-Plan Framework: A novel architecture that uses an offline Action Graph to guide online LLM reasoning, enabling the extraction of minimal, task-relevant environmental information. This significantly reduces combinatorial complexity.
Direct Plan Synthesis: The system decomposes tasks and allocates robots without relying on error-prone intermediate PDDL problem generation, improving robustness in object-rich settings.
MAT2-THOR Benchmark: The authors introduced a cleaned, standardized benchmark derived from the existing MAT-THOR dataset. It corrects ground-truth errors, removes duplicates, and introduces a num_contains parameter for precise evaluation of multi-agent tasks in AI2-THOR.
Empirical Validation: Comprehensive evaluation showing superior performance over pure LLM and hybrid baselines across task completion, goal recall, and executability.

4. Experimental Results

The framework was evaluated on the MAT2-THOR benchmark (49 tasks: Simple, Complex, Vague) using GPT-5.2.

Performance Metrics:
- Task Completion Rate (TCR): Scale-Plan achieved 78% overall TCR, outperforming the strongest baseline (LaMMA-P LLM-corrected) by 25%.
- Goal Condition Recall (GCR): Achieved 85%, a 16% improvement over the baseline.
- Executability Rate (ER): Achieved 94%, a 9% improvement, indicating fewer low-level execution failures.
Ablation Study: Removing the environment filtering (No-EF) or replacing the Action Graph with simple LLM filtering (LLM-SF) resulted in significant drops in TCR (down to ~65-67%), indicating that the structured Action Graph filtering is critical for complex, long-horizon tasks.
Efficiency Trade-off: While Scale-Plan has a higher planning time (~62s) compared to pure LLM approaches (~12s) due to multiple LLM inference calls and graph search, this cost is justified by the massive gains in plan quality and success rates.
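
For readers unfamiliar with the three metrics, the sketch below shows one common way such rates are computed. The definitions here are assumptions, since the summary does not spell them out: TCR as the share of tasks with every goal condition met, GCR as the pooled share of goal conditions met, and ER as the pooled share of planned actions that executed. The `TaskResult` class and the sample numbers are invented for illustration and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; the field names are illustrative."""
    goals_total: int
    goals_met: int
    actions_total: int
    actions_executed: int


def task_completion_rate(results):
    """Assumed definition: fraction of tasks whose goal conditions are all met."""
    done = sum(1 for r in results if r.goals_met == r.goals_total)
    return done / len(results)


def goal_condition_recall(results):
    """Assumed definition: goal conditions met over goal conditions required."""
    return sum(r.goals_met for r in results) / sum(r.goals_total for r in results)


def executability_rate(results):
    """Assumed definition: actions executed over actions planned."""
    return sum(r.actions_executed for r in results) / sum(r.actions_total for r in results)


# Made-up sample data, not taken from the paper's experiments.
results = [
    TaskResult(goals_total=3, goals_met=3, actions_total=10, actions_executed=10),
    TaskResult(goals_total=4, goals_met=2, actions_total=8, actions_executed=6),
    TaskResult(goals_total=2, goals_met=2, actions_total=5, actions_executed=5),
]

tcr = task_completion_rate(results)
gcr = goal_condition_recall(results)
er = executability_rate(results)
print(f"TCR={tcr:.2f} GCR={gcr:.2f} ER={er:.2f}")
```

Under these assumed definitions, a partially completed task lowers TCR sharply but only dents GCR, which is why the two metrics are reported side by side.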

5. Significance and Future Work

Scalability: Scale-Plan demonstrates that filtering irrelevant information before planning is essential for scaling multi-robot systems to real-world, cluttered environments.
Reliability: By avoiding the "hallucination trap" of generating full PDDL files from raw sensory data, the system produces more reliable and executable plans.
Limitations: The current approach lacks explicit environmental grounding (direct alignment between language and simulator state), which can lead to localization errors. It also struggles with highly vague instructions where object relevance is ambiguous.
Future Directions: The authors plan to integrate structured knowledge graphs to improve grounding and develop replanning mechanisms to recover from execution failures dynamically.

In conclusion, Scale-Plan represents a significant step forward in bridging the gap between high-level language instructions and low-level robotic execution, offering a scalable solution for heterogeneous teams operating in complex, unstructured environments.