SLAP: Shortcut Learning for Abstract Planning

The paper proposes SLAP, a method that uses model-free reinforcement learning to automatically discover new abstract-action "shortcuts" within existing Task and Motion Planning (TAMP) systems, substantially shortening plans and improving task success rates compared to traditional planning and hierarchical RL approaches.

Y. Isabel Liu, Bowen Li, Benjamin Eysenbach, Tom Silver

Published 2026-03-03

Imagine you are trying to clean up a messy room. You have a strict set of rules (a "manual") that tells you exactly how to move things: Pick up a toy, Walk to the bin, Drop the toy.

If you have a tower of blocks blocking a specific spot, the manual says: "Pick up block A, move it. Pick up block B, move it. Pick up block C, move it." You do this one by one until the spot is clear. It works, but it takes forever.

A clever child, however, wouldn't follow the manual step-by-step. They would pick up the toy they need, then slap the whole tower of blocks aside with their hand, clearing the space in one go, and then drop the toy. It's faster, messier, and definitely not in the manual, but it gets the job done.

This paper is about teaching robots to be that clever child.

Here is the breakdown of SLAP (Shortcut Learning for Abstract Planning) in simple terms:

1. The Problem: The Robot is Too Rigid

Current robots are great at following a "To-Do List" (called Task and Motion Planning). They can break a big job into small, logical steps like "Pick," "Place," and "Move."

  • The Catch: The robot can only do what humans explicitly programmed it to do. If the robot needs to "slap" a tower of blocks to clear a path, it can't do that because "slapping" isn't on its list of approved moves. It will try to move the blocks one by one, which is slow and inefficient.

2. The Solution: SLAP (The Robot's "Aha!" Moment)

The authors created a system called SLAP. Think of SLAP as a smart coach that watches the robot try to solve problems and says, "Hey, you're doing it the hard way. Let's try a shortcut."

Here is how SLAP works, using a Video Game Analogy:

  • The Map (Abstract Planning): Imagine a map of a video game level. The robot knows the "official" paths (the long, winding roads where it moves one block at a time).
  • The Cheat Codes (Shortcuts): SLAP uses a technique called Reinforcement Learning (trial and error) to find "cheat codes" or "portals" on the map.
    • Instead of walking from Point A to Point B, the robot learns a new move: "If I hold this block and spin my arm, I can knock the whole tower over."
    • This new move is a Shortcut. It connects two points on the map that were previously far apart.
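The "map plus portals" picture can be sketched as a toy graph search. This is purely illustrative (the paper's abstract planner and state representation are far richer); the state names and the simple breadth-first search below are made up for this example:

```python
from collections import deque

# Toy abstract planning graph: each key maps a state to reachable next states.
# "tower3" means three blocks still block the target spot.
graph = {
    "tower3": ["tower2"],   # move blocks one at a time...
    "tower2": ["tower1"],
    "tower1": ["clear"],
    "clear": ["done"],      # place the toy
}

def plan_length(graph, start, goal):
    """Breadth-first search; returns the shortest plan length in steps."""
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        state, steps = frontier.popleft()
        if state == goal:
            return steps
        for nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, steps + 1))
    return None

print(plan_length(graph, "tower3", "done"))  # 4 steps: the "official" path

# A learned shortcut ("slap the whole tower aside") adds one new edge:
graph["tower3"].append("clear")
print(plan_length(graph, "tower3", "done"))  # 2 steps: slap, then place
```

The key point the sketch captures: a shortcut is just a new edge on the existing map, so the planner's ordinary graph search immediately exploits it with no other changes.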

3. How It Learns (The Training Camp)

SLAP doesn't just guess; it practices.

  1. Identify the Gap: It looks at the "official map" and sees a long, boring path between two states (e.g., "Holding the target block" and "The floor is clear").
  2. Create a Mini-Game: It isolates just that specific problem and creates a tiny, focused training environment.
  3. Practice: The robot tries thousands of random moves in this mini-game. Eventually, it stumbles upon a cool, dynamic move (like a "slap," "wiggle," or "wipe") that clears the path instantly.
  4. Save the Move: Once the robot masters this new move, SLAP saves it as a new "option" on the main map.
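The four steps above can be sketched as a tiny training loop. All names here are hypothetical stand-ins, not the paper's API, and the "practice" step is caricatured as random search where real SLAP trains a reinforcement-learning policy in a simulator:

```python
import random

random.seed(0)

# Hypothetical mini-game: clear a 3-block tower. A "move" is an arm command;
# command 7 happens to be the dynamic "slap" that clears everything at once.
# This stands in for a real physics simulator.
SLAP_COMMAND = 7

def mini_game_success(command):
    """Returns True if the tried command clears the tower in one motion."""
    return command == SLAP_COMMAND

def learn_shortcut(start, goal, trials=10_000):
    """Steps 2-3: practice random moves in an isolated mini-game until one
    bridges start -> goal, then package it as a reusable option."""
    for _ in range(trials):
        command = random.randrange(10)   # trial-and-error exploration
        if mini_game_success(command):
            return {"from": start, "to": goal, "policy": command}
    return None

# Step 1: a long, boring gap spotted on the abstract map.
gap = ("holding_block_tower_in_way", "holding_block_floor_clear")

# Steps 2-3: practice in the isolated mini-game.
shortcut = learn_shortcut(*gap)

# Step 4: save the mastered move as a new option on the main map.
options = []
if shortcut is not None:
    options.append(shortcut)
```

The structure is the point: the gap defines a small, self-contained training problem, and whatever behavior solves it gets promoted to a first-class action the planner can use later.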

4. The Result: Faster and Smarter

When the robot faces a new, real-world task:

  • Old Way: It follows the long, winding road of the manual.
  • SLAP Way: It looks at the map, sees the new "portal" (shortcut) it learned, and jumps straight through it.

In the experiments:

  • The robot solved tasks 50% to 73% faster than the old way.
  • It succeeded more often than robots that tried to learn the whole task from scratch without any rules (which is like trying to learn to drive by just spinning the wheel randomly).
  • It discovered moves humans never programmed, like slapping a tower of blocks or wiping a table with a tool to gather toys.

Why This Matters

This is a bridge between two worlds:

  1. Planning: The logical, step-by-step thinking of a computer (good for long, complex tasks).
  2. Learning: The creative, trial-and-error flexibility of a human (good for finding fast, clever solutions).

SLAP lets a robot keep its logical brain while giving it the ability to "improvise" physically. It's like giving a robot a rulebook, but then teaching it how to break the rules when doing so leads to a faster, better result.

In short: SLAP teaches robots to stop being rigid bureaucrats and start being creative problem-solvers, finding the "slap" instead of the "step-by-step."
