On Sample-Efficient Generalized Planning via Learned Transition Models

Imagine you are teaching a robot to solve a puzzle, like stacking blocks or moving packages.

The Old Way (The "Guess the Next Move" Robot)
Most modern AI planners work like a student who has memorized a specific recipe. They look at the current situation and try to guess the very next move based on patterns they've seen before.

The Problem: If you ask this robot to solve a puzzle with 10 blocks, it might do great. But if you suddenly give it a puzzle with 100 blocks (something it's never seen), it gets confused. It tries to guess the next move, but because it doesn't truly understand how the world changes, it starts hallucinating. It might think a block is floating in mid-air or that a truck is driving through a wall. This is called "state drift." It's like a storyteller who forgets the plot after a few chapters and starts making things up that don't make sense.
The Cost: To get good at this, these robots need to read millions of stories (training data) and have huge brains (massive computer models).

The New Way (The "Physics Teacher" Robot)
This paper introduces a smarter approach. Instead of guessing the next move, the AI learns the rules of the game (the physics of the world).

Think of it like this:

The Old Robot is a parrot. It repeats "Pick up block, put down block" because it heard it before.
The New Robot is a physics teacher. It learns: "If I pick up a block, the block is no longer on the table, and my hand is no longer empty."

How It Works (The "Map and Compass" Analogy)
The authors built a system that works in three simple steps:

The Map (State Representation):
Instead of listing every single object by name (which gets messy when you have 100 blocks), the AI uses a special "fingerprint" for the whole scene. Imagine taking a photo of the puzzle and turning it into a simple code that describes the structure of the scene, regardless of how many pieces are in it. This allows the AI to understand a puzzle with 5 blocks and a puzzle with 500 blocks using the same "language."
The Compass (Learning the Transition Model):
The AI learns a "transition model." This is like a simulator that predicts: "If I am in this state and I want to reach that goal, what will the world look like after I take a step?"
- It doesn't guess the action; it guesses the result.
- It uses a "delta" approach: It only learns what changes. If 99% of the blocks stay still, the AI ignores them and only focuses on the 1 block that moved. This makes it incredibly efficient.
The Safety Check (Neuro-Symbolic Decoding):
Here is the magic trick. The AI predicts the future state (e.g., "The block will be on the table"). But before the robot actually moves, it checks its official rulebook (the symbolic planner).
- It asks: "Is there a legal move in the real world that results in exactly this future state?"
- If yes, it executes that move.
- If no, it corrects itself immediately.
  This ensures the robot never breaks the laws of physics, even if its prediction was slightly off.

Why This Is a Big Deal

Small Brain, Big Results: The new method uses a tiny model (about 1 million parameters) compared to the "Giant Brain" models (up to about 220 million parameters) used by competitors.
Less Data Needed: It learns from a handful of examples (like 9 puzzles) instead of needing millions.
Better at the Unknown: When tested on puzzles much larger than anything it was trained on, this "Physics Teacher" robot succeeded where the "Parrot" robots failed completely.

The One Weakness
The paper admits that while this works well for simple, local puzzles (like stacking blocks), it struggles with extremely complex, multi-layered problems (like a massive logistics network with trucks, planes, and cities) where a single step doesn't tell the whole story. On the local puzzles, though, it clearly beats the "guess the next move" robots.

In a Nutshell
Instead of teaching an AI to memorize a script, this paper teaches it to understand the physics of the world. By predicting what happens next rather than what to do next, and then double-checking that prediction against the rules, the AI becomes smarter, faster, and capable of solving problems it has never seen before.

Here is a detailed technical summary of the paper "On Sample-Efficient Generalized Planning via Learned Transition Models."

1. Problem Definition

Generalized Planning (GP) aims to construct solution strategies that solve families of planning problems sharing a common domain model but varying in the number of objects (size) and specific configurations.

Formal Setup: A planning task is defined over a state-transition system $\Sigma = \langle S, A, \gamma \rangle$, where $S$ is the set of states, $A$ is the set of actions, and $\gamma: S \times A \to S$ is the transition function. The goal is to find a policy that works for instances with varying object counts ($|O|$), specifically generalizing from small training instances to much larger test instances (extrapolation).
Limitations of Current Approaches: Recent Transformer-based planners (e.g., PlanGPT, Plansformer) treat GP as action-centric sequence prediction ($p(\pi \mid \Pi)$). They directly predict action tokens without explicitly modeling the world state evolution.
- Drawbacks: These methods suffer from state drift in long-horizon tasks, require massive datasets and model sizes (hundreds of millions of parameters), and often fail to generalize to out-of-distribution (OOD) instances with larger object counts because they lack explicit state tracking.
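
To make the formal setup concrete, here is a minimal Python sketch of a transition function $\gamma$ for a STRIPS-style domain. The `Action` tuple, the atom strings, and the two-block example are illustrative assumptions, not taken from the paper.

```python
from typing import FrozenSet, NamedTuple, Optional

State = FrozenSet[str]  # a state is the set of ground atoms that currently hold


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]     # atoms that must hold before the action applies
    add: FrozenSet[str]     # atoms the action makes true
    delete: FrozenSet[str]  # atoms the action makes false


def gamma(state: State, action: Action) -> Optional[State]:
    """Return the successor state, or None if the action is not applicable."""
    if not action.pre <= state:
        return None
    return (state - action.delete) | action.add


# Hypothetical two-block Blocksworld fragment (names are assumptions).
s0 = frozenset({"ontable(a)", "ontable(b)", "clear(a)", "clear(b)", "handempty"})
pickup_a = Action(
    name="pickup(a)",
    pre=frozenset({"ontable(a)", "clear(a)", "handempty"}),
    add=frozenset({"holding(a)"}),
    delete=frozenset({"ontable(a)", "clear(a)", "handempty"}),
)
s1 = gamma(s0, pickup_a)
```

Only the atoms mentioning block `a` and the hand change; everything about block `b` carries over untouched, which is the sparsity that the residual formulation later exploits.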

2. Methodology: State-Centric Generalized Planning

The authors propose a state-centric formulation where the model learns the transition dynamics explicitly rather than predicting actions directly.

A. Core Formulation

Instead of predicting the next action $a_t$, the model learns a goal-conditioned transition model $T_\theta$ that predicts the successor state $\hat{s}_{t+1}$ given the current state $s_t$ and goal $g$.

Plan Generation: Plans are generated by "rolling out" the predicted state trajectory. At each step, the predicted successor embedding is matched against all valid symbolic successors (generated by the domain's operators) to recover the executable action.
Neuro-Symbolic Decoding: This ensures that every step in the generated plan is symbolically valid, correcting neural prediction errors via a nearest-neighbor search over the set of valid successors $Succ(s_t)$.
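
The decoding step can be sketched in a few lines of Python. This is only an illustration: the summary does not say which distance metric is used, so Euclidean distance is an assumption, and `decode_step`, the action names, and the toy embeddings are hypothetical.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean distance (an assumed metric; the source does not name one)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def decode_step(predicted: Vector, successors: Dict[str, Vector]) -> str:
    """Pick the applicable action whose successor embedding is nearest to the prediction.

    `successors` maps each applicable action name to the embedding of the
    state it leads to, i.e. an embedded view of Succ(s_t).
    """
    if not successors:
        raise ValueError("no applicable actions in this state")
    return min(successors, key=lambda name: euclidean(predicted, successors[name]))


# Toy example: the neural prediction is slightly off, but still closest to pickup(a).
successors = {"pickup(a)": [0.0, 1.0, 1.0], "pickup(b)": [1.0, 0.0, 1.0]}
predicted = [0.1, 0.8, 0.9]
chosen = decode_step(predicted, successors)
```

Because the returned action always comes from the set of applicable actions, an imprecise prediction can lead to a suboptimal step but never to an invalid one.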

B. Size-Invariant State Representations

To handle variable numbers of objects, the authors evaluate two state representations:

Fixed-Size Factored (FSF): Represents states as fixed-dimensional vectors with pre-assigned object slots. This fails to generalize to larger object counts because the slot mapping is rigid.
Weisfeiler-Leman (WL) Graph Embeddings: The primary contribution for representation.
- States and goals are encoded as relational instance graphs.
- k-iteration WL color refinement is applied to generate node color histograms.
- Properties: The resulting embedding $\phi(s, g) \in \mathbb{R}^D$ is permutation-invariant (order of objects doesn't matter) and size-invariant (dimension $D$ depends only on the domain structure, not the number of objects). This allows models trained on small instances to process larger ones.
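
A rough Python sketch of WL color refinement follows, assuming a state graph given as an adjacency dictionary with one initial label per node. The function name `wl_histogram`, the string-based color encoding, and the toy graphs are assumptions made for illustration; the paper's exact graph construction is not described here.

```python
from collections import Counter
from typing import Dict, Hashable, List


def wl_histogram(adj: Dict[Hashable, List[Hashable]],
                 labels: Dict[Hashable, str],
                 k: int) -> Counter:
    """Count the node colors observed over k rounds of WL color refinement."""
    colors = dict(labels)
    hist = Counter(colors.values())
    for _ in range(k):
        new_colors = {}
        for node, neighbours in adj.items():
            # A node's new color combines its own color with the sorted
            # multiset of its neighbours' colors.
            neighbour_colors = sorted(colors[n] for n in neighbours)
            new_colors[node] = colors[node] + "(" + ",".join(neighbour_colors) + ")"
        colors = new_colors
        hist.update(colors.values())
    return hist


# Two isomorphic path graphs whose nodes are named and ordered differently.
g1 = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}
l1 = {"x": "obj", "y": "obj", "z": "goal"}
g2 = {"p": ["q"], "q": ["p", "r"], "r": ["q"]}
l2 = {"p": "goal", "q": "obj", "r": "obj"}

h1 = wl_histogram(g1, l1, k=2)
h2 = wl_histogram(g2, l2, k=2)
```

The two histograms are identical because the refinement never looks at node names, which is the permutation invariance described above. To obtain a vector of fixed dimension $D$, one option (an assumption here) is to count only colors from a fixed vocabulary collected on the training instances.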

C. Transition Model Learning

The authors compare two model classes to predict state updates in the embedding space:

Parametric (LSTM): A 2-layer LSTM that learns sequential dependencies.
Non-Parametric (XGBoost): A tree-based regressor that approximates the transition kernel locally.

Residual Formulation:
To exploit the sparsity of STRIPS domains (where most predicates remain unchanged), the models predict a delta vector $\Delta_t = f_\theta(\phi(s_t), \phi(g))$ rather than the full state, and add it to the current embedding:
$\hat{\phi}(s_{t+1}) = \phi(s_t) + f_\theta(\phi(s_t), \phi(g))$
This explicitly encodes frame axioms and improves sample efficiency.
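
The residual update can be illustrated with a small Python sketch. The helper names `make_delta_targets` and `predict_next`, the toy trajectory, and the constant stand-in for $f_\theta$ are assumptions; in the paper the role of $f_\theta$ is played by an LSTM or an XGBoost regressor.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def make_delta_targets(trajectory: Sequence[Vector]) -> List[Tuple[Vector, Vector]]:
    """Pair each embedding with the change to its successor (the regression target)."""
    pairs = []
    for cur, nxt in zip(trajectory, trajectory[1:]):
        pairs.append((list(cur), [n - c for c, n in zip(cur, nxt)]))
    return pairs


def predict_next(phi_s: Vector, phi_g: Vector,
                 f_theta: Callable[[Vector, Vector], Vector]) -> Vector:
    """Residual update: phi(s_t) + f_theta(phi(s_t), phi(g))."""
    delta = f_theta(phi_s, phi_g)
    return [x + d for x, d in zip(phi_s, delta)]


# Toy embedded trajectory: one unit moves from feature 0 to feature 1 per step,
# while feature 2 never changes.
trajectory = [[2.0, 0.0, 1.0], [1.0, 1.0, 1.0], [0.0, 2.0, 1.0]]
targets = make_delta_targets(trajectory)


def constant_model(phi_s: Vector, phi_g: Vector) -> Vector:
    """Hypothetical stand-in for a trained regressor: always predicts the same delta."""
    return [-1.0, 1.0, 0.0]


phi_goal = trajectory[-1]
phi_hat = predict_next(trajectory[0], phi_goal, constant_model)
```

In this toy case the unchanged feature has a target of exactly zero at every step, so the model only has to learn the few entries that move.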

3. Key Contributions

Transition-Model Formulation: A novel formulation of GP as a transition-model learning problem ($\hat{\gamma} \approx \gamma$) rather than direct action prediction, enforcing explicit world-state evolution.
Systematic Evaluation of Representations: A rigorous comparison showing that WL graph embeddings are critical for size-invariant generalization, whereas fixed-slot encodings fail under extrapolation.
Efficiency and Performance: Demonstration that compact models (LSTM $\approx$ 1M params, XGBoost $\approx$ 115K nodes) trained on small, unaugmented datasets can match or exceed the extrapolation performance of massive Transformer baselines (25M–220M params) that rely on heavy data augmentation.

4. Experimental Results

The authors evaluated on four IPC benchmark domains: Blocksworld, Gripper, Logistics, and VisitAll.

Extrapolation Performance (OOD):
- Action-Centric Baselines (Plansformer, PlanGPT, SymT): Mostly achieved 0.00 success on strict extrapolation. The exception was SymT, which had limited success in Blocksworld, Gripper, and VisitAll. All three baselines failed completely on Logistics.
- State-Centric Models (WL-based):
  - Blocksworld: WL-XGB (delta) achieved 45% success vs. 13% for SymT.
  - VisitAll: WL-XGB (delta) achieved 87% success vs. 64% for SymT.
  - Gripper: SymT performed better (79% vs 25%), suggesting sequential memory (LSTM) or specific domain dynamics play a role here.
  - Logistics: All learned models (including state-centric) failed (0.00%), highlighting a limitation in domains with deep hierarchical causal coupling.
Sample Efficiency:
- The proposed method achieved strong results using orders of magnitude fewer parameters and no data augmentation (e.g., trained on only 9 instances for Blocksworld), whereas SymT required symmetry-based state-space expansion.
Residual Modeling:
- Delta-prediction significantly improved performance in sparse domains (Blocksworld, VisitAll) for tree-based models, confirming that learning state changes is more efficient than learning full states.

5. Significance and Conclusion

Inductive Bias over Scale: The paper argues that explicit transition modeling combined with size-invariant relational representations provides a stronger inductive bias for generalization than simply scaling up model size or data volume.
Robustness: The neuro-symbolic decoding interface guarantees symbolic validity at every step, mitigating the state drift common in autoregressive sequence models.
Limitations: The approach struggles in domains with complex, multi-layer hierarchical dependencies (like Logistics) where one-step transition prediction is insufficient.
Future Work: Extending the framework to multi-step or abstract transitions to handle hierarchical domains while preserving symbolic verification.

Code Availability: The implementation is open-source at https://github.com/ai4society/state-centric-gen-planning.

