Imagine you are trying to teach a robot to make a sandwich. If you just show the robot a video of a human making a sandwich, the robot gets overwhelmed. It sees thousands of tiny muscle movements, the exact angle of the hand, the speed of the slice, and the texture of the bread. It's like trying to learn to drive a car by memorizing the exact pressure on every pedal and the rotation of every wheel bolt. It's too much data, and the robot can't figure out the "big picture" rules.
This paper proposes a clever solution: Teach the robot to think in "chapters" instead of "letters."
Here is how their system works, broken down into simple concepts:
1. The Problem: The "Letter" vs. The "Word"
Most robots are great at moving their arms (the "letters"), but terrible at planning a whole day (the "words" and "sentences").
- Low-level data: The messy, continuous movement of a robot arm picking up a tomato.
- High-level logic: The concept of "Pick up the tomato."
The authors want to bridge this gap. They want the robot to look at a messy pile of robot movement videos (with no labels, no instructions, just raw footage) and figure out: "Oh, this chunk of movement is 'opening a drawer,' and that chunk is 'pouring coffee.'"
2. The Solution: The "Neuro-Symbolic" Chef
The authors built a system that acts like a smart sous-chef who learns by watching.
Step 1: The Pattern Finder (Neural Network)
Imagine you show the robot 100 videos of someone opening a drawer. Sometimes they pull it fast, sometimes slow, sometimes from the left, sometimes from the right. The robot's "brain" (a neural network) looks at all these messy videos and realizes, "Wait a minute, even though the movements are different, they all end up with the drawer open."

The robot groups these similar movements together and gives them a secret code (a "symbol"). It doesn't know the word "drawer" yet; it just knows that "Code A" means "Open Drawer." It does this for every skill it sees, turning messy videos into a neat list of abstract codes.
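The grouping idea can be sketched in a few lines. This is a toy stand-in, not the paper's actual model: the real system uses a learned neural encoder, while here plain k-means clusters hand-made 2-D "trajectory summaries" into two unnamed codes. All names (`assign_skill_codes`, the drawer/pour data) are invented for illustration.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_skill_codes(trajs, n_iters=20):
    """Group trajectory summaries into two discrete 'skill codes' with
    plain k-means -- a toy stand-in for the paper's learned encoder."""
    # Deterministic init: the first and last trajectories seed the two codes.
    centroids = [list(trajs[0]), list(trajs[-1])]
    codes = [0] * len(trajs)
    for _ in range(n_iters):
        # Each trajectory gets the code of its nearest centroid.
        codes = [0 if sq_dist(t, centroids[0]) <= sq_dist(t, centroids[1]) else 1
                 for t in trajs]
        # Each centroid moves to the mean of its member trajectories.
        for k in (0, 1):
            members = [t for t, c in zip(trajs, codes) if c == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return codes

# Noisy "open drawer" motions land near (1, 0); "pour" motions near (0, 1).
random.seed(0)
drawer = [(1 + random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(10)]
pour = [(random.gauss(0, 0.1), 1 + random.gauss(0, 0.1)) for _ in range(10)]
codes = assign_skill_codes(drawer + pour)
# No labels anywhere, yet all drawer motions end up sharing one code.
```

The point of the sketch: the algorithm never sees the words "drawer" or "pour," only that some motions resemble each other more than others.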
Step 2: The Translator (The AI Chatbot)
Now the robot has a list of codes (Code A, Code B, Code C), but a human planner doesn't know what they mean. So the system takes a snapshot of what the robot did when it used "Code A" and shows it to a super-smart AI (like a large language model).

The AI looks at the picture and says, "Ah, I see! Code A is 'Pick up the tomato' and Code B is 'Pour the water'."
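As a rough sketch of this translation step: one representative snapshot per code goes to a vision-language model, and the answers become a code-to-phrase dictionary. Everything here is hypothetical, `mock_vlm` merely stands in for a real image-captioning API call, and the filenames and phrases are invented.

```python
def label_skill_codes(code_snapshots, describe):
    """Build a code -> human-language dictionary by showing one snapshot
    per code to a vision-language model (here mocked by `describe`)."""
    return {code: describe(snapshot) for code, snapshot in code_snapshots.items()}

# Hypothetical stand-in for a real vision-language-model request.
def mock_vlm(snapshot):
    canned = {
        "gripper_on_tomato.png": "pick up the tomato",
        "tilted_kettle.png": "pour the water",
    }
    return canned[snapshot]

dictionary = label_skill_codes(
    {"code_a": "gripper_on_tomato.png", "code_b": "tilted_kettle.png"},
    describe=mock_vlm,
)
# dictionary == {"code_a": "pick up the tomato", "code_b": "pour the water"}
```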
Suddenly, the robot has a dictionary. It can now talk to a high-level planner using human language.

Step 3: The Master Planner (The Brain)
Now, you give the robot a big goal: "Make me a salad."
The high-level planner (the "Brain") uses the dictionary to create a to-do list:
- Pick up the tomato.
- Pick up the lettuce.
- Put them in the bowl.
This is the High-Level Plan. It's simple and logical.
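One way to picture the hand-off from plan to robot, assuming the dictionary from the previous step: invert it, then map each to-do item back to the skill code the robot can actually execute. The function name and codes are illustrative, not from the paper.

```python
def plan_to_codes(steps, dictionary):
    """Translate a human-language to-do list into executable skill codes
    by inverting the learned code -> description dictionary."""
    inverse = {desc: code for code, desc in dictionary.items()}
    return [inverse[step] for step in steps]

dictionary = {"code_a": "pick up the tomato", "code_c": "put it in the bowl"}
plan = plan_to_codes(["pick up the tomato", "put it in the bowl"], dictionary)
# plan == ["code_a", "code_c"]
```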
Step 4: The Muscle Memory (The Execution)
Here is the magic trick. When the robot needs to actually do step 1 ("Pick up the tomato"), it doesn't just guess. It remembers the specific "Code A" it learned earlier. But since the tomato is in a slightly different spot than in the training videos, the robot uses a math trick (gradient-based planning) to tweak the movement slightly.

It's like a pianist who knows the song "Chopsticks." If the piano is moved to a different room, the pianist doesn't relearn the song; they just adjust their hand position slightly to hit the right keys. The robot adjusts its "Code A" to fit the new tomato location.
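The "math trick" in miniature: gradient descent nudges the remembered motion's endpoint toward where the tomato sits today, minimizing the squared distance between them. This is a toy one-point version of gradient-based planning (the real system optimizes whole trajectories); the positions and learning rate are made up.

```python
def refine_endpoint(endpoint, target, lr=0.2, n_steps=50):
    """Nudge a remembered motion's endpoint toward a new object position
    via gradient descent on squared distance (toy gradient-based planning)."""
    x, y = endpoint
    for _ in range(n_steps):
        # Gradient of ||p - target||^2 with respect to p is 2 * (p - target).
        x -= lr * 2 * (x - target[0])
        y -= lr * 2 * (y - target[1])
    return x, y

# Training videos ended at (0.5, 0.2); today's tomato sits at (0.7, 0.1).
new_end = refine_endpoint((0.5, 0.2), (0.7, 0.1))
```

After 50 small steps the endpoint sits essentially on the tomato: the skill is reused, not relearned.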
3. Why This is a Big Deal
Usually, to teach a robot a new trick, you need a human to record hundreds of perfect examples and label every single one ("This is picking," "This is placing"). That takes forever.
This system is special because:
- It learns from "messy" data: It can look at a few unlabelled videos and figure out the skills on its own.
- It works in new places: If you teach it to pick up a cup from the left side of the table, it can figure out how to pick it up from the right side without needing new training.
- It handles long tasks: It can plan complex sequences (like "Make coffee, then wash the cup, then dry it") by chaining these simple "codes" together.
The Analogy: Learning a Language
Think of the robot's raw movements as sounds (like a baby babbling).
- Old way: You have to teach the robot every specific sound for every specific situation.
- This paper's way: The robot listens to the babbling, figures out that certain sounds form "words" (Skills), and then uses a dictionary (the AI) to translate those words into a sentence (the Plan). Once it knows the words, it can speak new sentences it has never heard before.
The Result
In their tests, this robot could go into a kitchen it had never seen before, look at a cluttered counter, and successfully make coffee or load a dishwasher, even if the cups and plates were in weird spots. It did this by only watching a few examples of each action first.
In short: They taught the robot to stop memorizing every single muscle twitch and start thinking in "actions," allowing it to be a flexible, smart helper rather than a rigid, pre-programmed machine.