SymSkill: Symbol and Skill Co-Invention for Data-Efficient and Reactive Long-Horizon Manipulation

Imagine you are teaching a robot a long kitchen chore: making a sandwich, cleaning up, and then putting the groceries away.

The Problem:
Current robots are like two very different types of students:

The "Parrot" (Imitation Learning): If you show a parrot how to make a sandwich 1,000 times, it can copy you perfectly if the bread is in the exact same spot. But if you move the bread, the parrot freezes. It doesn't understand what it's doing; it just memorizes the muscle movements.
The "Overthinker" (Classical Planning): This robot understands the logic: "I need to open the fridge, get the cheese, close the fridge." But it thinks so slowly that by the time it figures out the plan, the cheese has melted, or a human has moved the fridge. It can't react fast enough to real-world chaos.

The Solution: SymSkill
The paper introduces SymSkill, a new way to train robots that combines the best of both worlds. Think of it as teaching the robot to be a smart apprentice who learns by watching you play, rather than just copying your muscles.

Here is how it works, broken down into simple steps:

1. The "Play" Phase (Learning without a Manual)

Usually, you have to manually label every single move a robot makes ("Now pick up the cup," "Now move to the table"). SymSkill is different. You just let the robot watch you play with objects for about 5 minutes.

The Magic Trick: The robot doesn't just watch the hand movements. It uses a "smart eye" (a Vision Language Model) to figure out what is important.
Analogy: Imagine you are teaching a child to open a door. You don't say, "Move your hand 3 inches left." You just say, "Look at the handle." SymSkill does this automatically. It figures out that the "handle" is the reference point, not the floor or the wall.

2. Inventing the "Vocabulary" (Symbols)

Once the robot has watched you, it invents its own vocabulary (predicates) to describe the world.

Instead of seeing "x-coordinate 45, y-coordinate 12," it learns concepts like "Door Is Open" or "Cup Is On Table."
It groups similar movements together. If you opened a cabinet door three times, it realizes, "Ah, this is the 'Open Cabinet' skill."
Analogy: It's like a child learning that "eating" involves a fork, a plate, and food. They don't need a manual; they just notice the pattern.

3. Learning the "Muscle Memory" (Skills)

For every concept it invents (like "Open Cabinet"), the robot learns a specific, stable movement pattern (a skill).

It learns a "force field" (a mathematical concept called a Dynamical System).
Analogy: Imagine a marble rolling in a bowl. No matter where you drop the marble, it always rolls to the bottom. SymSkill teaches the robot to create these "bowl-shaped" paths, with the target at the bottom. If you bump the robot's arm while it's reaching for a cup, the "bowl" gently guides it back to the cup without the robot panicking or stopping to think.

4. The "Brain" vs. The "Reflex" (Online Execution)

This is where SymSkill shines. It splits the robot's brain into two parts:

The High-Level Brain (Symbolic Planner): This part is slow but smart. It looks at the goal ("Put the cheese in the fridge") and decides the order of operations: Open Door -> Pick Cheese -> Close Door.
The Low-Level Reflex (The Skills): This part is fast. It executes the "Open Door" skill. Because each skill behaves like the "bowl" described above, if a human bumps the robot's arm while the door is opening, the robot slides back on track right away. It doesn't need to stop and re-calculate the whole plan.

Why is this a big deal?

Data Efficiency: It learns complex 12-step tasks from just 5 minutes of play data. Other methods might need hours or thousands of examples.
Real-Time Recovery: If the robot drops a lid or someone moves a chair, it doesn't crash. It instantly re-plans the order of steps (the high-level brain) while keeping the movement smooth (the low-level reflex).
Generalization: Because it learned the logic (Predicates) and not just the movements, it can combine skills it already knows to do new things it has never seen before.

The Bottom Line

SymSkill is like teaching a robot to be a chef instead of a tape recorder.

A tape recorder just plays back the exact same song. If you change the tempo, it breaks.
A chef understands the ingredients and the steps. If you move the stove, the chef doesn't panic; they just adjust their movements and keep cooking.

SymSkill allows robots to learn from short, messy, real-world play sessions and then perform complex, long tasks safely and quickly, even when things go wrong.

Here is a detailed technical summary of the paper "SymSkill: Symbol and Skill Co-Invention for Data-Efficient and Reactive Long-Horizon Manipulation."

1. Problem Statement

The paper addresses the challenge of enabling robots to perform complex, long-horizon manipulation tasks in dynamic environments. Current approaches face a trade-off:

Imitation Learning (IL): Excellent at reproducing specific skills from large datasets, but it learns "monolithic" policies that lack compositional generalization. Such policies struggle to decide which skill to reuse when the scene changes.
Task-and-Motion Planning (TAMP): Offers compositional generalization by decomposing tasks into symbolic planning and continuous motion. However, it typically relies on hand-engineered symbols and skills (labor-intensive) and suffers from high planning latency (seconds to minutes), which makes real-time failure recovery impractical in dynamic settings.

The Goal: Develop a framework that learns symbols (predicates/operators) and skills jointly from unlabeled, unsegmented robot demonstrations (as few as 5 per task) to achieve real-time, reactive, and data-efficient long-horizon planning.

2. Methodology: SymSkill Framework

SymSkill is a unified framework that co-invents symbolic abstractions and goal-oriented skills. It operates in two phases: Offline Learning and Online Execution.

A. Offline Learning Pipeline

The system processes raw, unsegmented demonstration trajectories ( $\tau$ ) to learn predicates, operators, and skills.

Segmentation & Reference Frame Selection:
- Trajectories are segmented into Premotion (end-effector moving toward an object) and Motion (end-effector moving an object).
- Motion Object ( $o_{int}$ ): Identified as the object moving during the motion segment.
- Reference Object ( $o_{ref}$ ): For motion segments, a Vision-Language Model (VLM, specifically Gemini-2.5-Pro) is queried on sampled frames to identify a stationary, semantically relevant reference object (e.g., a sink when moving a cup). This step is lightweight and only used offline.
- Trajectories are expressed in relative frames: $o_{int}$ frame for premotion, and $o_{ref}$ frame for motion.
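
To make the idea of a relative frame concrete, here is a minimal sketch of expressing one object's pose in another object's frame using homogeneous transforms. The helper names (`make_pose`, `relative_pose`) and the sink and cup poses are illustrative assumptions, not taken from the paper's code. The sketch also shows why relative frames help with generalization: moving the whole scene rigidly leaves the relative pose unchanged.

```python
import numpy as np

def make_pose(yaw_rad, translation):
    """Build a 4x4 homogeneous transform from a yaw angle and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def relative_pose(T_world_ref, T_world_obj):
    """Pose of obj expressed in the frame of ref: inv(T_world_ref) @ T_world_obj."""
    return np.linalg.inv(T_world_ref) @ T_world_obj

# Hypothetical scene: a sink (reference object) and a cup (moved object),
# both given in the world frame.
T_world_sink = make_pose(np.pi / 2, [1.0, 2.0, 0.0])
T_world_cup = make_pose(np.pi / 2, [1.0, 2.5, 0.1])
T_sink_cup = relative_pose(T_world_sink, T_world_cup)

# Apply the same rigid shift to both objects, as if the whole scene had moved.
T_shift = make_pose(0.3, [-4.0, 0.7, 0.0])
T_sink_cup_shifted = relative_pose(T_shift @ T_world_sink, T_shift @ T_world_cup)

# The cup's pose relative to the sink is the same in both scenes.
print(np.allclose(T_sink_cup, T_sink_cup_shifted))
```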
Predicate Learning (Symbolic Abstraction):
- Predicates are defined as relative-pose classifiers.
- Premotion Predicates ( $o_{int}\psi_{ee}$ ): Learned by fitting Gaussian distributions to the end-effector poses relative to the motion object at the start of the motion segment.
- Motion Predicates ( $o_{ref}\psi_{o_{int}}$ ): Learned by fitting Gaussians to the final pose of the manipulated object relative to the reference object.
- These distributions define ellipsoids; a predicate is true if the Mahalanobis distance between the current relative pose and the learned distribution is below a threshold.
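
The Gaussian-plus-threshold idea above can be sketched in a few lines. This is a simplified illustration that uses 3D relative positions only, whereas the paper's predicates classify full relative poses. The class name `GaussianPredicate`, the threshold of 3.0, the small regularization term, and the sample data are all assumptions made for the example.

```python
import numpy as np

class GaussianPredicate:
    """Relative-position classifier: true inside a Mahalanobis ellipsoid."""

    def __init__(self, samples, threshold=3.0, reg=1e-6):
        samples = np.asarray(samples, dtype=float)
        self.mean = samples.mean(axis=0)
        # Regularize so the covariance stays invertible with very few demos.
        cov = np.cov(samples, rowvar=False) + reg * np.eye(samples.shape[1])
        self.cov_inv = np.linalg.inv(cov)
        self.threshold = threshold

    def mahalanobis(self, x):
        d = np.asarray(x, dtype=float) - self.mean
        return float(np.sqrt(d @ self.cov_inv @ d))

    def holds(self, x):
        return self.mahalanobis(x) <= self.threshold

# Hypothetical final cup positions relative to the sink, one per demonstration.
final_positions = [
    [0.50, 0.00, 0.10],
    [0.52, 0.01, 0.11],
    [0.49, -0.01, 0.10],
    [0.51, 0.02, 0.09],
    [0.48, -0.02, 0.10],
]
cup_in_sink = GaussianPredicate(final_positions)

print(cup_in_sink.holds(cup_in_sink.mean))  # the mean is always inside
print(cup_in_sink.holds([1.0, 1.0, 1.0]))   # a far-away position is outside
```

With only a handful of demonstrations the fitted covariance can be close to singular, which is why the sketch adds a small diagonal term before inverting it.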
Operator Learning:
- Operators ( $\alpha$ ) are derived by tracking transitions between abstract states (the sets of predicates that hold at a given time) across demonstrations.
- Each operator includes: Preconditions, Add/Delete Effects, Maintenance Conditions (predicates that must hold during execution), and an associated Skill.
- This process invents reusable operators (e.g., "Pick Lid," "Place Lid") without manual labeling.
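
As a rough illustration of how operators with these fields could be derived from abstract-state transitions, the sketch below groups transitions by their effects and intersects the states within each group. This grouping-and-intersection strategy is a simple stand-in chosen for the example, not necessarily the paper's procedure, and the predicate strings and the `learn_operators` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset
    maintenance: frozenset  # predicates that held before and after every example

def learn_operators(transitions):
    """transitions: iterable of (before, after) sets of predicate names."""
    groups = {}
    for before, after in transitions:
        before, after = frozenset(before), frozenset(after)
        add, delete = after - before, before - after
        if not add and not delete:
            continue  # no symbolic change, so nothing to learn from this pair
        groups.setdefault((add, delete), []).append((before, after))

    operators = []
    for i, ((add, delete), members) in enumerate(groups.items()):
        # Preconditions: predicates true before every example of this effect.
        pre = frozenset.intersection(*(b for b, _ in members))
        # Maintenance (assumed proxy): predicates true both before and after.
        keep = frozenset.intersection(*(b & a for b, a in members))
        operators.append(Operator(f"op{i}", pre, add, delete, keep))
    return operators

# Two hypothetical demonstrations of moving a lid from the pot to the table.
transitions = [
    ({"Grasped(lid)", "On(lid,pot)"}, {"Grasped(lid)", "On(lid,table)"}),
    ({"Grasped(lid)", "On(lid,pot)", "Open(door)"},
     {"Grasped(lid)", "On(lid,table)", "Open(door)"}),
]
ops = learn_operators(transitions)
print(ops[0])
```

In this toy data, `Open(door)` appears in only one demonstration, so the intersection drops it from the preconditions, while `Grasped(lid)` survives as a maintenance condition.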
Skill Learning (SE(3) LPV-DS):
- Instead of standard neural networks, SymSkill learns Dynamical System (DS) policies, specifically SE(3) Linear Parameter-Varying Dynamical Systems (LPV-DS).
- These policies model the motion as a state-dependent mixture of Linear Time-Invariant (LTI) systems, constrained so that the motion converges to the target from any starting state (global asymptotic stability).
- The policy outputs a 6D end-effector twist, tracked by a passive impedance controller for safety and disturbance rejection.
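
The following is a minimal, position-only sketch of an LPV-DS, meant to illustrate the "mixture of linear systems with a shared attractor" idea rather than the paper's SE(3) formulation (orientation is omitted, and the parameters are hand-picked, not learned). The velocity is $\dot{x} = \sum_k \gamma_k(x) A_k (x - x^*)$ with non-negative weights $\gamma_k$ that sum to one. In this sketch every $A_k$ is symmetric negative definite, which is a sufficient condition for the squared distance to the attractor to decrease along the flow. The names `lpv_ds_velocity` and `rollout`, the mixture centers, and the weighting width are assumptions.

```python
import numpy as np

def lpv_ds_velocity(x, attractor, centers, A_list, width=0.2):
    """x_dot = sum_k gamma_k(x) * A_k @ (x - attractor)."""
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    logits = -sq_dist / (2.0 * width ** 2)
    gamma = np.exp(logits - logits.max())  # numerically stable softmax weights
    gamma /= gamma.sum()
    A = np.tensordot(gamma, A_list, axes=1)  # state-dependent blend of the A_k
    return A @ (x - attractor)

# Hand-picked parameters: two linear systems sharing one attractor.
attractor = np.array([0.5, 0.0, 0.1])
centers = np.array([[0.0, 0.3, 0.4], [0.3, 0.1, 0.2]])
A_list = np.array([-np.diag([1.0, 2.0, 1.5]), -np.diag([2.0, 1.0, 3.0])])

def rollout(x0, steps=2000, dt=0.01, bump_at=None, bump=None):
    """Euler-integrate the DS, optionally pushing the state once mid-motion."""
    x = np.array(x0, dtype=float)
    for t in range(steps):
        if bump_at is not None and t == bump_at:
            x = x + bump  # external disturbance, e.g. a human pushing the arm
        x = x + dt * lpv_ds_velocity(x, attractor, centers, A_list)
    return x

x_free = rollout([0.0, 0.3, 0.4])
x_bumped = rollout([0.0, 0.3, 0.4], bump_at=500, bump=np.array([0.2, -0.3, 0.1]))
print(np.linalg.norm(x_free - attractor), np.linalg.norm(x_bumped - attractor))
```

Both rollouts end at the attractor: the disturbance changes the path but not the destination, and no replanning step is involved.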

B. Online Execution & Recovery

Symbolic Planning: Given a symbolic goal (conjunction of predicates), an A* planner composes a sequence of learned operators to reach the goal.
Reactive Execution: The robot executes the low-level DS skills. Because DS policies are stable feedback controllers, they naturally reject continuous perturbations without replanning.
Failure Recovery:
- Motion Level: If an obstacle is detected, the DS policy is modulated locally to avoid it.
- Symbolic Level: If a maintenance condition is violated or an effect is not achieved (e.g., a grasp is lost), the system triggers replanning.
- Resampling: The system can resample the target pose (attractor) from the learned Gaussian distributions to retry the skill with a slightly different target, enabling autonomous recovery from disturbances (e.g., regrasping a dropped object).
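
The planning and symbolic-level recovery steps above can be sketched with a small A* search over STRIPS-style operators. The operator set, predicate names, and heuristic here are hypothetical and hand-written for illustration; in SymSkill the operators are learned from demonstrations. The heuristic counts unsatisfied goal predicates and divides by the largest number of add effects of any operator, which keeps it admissible for unit-cost actions.

```python
import heapq
import math
from itertools import count

# Hypothetical operators: name -> (preconditions, add effects, delete effects)
OPERATORS = {
    "OpenDoor": ({"DoorClosed"}, {"DoorOpen"}, {"DoorClosed"}),
    "PickCheese": ({"CheeseOnCounter", "HandEmpty"}, {"HoldingCheese"},
                   {"CheeseOnCounter", "HandEmpty"}),
    "PlaceCheese": ({"HoldingCheese", "DoorOpen"},
                    {"CheeseInFridge", "HandEmpty"}, {"HoldingCheese"}),
    "CloseDoor": ({"DoorOpen"}, {"DoorClosed"}, {"DoorOpen"}),
}
MAX_ADDS = max(len(add) for _, add, _ in OPERATORS.values())

def heuristic(state, goal):
    # Each action satisfies at most MAX_ADDS goal predicates.
    return math.ceil(len(goal - state) / MAX_ADDS)

def plan(state, goal):
    """A* over abstract states; returns a list of operator names or None."""
    start, goal = frozenset(state), frozenset(goal)
    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(heuristic(start, goal), next(tie), 0, start, [])]
    best_cost = {start: 0}
    while frontier:
        _, _, cost, current, path = heapq.heappop(frontier)
        if goal <= current:
            return path
        for name, (pre, add, delete) in OPERATORS.items():
            if not pre <= current:
                continue
            nxt = frozenset((current - delete) | add)
            if cost + 1 < best_cost.get(nxt, math.inf):
                best_cost[nxt] = cost + 1
                f = cost + 1 + heuristic(nxt, goal)
                heapq.heappush(frontier, (f, next(tie), cost + 1, nxt, path + [name]))
    return None

goal = {"CheeseInFridge", "DoorClosed"}
first_plan = plan({"DoorClosed", "CheeseOnCounter", "HandEmpty"}, goal)

# Symbolic-level failure: the door is open, but the cheese was dropped back
# onto the counter. Replanning starts from the observed abstract state.
recovery_plan = plan({"DoorOpen", "CheeseOnCounter", "HandEmpty"}, goal)
print(first_plan, recovery_plan)
```

Because replanning only searches over a handful of symbols, it is cheap compared with recomputing continuous motions; the low-level skills handle the rest.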

3. Key Contributions

Unified Co-Invention Framework: A method to jointly learn predicates, operators, and skills from unlabeled, unsegmented data, requiring as few as 5 demonstrations per task.
Data Efficiency & Generalization: By expressing motions in relative frames, and by using a VLM only for offline reference selection, the system avoids the heavy data requirements of monolithic IL and the manual engineering of TAMP.
Real-Time Reactivity: The combination of stable DS skills and symbolic-level replanning allows for real-time failure recovery at both the motion and symbolic levels, a capability missing in standard TAMP.
Open-Source Implementation: Provides a working implementation in the RoboCasa simulation and on a real Franka Panda robot.

4. Experimental Results

A. Simulation (RoboCasa)

Single-Step Tasks: Achieved an 85% average success rate across 12 tasks (e.g., opening doors, picking and placing) using only 5–10 demonstrations per task.
Comparison:
- Outperformed Diffusion Policy (DP) significantly (85% vs. 3.3% average). DP failed due to data scarcity and lack of stability in the premotion phase.
- Outperformed NSIL (a prior co-invention method) which failed to learn semantically meaningful predicates in low-data, multi-object settings.
Multi-Step Tasks: Successfully composed single-step skills into a 12-step "StoreCheese" task without additional training data.

B. Real-World (Franka Panda Robot)

Learning from Play: Learned 11 operators from approximately 5 minutes of unsegmented play data.
Task Execution: Successfully executed a 12-step task (Open Door $\to$ Pick Cheese $\to$ Place $\to$ Close Door) based on user-specified symbolic goals.
Robustness: Demonstrated recovery from three types of disturbances:
1. Symbolic: Closing a lid unexpectedly triggered replanning.
2. Motion: Moving the target object was handled by the stable DS controller without replanning.
3. Obstacle: Added obstacles were avoided via local modulation.

5. Significance

SymSkill represents a significant step toward generalist robots that can learn complex manipulation tasks quickly and safely.

Bridging the Gap: It effectively bridges the gap between the flexibility of imitation learning and the logical rigor of symbolic planning.
Practicality: By reducing the data requirement to minutes of play and enabling real-time recovery, it moves robotic learning closer to real-world deployment where environments are dynamic and data is scarce.
Safety: The use of passive impedance control and stable dynamical systems is designed to keep interaction with humans and fragile objects safe.

The code and additional analysis are available at the provided project link, facilitating further research in data-efficient robotic learning.