See and Switch: Vision-Based Branching for Interactive Robot-Skill Programming

This paper introduces "See & Switch," a vision-based interactive framework for programming by demonstration. It uses eye-in-hand camera images to enable reliable online conditional branching and anomaly detection in dexterous robot tasks, achieving high accuracy across diverse conditions and novice users.

Petr Vanc, Jan Kristof Behrens, Václav Hlaváč, Karla Stepanova

Published Tue, 10 Ma

Imagine you are teaching a robot to make a sandwich.

In the old days, you had to record the robot's hand moving exactly from the fridge to the counter, then to the bread, then to the knife. If you did this once, the robot would memorize that exact path. But what if you moved the bread to a different shelf? Or what if the fridge door was closed? The robot would try to walk through the closed door or grab the air where the bread used to be, fail, and stop. It was like a broken record player stuck on one song.

This paper introduces a new way to teach robots called "See & Switch." Think of it as giving the robot a smart GPS instead of a fixed map.

The Core Idea: The "Decision Tree"

Instead of one long, rigid path, the robot learns a branching tree of actions.

  • The Trunk: The robot learns the basic steps (e.g., "Go to the kitchen").
  • The Branches: At certain points, the robot stops and asks, "What do I see?"
    • Scenario A: "I see the bread on the counter." -> Take the "Grab Bread" branch.
    • Scenario B: "I see the bread is inside a closed box." -> Take the "Open Box" branch.
    • Scenario C: "I see nothing familiar." -> Sound an alarm and ask the human for help.
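The trunk-and-branches idea above can be sketched as a small tree data structure. This is only an illustrative sketch of the concept, not the paper's actual implementation; all class and action names here are made up for the sandwich example.

```python
# Illustrative sketch of a branching skill tree (hypothetical names,
# not the paper's real code).

class SkillNode:
    def __init__(self, action):
        self.action = action      # e.g. "go_to_kitchen"
        self.branches = {}        # observed condition -> next SkillNode

    def add_branch(self, condition, node):
        self.branches[condition] = node

    def next_node(self, observed_condition):
        # Follow the branch matching what the robot sees;
        # None means an unfamiliar situation (an anomaly).
        return self.branches.get(observed_condition)

# Build the sandwich example from the text:
trunk = SkillNode("go_to_kitchen")
trunk.add_branch("bread_on_counter", SkillNode("grab_bread"))
trunk.add_branch("bread_in_box", SkillNode("open_box"))

assert trunk.next_node("bread_on_counter").action == "grab_bread"
assert trunk.next_node("door_closed") is None  # unfamiliar -> ask the human
```

The key design point: unfamiliar observations don't crash the robot; they simply return no branch, which is the cue to stop and ask for help.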

The Magic Ingredient: The "Switcher"

The paper's main invention is a component called the Switcher.

  • The Eyes: The robot has a camera on its hand (like a human looking at what they are holding).
  • The Brain: The Switcher is a smart AI that looks at the camera image at those "branching points."
  • The Choice: It doesn't just guess; it compares what it sees against a library of things it has learned.
    • If it sees a familiar situation, it instantly picks the correct branch to continue the task.
    • If it sees something weird (like a door that was never there before), it flags it as an Anomaly.
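One simple way to picture the Switcher's "compare against a library" step is nearest-neighbor matching on image feature vectors with an anomaly threshold. The sketch below assumes the camera image has already been turned into a feature vector by some vision model; the function names, vectors, and threshold are invented for illustration and are not the paper's actual method.

```python
import math

# Hedged sketch of a Switcher-style decision: match the current view's
# feature vector against a library of known situations; if nothing is
# similar enough, report an anomaly. (Illustrative, not the paper's code.)

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def switch(current_view, library, threshold=0.8):
    """Return the best-matching branch label, or 'ANOMALY'."""
    best_label, best_score = None, -1.0
    for label, reference in library.items():
        score = cosine_sim(current_view, reference)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else "ANOMALY"

# Toy library of learned situations (made-up feature vectors):
library = {
    "bread_on_counter": [1.0, 0.1, 0.0],
    "bread_in_box":     [0.0, 1.0, 0.2],
}

assert switch([0.9, 0.2, 0.0], library) == "bread_on_counter"
assert switch([0.1, 0.1, 1.0], library) == "ANOMALY"
```

The threshold is what keeps the robot honest: a weak best match is treated as "something weird," not forced into the nearest familiar branch.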

The "Teaching" Process: No Code Required

The coolest part is how you teach the robot to handle these new situations. You don't write code. You just show it.

  1. The Glitch: The robot tries to do the task, sees a closed door, and stops because it doesn't know what to do.
  2. The Rescue: You (the human) step in. You can:
    • Physically guide the robot's arm (Kinesthetic teaching).
    • Use a joystick.
    • Wave your hands in the air (Gestures).
  3. The Lesson: You show the robot how to open the door. The system automatically adds this new "branch" to its tree. Next time, if it sees a closed door, it will know to open it first.
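The glitch-rescue-lesson loop above can be sketched in a few lines: when the robot sees something it has no branch for, the human's demonstration becomes the new branch. This is a minimal sketch under assumed names (`run_with_recovery`, the condition labels, the action strings are all hypothetical), not the paper's actual interface.

```python
# Illustrative sketch of the interactive repair loop described above.

def run_with_recovery(branches, observe, demonstrate):
    """branches: dict mapping observed condition -> action to take.
    observe() returns a label for the current view; on an unfamiliar
    view the robot stops, the human demonstrates, and the dict grows."""
    view = observe()
    if view not in branches:              # anomaly: robot stops and asks
        branches[view] = demonstrate()    # new branch added automatically
    return branches[view]

# The robot only knows the open-door case so far:
branches = {"door_open": "reach_inside"}

# It meets a closed door; the human demonstrates the fix once:
action = run_with_recovery(branches,
                           observe=lambda: "door_closed",
                           demonstrate=lambda: "open_door_first")
assert action == "open_door_first"
assert branches["door_closed"] == "open_door_first"  # remembered for next time
```

Note that it doesn't matter *how* `demonstrate` produces the new action (hand-guiding, joystick, or gestures); the tree just stores the resulting skill.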

Why This Matters (The "Aha!" Moment)

The researchers tested this with regular people (non-experts) teaching a robot three tricky tasks:

  1. Picking up a peg.
  2. Measuring voltage with a probe (sometimes behind a door).
  3. Wrapping a cable.

The Results:

  • It Works: The robot successfully chose the right path 90% of the time, even when the environment changed.
  • It's Flexible: It didn't matter how the human taught it (hand-guiding, joystick, or gestures). The system understood all of them.
  • It's Safe: If the robot got confused, it didn't just crash; it stopped and waited for a human to show it the new way.

The Analogy: Learning to Drive

  • Old Way (Fixed Replay): You memorize the route to work. If there is road construction, you get stuck because you don't know how to detour.
  • New Way (See & Switch): You learn the rules of the road. When you see a "Road Closed" sign (the Anomaly), you know to look for a detour. If you've never seen that specific detour, you call a friend (the Human) to show you the way. Once you've been shown, you remember it for next time.

In a Nutshell

This paper solves the problem of robots being "brittle" (easily broken by small changes). By giving them a visual brain that can make decisions and a flexible memory that grows as you teach it new tricks, we can finally have robots that work in the messy, unpredictable real world, not just in a perfect lab.