How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference

Imagine you are teaching a robot to peel an apple. Sounds simple, right? But for a robot, this is like trying to walk a tightrope while juggling. The robot has to hold a knife, press it against the fruit just hard enough to cut the skin but not the flesh, and follow a bumpy, curved surface that changes every time.

If the robot presses too hard, it cuts the apple. Too soft, and it just scratches the surface. And unlike a game where you just win or lose, "peeling" is subjective: Did the robot leave a nice, even strip? Did it waste too much fruit? Did it look smooth?

This paper, "How to Peel with a Knife," is about teaching a robot to do this tricky job so well that it matches human standards, even on fruits it has never seen before. Here is how they did it, broken down into simple steps.

1. The Setup: The Robot's "Hand" and "Eyes"

The researchers built a robot arm (a Kinova Gen3) and gave it a special "hand" that holds a knife.

The Feel: They attached a force sensor (like a super-sensitive scale) to the wrist. This lets the robot "feel" how hard it's pressing, just like your fingertips do.
The Eyes: They strapped two cameras to the wrist, pointing right at the knife and the fruit. This gives the robot a close-up view of exactly where the blade is touching the skin.

2. Stage One: The "Shadowing" Phase (Learning the Basics)

First, they needed to teach the robot the basics. You can't just tell a robot "peel this"; it doesn't know what that means.

The Method: A human operator used a 3D mouse (like a high-tech joystick) to guide the robot's arm through the peeling motion. The robot watched and recorded the human's movements, the force they used, and what the cameras saw.
The Analogy: Think of this like an apprentice chef watching a master. The apprentice doesn't just memorize the recipe; they watch the master's hand pressure, the angle of the knife, and the speed.
The Result: After watching about 50 to 200 peeling sessions, the robot learned a "Base Policy." It could now peel fruits it had seen before with about 90% success. It was good, but not perfect.

3. Stage Two: The "Critique" Phase (Learning Human Taste)

Here is where the magic happens. The robot was good at removing the skin, but maybe the strips were jagged, or it cut too deep. How do you teach a robot "good taste"?

The Problem: You can't easily write a math equation for "smoothness."
The Solution: The researchers asked humans to grade the robot's peeling jobs. They gave scores based on two things:
1. Quantitative (The Ruler): How thick was the peel? (Too thin? Too thick?)
2. Qualitative (The Artist): Did it look nice? Was it continuous? (This is subjective, like judging a painting).
The Reward Model: They fed these human grades into a computer program (an AI "Critic"). This program learned to predict: "If the robot does X, a human will give it a 9/10. If it does Y, a human will give it a 2/10."
The Fine-Tuning: Now, the robot practiced again, but this time, it listened to the "Critic." If it made a move that the Critic said was "bad," the robot adjusted its behavior to get a higher score. It's like a student studying for a test, not just to pass, but to get an A+.

4. The Superpower: Zero-Shot Generalization

The most impressive part of this paper is Generalization.

The Test: They trained the robot only on cucumbers. Then, they handed it a potato, an apple, a pear, and a daikon radish.
The Result: The robot didn't panic. It figured out how to peel these totally different shapes and textures without any extra training.
The Analogy: Imagine you learn to ride a bicycle. Then, someone hands you a motorcycle. You don't know exactly how to ride it, but because you understand balance, steering, and speed, you can figure it out quickly. The robot learned the principles of peeling, not just the specific shape of a cucumber.

Why This Matters

Most robots are great at picking up boxes (which are all the same shape) but terrible at delicate tasks like cooking or surgery.

The Bottleneck: Usually, robots fail because we can't collect enough data, or we can't define what "success" looks like.
The Breakthrough: This paper shows that if you combine force sensing (feeling), human demonstration (watching), and human preference (grading), you can teach robots to do delicate, messy, real-world tasks with very little data.

In a Nutshell

The researchers taught a robot to peel fruit by:

Letting a human show it the ropes (Teleoperation).
Giving the robot "feel" and "sight" (Sensors).
Asking humans to grade the results and teaching the robot to chase those high grades (Preference Learning).

The result? A robot that can peel a potato, an apple, or a cucumber with the precision of a skilled chef, proving that robots can finally handle the messy, "fuzzy" tasks of the real world.

1. Problem Statement

The paper addresses the challenge of autonomous robots performing contact-rich, force-sensitive manipulation tasks where success is defined by continuous, subjective, and "implicit" criteria rather than binary outcomes.

Specific Task: Peeling produce (e.g., apples, potatoes, cucumbers) with a knife.
Key Challenges:
- Dynamics: Requires precise force regulation to maintain unstable blade-surface contact without slipping or cutting too deep.
- Generalization: Must handle diverse geometries, sizes, textures, and stiffness of natural produce.
- Evaluation: Success is not just "peeled vs. not peeled." It involves qualitative metrics like peel thickness, continuity, smoothness, and efficiency, which are difficult to quantify mathematically and hard to align with human preferences.
- Data Scarcity: Collecting high-quality demonstration data for such delicate tasks is expensive and difficult to scale.

2. Methodology

The authors propose a two-stage learning framework that combines efficient data collection, force-aware imitation learning, and preference-based fine-tuning.

A. System Design & Hardware

Robot: 7-DoF Kinova Gen3 arm with impedance control.
End-Effector: Custom mount holding a knife, an ATI mini45 Force-Torque (F/T) sensor, and two RealSense D405 wrist cameras.
Control: Low-level impedance control runs at 500Hz; high-level commands at 10Hz.

B. Stage 1: Data Collection & Base Policy Training

Data Collection: Uses SpaceMouse teleoperation (3Dconnexion) to control the 6-DoF end-effector. This method proved superior to VR teleoperation or kinesthetic teaching for this specific task due to better precision and stability.
Preprocessing:
- Force data is standardized (subtracting initial mean).
- Vision data uses SAM2 for online segmentation of the knife and object.
- Proprioception is represented as delta end-effector pose (relative trajectory) to enable generalization to different base positions.
Base Policy: Trained using Diffusion Policies.
- Inputs: Visual features (grayscale RGB + depth + masks) and Force-Torque readings.
- Architecture: ResNet-18 for vision, MLP for force, and a Diffusion Policy with an MLP denoiser.
- Goal: Learn a robust initial policy that generalizes across object variations (zero-shot) with a baseline success rate of ~60–80%.

C. Stage 2: Preference-Based Fine-Tuning

To align the policy with human notions of quality, the authors introduce a reward model and a residual policy.

Reward Design (Hybrid):
- Quantitative: Measures local peel thickness (categorized into 6 discrete levels, e.g., "too thin," "nominal," "too thick").
- Qualitative: A holistic score (0–9 Likert scale) based on visual appearance (continuity, smoothness, defects).
Reward Model ( $r_\psi$ ): A 3-layer MLP trained offline to predict human preference scores from state-action pairs. It outputs both a scalar reward and a hidden representation ( $h_t$ ) capturing structured preference features.
Residual Policy ( $\pi_{res}$ ):
- The base policy ( $\pi_{base}$ ) is frozen.
- A residual MLP predicts action corrections ( $a_{res}$ ) conditioned on the base latent features, base action, and the reward model's hidden representation.
- Final Action: $a_{final} = a_{base} + a_{res}$ .
Training Objective: Reward-Weighted Behavioral Cloning. The loss function encourages the residual to match the difference between the dataset action and the base action, weighted by an exponential function of the predicted reward ( $w_t = \exp(\beta r_t)$ ). This emphasizes high-quality corrections.

3. Key Contributions

Two-Stage Learning Framework: A pipeline integrating compliant data collection, force-aware imitation learning, and preference-based fine-tuning, specifically designed for fine-grained manipulation.
Preference-Based Reward Model: Demonstrates how to define human preference via a hybrid of quantitative (thickness) and qualitative (visual) metrics, learning a reward model to drive significant policy improvements without new expert demonstrations.
Data-Efficient Generalization: Shows that policies trained on a single produce type (e.g., cucumber) can zero-shot generalize to unseen instances and out-of-distribution categories (e.g., zucchini, pear, daikon) while maintaining >90% success rates.
Systematic Evaluation: Provides a comprehensive benchmark for knife-based peeling, including ablation studies on data modalities, camera placement, and reward design.

4. Experimental Results

Success Rates:
- Seen Produce: 100% success rate on training data (cucumber, apple, potato).
- Unseen Produce: High zero-shot generalization (e.g., Cucumber policy on Zucchini: 50%; Apple on Pear: 90%; Potato on Daikon: 80%).
- Overall: >90% average success rate across diverse produce.
Performance Improvement: Preference-based fine-tuning improved performance by up to 40% compared to the base policy alone.
Sample Efficiency:
- Achieved high performance with only 50–200 trajectories (approx. 8–33 fruits).
- Cucumber policy reached 100% success with 50 trajectories (avg. score 8.5).
Ablation Studies:
- Data Collection: SpaceMouse teleoperation yielded the highest quality data (100% success, avg score 8.5) compared to VR or kinesthetic teaching.
- Modalities: Combining Grayscale RGB, Depth, and Force was critical. Using raw RGB hurt generalization; Force was essential for contact tasks.
- Cameras: Using two wrist cameras (one "before" and one "after" the cut) outperformed single-camera setups. The "before" camera was particularly important due to less occlusion.
- Fine-tuning Strategy: The proposed Residual + Reward-Weighted BC approach significantly outperformed baselines like IQL (Offline RL), fine-tuning from scratch, or using only quantitative/qualitative rewards.

5. Significance

Bridging the Gap: This work moves beyond binary success metrics in robotics, tackling tasks where "quality" is subjective and continuous. It proves that robots can learn to align with human aesthetic and functional preferences in complex physical interactions.
Scalability: By demonstrating that high-quality policies can be learned from a small amount of data (50–200 trajectories) and generalize to unseen objects, the framework offers a practical path toward general-purpose manipulation systems.
Real-World Applicability: The system operates on real hardware with real produce, handling the non-linearities and deformations of biological objects, which are often simplified in simulation.
Future Direction: The paper highlights the potential for mixed-autonomy data collection and online reinforcement learning to further reduce reliance on human teleoperation, while also addressing the environmental cost of food waste in research through surrogate produce or simulation.

In summary, the paper presents a robust solution for one of the most difficult manipulation tasks—knife peeling—by combining force-aware control with a novel human-preference alignment strategy, achieving high precision and generalization with minimal data.