Imagine a strawberry farm. It's not a neat, factory-like assembly line; it's a messy, living jungle. The strawberries are hidden under leaves, the sunlight glares off wet fruit making it hard to see, and the berries are so delicate that a tiny squeeze can turn them into mush.
For decades, robots have struggled here. Traditional robots are like rigid, rule-following accountants. They need a perfect map, exact measurements, and a clear line of sight. If a leaf blocks the view or the light changes, the accountant robot freezes or breaks the fruit.
This paper introduces HarvestFlex, a new approach that treats the robot more like a skilled, intuitive human picker. Instead of being programmed with rigid rules, the robot learns by watching a human do the job through a VR headset, then tries to copy that "feel" and intuition.
Here is the breakdown of how they did it, using simple analogies:
1. The "Three-Eyed" Robot
Traditional robots often rely on 3D depth sensors (like a laser scanner) to see the world. The HarvestFlex team skipped the lasers and used just three regular cameras (RGB), much as humans use their eyes.
- Two "Scene" Eyes: These are fixed cameras looking at the whole table, giving the robot a wide view to find the strawberries (like looking at a map).
- One "Wrist" Eye: This camera is attached to the robot's hand. It zooms in on the specific berry, helping the robot see exactly how to grab it without squishing it (like looking through a magnifying glass).
- The Trick: They didn't use complex 3D math to calibrate these cameras. They just let the robot learn from the pictures, much like a baby learns to grab a toy by looking at it, not by calculating the distance in meters.
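The three-camera setup above can be pictured as a plain dictionary of raw image arrays handed straight to the learning system. This is only an illustrative sketch: the camera names and resolutions are assumptions, not details from the paper. The point is what's *missing*: there is no calibration or 3D reconstruction step anywhere.

```python
import numpy as np

def get_observation():
    """One uncalibrated RGB frame per camera (resolutions are assumed)."""
    return {
        "scene_left": np.zeros((480, 640, 3), dtype=np.uint8),   # fixed wide view
        "scene_right": np.zeros((480, 640, 3), dtype=np.uint8),  # fixed wide view
        "wrist": np.zeros((480, 640, 3), dtype=np.uint8),        # close-up on the berry
    }

# No intrinsics, extrinsics, or depth maps: the policy consumes raw pixels
# directly, learning hand-eye coordination from data rather than geometry.
obs = get_observation()
```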
2. The "VR Teacher"
How do you teach a robot to pick a strawberry without breaking it? You don't write code for every possible leaf position. Instead, you show it.
- The researchers used a VR headset (like a Meta Quest) to let a human operator "drive" the robot remotely.
- The human wore the headset, saw what the robot saw, and used hand controllers to gently pick strawberries.
- The robot recorded 3.7 hours of this "teleoperation" (about 227 picking sessions). It's like the robot watching a master chef cook a meal 200 times and then trying to cook it itself.
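A quick back-of-the-envelope on the dataset above: 3.7 hours spread over 227 sessions works out to roughly a minute per pick. The field names in the example timestep are illustrative assumptions; only the hours and session count come from the summary.

```python
# Figures from the summary above.
total_hours = 3.7
num_sessions = 227

# Average demonstration length: 3.7 h * 3600 s/h / 227 sessions ≈ 59 s each.
avg_session_s = total_hours * 3600 / num_sessions

# A logged timestep might pair what the operator saw with what they did
# (hypothetical schema, for illustration only):
example_step = {
    "images": ["scene_left.jpg", "scene_right.jpg", "wrist.jpg"],
    "joint_positions": [0.0] * 7,   # arm pose commanded via the VR controllers
    "gripper": 0.0,                 # open/close signal from the hand controller
}
```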
3. The "Brain" (VLA Policy)
The robot uses a special type of AI called a Vision-Language-Action (VLA) model.
- Vision: It sees the strawberry.
- Language: It understands a simple command like, "Pick all the ripe strawberries and put them in the tray."
- Action: It doesn't just say "I see a berry." It immediately decides, "I need to move my arm left, then gently suck the berry, then twist it off."
- Think of it as a translator that turns "I see a red fruit" directly into "Move arm 5cm left, squeeze gently."
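The vision-to-action "translator" above boils down to a simple input/output contract. The stub below is a minimal sketch of that interface only; the real HarvestFlex architecture, action dimensions, and chunk length are not specified here, so those numbers are assumptions and the model body is a placeholder.

```python
import numpy as np

class VLAPolicy:
    """Sketch of a Vision-Language-Action interface (I/O shapes assumed)."""

    def __init__(self, action_dim=7, chunk_len=16):
        self.action_dim = action_dim  # e.g. 6 arm DoF + 1 gripper (assumption)
        self.chunk_len = chunk_len    # actions predicted per inference call

    def predict(self, images, instruction):
        """images: dict of camera name -> HxWx3 uint8 array; instruction: str."""
        assert instruction and len(images) == 3
        # A real model would encode pixels and text and decode motor commands;
        # zeros here just illustrate the shape of the output.
        return np.zeros((self.chunk_len, self.action_dim))

policy = VLAPolicy()
chunk = policy.predict(
    {"scene_left": np.zeros((480, 640, 3), np.uint8),
     "scene_right": np.zeros((480, 640, 3), np.uint8),
     "wrist": np.zeros((480, 640, 3), np.uint8)},
    "Pick all the ripe strawberries and put them in the tray.",
)
```

Note that the output is a *chunk* of future actions, not a single command; that chunking is what makes the asynchronous execution in the next section possible.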
4. The "Two-Speed" System (Synchronous vs. Asynchronous)
One of the biggest discoveries was how the robot thinks vs. how it moves.
- The Old Way (Synchronous): The robot takes a picture, stops moving, thinks hard about what to do, moves, then stops again to take another picture. It's like a driver who pulls over to study the map before every turn. It's slow and jerky.
- The New Way (Asynchronous): The robot's "brain" (the AI) thinks in the background while the "hands" (the motors) keep moving smoothly. It's like a driver who glances at the map while driving, keeping a steady speed.
- Result: The "Two-Speed" system was much smoother and less likely to drop the fruit because the robot didn't freeze up while thinking.
5. The Results: Good, but not Perfect
After training on just a few hours of video, the robot achieved some impressive stats:
- Success Rate: It successfully picked about 74% of the strawberries.
- Speed: It took about 32 seconds per berry. (This is slower than a human, but it's a huge leap for a robot in such a messy environment).
- Damage: It only damaged about 4% of the fruit.
Where did it fail?
Sometimes the robot got confused by heavy shadows or leaves hiding the berry. Sometimes, it would grab the berry, but the berry would spin instead of coming off the stem. These are the "contact dynamics" problems: things that are easy for a human hand to feel but hard for a robot to predict.
The Big Picture
This paper is a proof-of-concept. It shows that we don't need to build a super-expensive, perfectly calibrated robot to pick strawberries. Instead, we can build a robot that learns by watching, uses simple cameras, and adapts to the messy reality of a farm.
It's the difference between teaching a robot a script (which breaks if the script changes) and teaching a robot intuition (which allows it to handle the unexpected). While it's not quite ready to replace all farm workers yet, it's a massive step toward robots that can work in the real, messy world, not just in a perfect lab.