GDA-YOLO11: Amodal Instance Segmentation for Occlusion-Robust Robotic Fruit Harvesting

This paper introduces GDA-YOLO11, a novel amodal instance segmentation framework for occlusion-robust robotic fruit harvesting. By inferring complete fruit shapes and accurately estimating picking points, it outperforms existing models on segmentation metrics and achieves higher picking success rates under varying levels of occlusion.

Caner Beldek, Emre Sariyildiz, Son Lam Phung, Gursel Alici

Published 2026-03-02

Imagine a robot trying to pick an orange from a tree. The problem? The orange is hiding behind a bunch of leaves.

Most robots today are like people with very bad eyesight: if they can't see the whole orange, they get confused. They might think the orange isn't there, or they might try to grab just the tiny sliver of orange they can see, missing the fruit entirely or damaging the tree. This leads to wasted food and unhappy farmers.

This paper introduces a new "super-robot brain" called GDA-YOLO11 that solves this problem. Here is how it works, explained simply:

1. The "X-Ray Vision" Trick (Amodal Segmentation)

Imagine you are looking at a car parked behind a fence. You can only see the wheels and part of the roof. A normal camera registers only those visible parts.

GDA-YOLO11 is different. It uses a technique called Amodal Segmentation. Think of it as the robot having "X-ray vision" or a "mental imagination." Even if the robot only sees 30% of the fruit, it doesn't just guess; it mathematically "draws" the invisible 70% of the fruit in its mind. It knows exactly where the whole fruit is, even the parts hidden behind leaves.
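To make the "invisible 70%" concrete, here is a toy Python illustration of the difference between a visible (modal) mask and an amodal mask. The masks are hand-made for illustration; the paper's network predicts the amodal mask directly from the image, and nothing below is the authors' code.

```python
import numpy as np

# Toy scene: a disc-shaped "fruit" partly hidden behind a "leaf".
H, W = 8, 8
yy, xx = np.ogrid[:H, :W]

# Amodal mask: the WHOLE fruit, a disc of radius 3 centered at (4, 4).
amodal = ((yy - 4) ** 2 + (xx - 4) ** 2) <= 9

# A leaf occludes the left part of the image.
leaf = xx < 4

# Visible (modal) mask: only the part of the fruit the camera can see.
visible = amodal & ~leaf

occlusion_rate = 1.0 - visible.sum() / amodal.sum()
print(f"visible pixels: {visible.sum()}, amodal pixels: {amodal.sum()}")
print(f"occlusion rate: {occlusion_rate:.0%}")  # fraction of the fruit hidden
```

An amodal model is trained with labels like `amodal` while its input only shows `visible`, which is exactly the "draw the hidden part in its mind" behavior described above.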

2. How the Robot Learned to See Better

The researchers took a standard, fast robot brain (called YOLO11) and gave it three specific upgrades, like adding special tools to a Swiss Army knife:

  • The "Spotlight" (Global Attention Module): Imagine the robot is in a messy room full of leaves and fruit. It used to get distracted by every leaf. The new "Spotlight" module helps the robot ignore the noise (leaves) and focus intensely on the shape of the fruit, even if it's partially hidden.
  • The "Deep Dive" (Deepened Head): The robot's brain got a little deeper. Instead of just skimming the surface of the image, it now looks closer at the details. This helps it figure out exactly where the edge of the fruit is, even when it's blurry or cut off by a branch.
  • The "Strict Teacher" (Asymmetric Loss): When the robot was learning, it made mistakes. Usually, teachers punish mistakes equally. But this new "teacher" was stricter about one thing: missing the fruit. If the robot said, "I don't see the fruit," but the fruit was actually there (just hidden), the teacher gave it a big penalty. This forced the robot to be extra careful and try to find the fruit even when it was hard to see.

3. The "Safe Grab" Strategy

Once the robot "imagines" the whole fruit, it needs to grab it.

  • Finding the Sweet Spot: Instead of grabbing the edge (which might be a leaf), the robot calculates the "center of gravity" of the whole fruit, including the invisible parts. It's like finding the center of a balloon even if someone is holding a piece of paper in front of it. (A short sketch of this step follows this list.)
  • The Approach: The robot moves its arm to a safe spot, then gently pushes forward to grab the fruit. Because it knows where the whole fruit is, it doesn't accidentally squeeze a leaf or miss the fruit.
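Following the digest's description, the picking point is the center of mass of the full amodal mask. Below is a short sketch of that step, reusing the toy masks from the first snippet; mapping the 2D point into a 3D grasp pose for the arm is part of the real pipeline but omitted here.

```python
import numpy as np

def centroid(mask):
    """Center of mass (row, col) of a binary mask."""
    rows, cols = np.nonzero(mask)
    return rows.mean(), cols.mean()

# Same toy scene as before: a disc-shaped fruit, left side hidden by a leaf.
H, W = 8, 8
yy, xx = np.ogrid[:H, :W]
amodal = ((yy - 4) ** 2 + (xx - 4) ** 2) <= 9  # whole fruit, hidden parts too
visible = amodal & (xx >= 4)                   # only the unoccluded part

print("amodal centroid :", centroid(amodal))   # lands on the true center (4, 4)
print("visible centroid:", centroid(visible))  # biased toward the visible side
```

Aiming at the amodal centroid rather than the visible centroid is what keeps the gripper from closing on the edge of the fruit, or on a leaf.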

4. The Results: Does it Work?

The researchers tested this in a lab with a fake tree and real oranges. They covered the oranges with leaves to different degrees:

  • No leaves: Both the old robot and the new robot did great.
  • A few leaves: Both did well.
  • Heavy leaf cover: This is where the magic happened.
    • The old robot (YOLO11) got confused and failed to pick about 80% of the heavily hidden fruits.
    • The new robot (GDA-YOLO11) did much better, successfully picking nearly twice as many of the hidden fruits as the old one.

The Big Picture

Think of this technology as giving a robotic farmer a pair of glasses that let them see through the clutter of a garden. By teaching the robot to "imagine" the full shape of the fruit, they can harvest more food, waste less, and work in messy, real-world fields where leaves and branches are always in the way.

This is the first time this kind of "imagination" has been successfully tested on a real robot arm picking real fruit, paving the way for fully autonomous farms in the future.
