SG-DOR: Learning Scene Graphs with Direction-Conditioned Occlusion Reasoning for Pepper Plants

This paper introduces SG-DOR, a relational framework that uses a direction-aware graph neural network to infer scene graphs encoding physical attachments and direction-conditioned occlusion for robotic harvesting of pepper plants in dense canopies.

Rohit Menon, Niklas Mueller-Goldingen, Sicong Pan, Gokul Krishna Chenchani, Maren Bennewitz

Published 2026-03-09

Imagine you are a robot chef trying to pick a ripe pepper from a bush. The problem? The bush is a tangled mess. The pepper you want is hiding behind a wall of leaves, and you can't see the stem holding it. If you just reach in blindly, you might snap the stem, crush the fruit, or get your arm stuck in a thicket of leaves.

To solve this, you don't just need to see the pepper; you need to understand the relationships between the pepper, the leaves, and the stem. You need to know: "Which specific leaf is blocking my view? If I push that one aside, will the pepper be free?"

This paper introduces SG-DOR, a smart AI system designed to be that "brain" for agricultural robots. Here is how it works, explained simply:

1. The Problem: The "Blind Reach"

In a dense pepper plant, fruits are often hidden. Current robots are like people trying to find a specific book in a messy room by just looking at the top shelf. They can see the fruit, but they don't know what is hiding it or which way to move to see it better. They lack a mental map of "who is blocking whom."

2. The Solution: A "Social Network" for Plants

The authors created a system called SG-DOR (Scene Graphs with Direction-Conditioned Occlusion Reasoning). Think of this as building a social network profile for every single part of the plant.

  • The Nodes (The People): Every leaf, stem, and pepper is a "person" in this network.
  • The Edges (The Relationships): The system draws lines connecting them. It learns that "Leaf A is attached to Stem B" and "Leaf C is standing in front of Pepper D."
  • The Secret Sauce (Direction): This is the magic part. The system doesn't just ask, "Is the pepper hidden?" It asks, "Is the pepper hidden if I look from the top? What about if I look from the side?"
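The "social network" above can be sketched as a tiny data structure. This is a minimal illustration with hand-picked part names and occlusion scores, not the paper's actual representation; SG-DOR learns these relations with a graph neural network, whereas here they are hard-coded:

```python
# Nodes: every plant part gets an id and a type.
nodes = {
    "stem_B": "stem",
    "leaf_A": "leaf",
    "leaf_C": "leaf",
    "pepper_D": "fruit",
}

# Attachment edges: physical "is attached to" relations.
attached_to = [("leaf_A", "stem_B"), ("pepper_D", "stem_B")]

# Direction-conditioned occlusion edges:
# (occluder, target, viewing direction) -> occlusion score in [0, 1].
occludes = {
    ("leaf_C", "pepper_D", "front"): 0.8,
    ("leaf_C", "pepper_D", "top"): 0.1,
}

# The same leaf can hide the fruit from one direction but not another.
print(occludes[("leaf_C", "pepper_D", "front")])  # high from the front
print(occludes[("leaf_C", "pepper_D", "top")])    # low from the top
```

The key design point is that occlusion is not a property of a part, it is a property of a (part, target, direction) triple, which is exactly what lets the robot pick its viewpoint.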

3. How It Thinks: The "Crowded Room" Analogy

Imagine you are in a crowded room trying to talk to a friend (the pepper).

  • Old Robots: They see your friend but don't know who is standing in the way. They might try to push through the whole crowd.
  • SG-DOR: It acts like a super-observant host. It looks at the crowd and says:
    • "Okay, if you approach from the North, Leaf 1 is the main blocker. If you push Leaf 1, you're good."
    • "But if you approach from the East, Leaf 2 is the one blocking you."
    • It even ranks them: "Leaf 1 is the biggest problem, Leaf 2 is a minor problem."
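The "observant host" behavior amounts to a per-direction ranking query. Here is a toy sketch with made-up scores; in SG-DOR the scores come from the trained network, and the leaf and direction names below are purely illustrative:

```python
# (occluder, viewing direction) -> fraction of the target pepper it hides.
occlusion = {
    ("leaf_1", "north"): 0.7,
    ("leaf_2", "north"): 0.2,
    ("leaf_1", "east"):  0.1,
    ("leaf_2", "east"):  0.6,
}

def ranked_blockers(direction):
    """Return occluders for one approach direction, biggest blocker first."""
    scores = [(leaf, s) for (leaf, d), s in occlusion.items() if d == direction]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(ranked_blockers("north"))  # leaf_1 is the main blocker from the north
print(ranked_blockers("east"))   # leaf_2 takes over from the east
```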

4. How It Learned: The "Video Game" Training

You can't easily teach a robot this in a real greenhouse because it's too messy and hard to see the "truth" (you can't see the hidden parts of the plant).

  • The Simulation: The researchers built a massive, perfect video game world of pepper plants. In this game, they knew exactly where every leaf was and exactly how much it blocked the pepper from every angle.
  • The Training: They fed this game data to the AI. The AI played millions of rounds, learning to predict: "If I see this shape, and I'm looking from this angle, this specific leaf is the one hiding the fruit."
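Why does simulation make training possible? Because with full knowledge of the geometry, the "truth" is simply measured, not guessed. The paper works with full 3D plant models; the following is only a 1D stand-in with hypothetical numbers, showing how an exact occlusion label can be computed from known projected extents:

```python
def occluded_fraction(fruit, leaf):
    """Fruit and leaf are (left, right) intervals as projected into the
    camera image; return the fraction of the fruit the leaf covers."""
    overlap = min(fruit[1], leaf[1]) - max(fruit[0], leaf[0])
    width = fruit[1] - fruit[0]
    return max(0.0, overlap) / width

# Projected extents change with the viewing direction, and so does the label.
label_front = occluded_fraction((0.0, 1.0), (0.4, 1.2))  # leaf covers 60%
label_top = occluded_fraction((0.0, 1.0), (1.5, 2.0))    # leaf misses entirely
print(label_front, label_top)
```

Labels like these, computed from every viewing angle, are what the network is trained to predict from images alone.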

5. The Result: A "To-Do List" for Robots

When the robot looks at a real pepper plant, SG-DOR doesn't just give a picture. It gives a strategic plan:

  1. Identify: "Here is the target pepper."
  2. Analyze: "From your current angle, these three leaves are blocking it."
  3. Rank: "Leaf #1 is the biggest blocker. Leaf #2 is next. Leaf #3 is barely in the way."
  4. Act: "Robot, please gently push Leaf #1 aside first. Then you can grab the pepper."
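The four steps above can be sketched as a tiny planning routine. The push/grasp action names and the 0.1 threshold are assumptions for illustration; the real system would hand such a plan to a manipulation planner:

```python
def make_plan(target, blockers, threshold=0.1):
    """Push blockers above the threshold, biggest first, then grasp."""
    ranked = sorted(blockers.items(), key=lambda kv: kv[1], reverse=True)
    plan = [f"push {leaf}" for leaf, score in ranked if score > threshold]
    plan.append(f"grasp {target}")
    return plan

blockers = {"leaf_1": 0.7, "leaf_2": 0.3, "leaf_3": 0.05}  # occlusion scores
print(make_plan("pepper_D", blockers))
# leaf_3 barely blocks, so it is skipped:
# ['push leaf_1', 'push leaf_2', 'grasp pepper_D']
```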

Why This Matters

This isn't just about picking peppers; it's about precision. Instead of a robot blindly hacking away at a plant (which damages the crop), it acts like a skilled gardener who knows exactly which branch to move to reveal the fruit.

The paper proves that by teaching robots to understand direction and relationships (not just shapes), we can make them much better at harvesting crops in messy, real-world environments. It turns a chaotic bush into a structured map that a robot can actually navigate.