CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping

This paper proposes a CLIP-guided multi-task regression framework that leverages level-aware vision-language embeddings to robustly predict plant age and leaf count from multi-view imagery, achieving significant accuracy improvements on the GroMo25 benchmark while simplifying the pipeline and handling incomplete inputs.

Simon Warmers, Muhammad Zawish, Fayaz Ali Dharejo, Steven Davy, Radu Timofte

Published 2026-03-05

Imagine you are trying to guess how old a child is and how many teeth they have just by looking at photos of them. Now, imagine you have 120 photos of that same child: some taken from the front, some from the back, some from high up, and some from low down.

If you just looked at all 120 photos, you'd get confused. A photo taken from the floor might make a small child look like a giant, while a photo from the ceiling might make a tall child look tiny. Plus, looking at 120 nearly identical photos is a waste of time for a computer—it's like reading the same page of a book 120 times to understand the story.

This is exactly the problem scientists face when trying to measure plant growth (phenotyping) using robots or cameras. This paper presents a clever new way to solve that puzzle.

The Problem: Too Many Angles, Not Enough Clues

In the past, researchers tried to solve this by building two separate teams of robots:

  1. One team to guess the plant's age.
  2. Another team to count the leaves.

They also tried to be "smart" by picking only a few photos to look at, hoping to avoid the confusion of too many angles. But this was clunky. If a robot missed a photo or if the camera was shaky, the whole system could fail. It was like trying to solve a mystery with two different detectives who never talk to each other.

The Solution: The "Bilingual" Detective

The authors propose a single, super-smart detective that does both jobs at once. They call this a "Vision-Language" model. Here is how it works, using simple analogies:

1. The "Smart Glasses" (CLIP)

The model uses a technology called CLIP. Think of CLIP as a detective who has read millions of books and seen millions of pictures. It doesn't just see "green leaves"; it understands concepts like "a young sprout" or "a mature bush."

  • The Trick: Instead of just looking at the picture, this detective can also "read" a note. If you tell it, "This photo was taken from a low angle," it instantly adjusts its brain to understand that the plant might look bigger than it really is.
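The "read a note" trick can be sketched in a few lines. This is a toy stand-in, not the paper's actual architecture: random vectors play the role of CLIP's ~512-dimensional image and text embeddings, and the prompt wording is hypothetical. The point is the mechanism: image and text live in one shared space, views are scored against prompts by cosine similarity, and the matching level prompt's embedding is fused into the visual feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    # Project onto the unit sphere, as CLIP does before comparing embeddings.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

D = 512  # CLIP-like embedding width

# Toy stand-in for the image encoder's output for one photo.
image_embedding = normalize(rng.standard_normal(D))

# Level-aware text prompts (hypothetical wording, one per camera height),
# and toy stand-ins for their text-encoder embeddings.
prompts = [f"a photo of a plant taken from camera level {k}" for k in range(1, 6)]
text_embeddings = normalize(rng.standard_normal((5, D)))

# CLIP-style scoring: cosine similarity of normalized embeddings.
similarities = text_embeddings @ image_embedding  # shape (5,)

# Conditioning on a known level: fuse that level's text embedding
# into the image feature (simple additive fusion for illustration).
level = 2  # suppose metadata says this shot came from level 3 (index 2)
fused = normalize(image_embedding + text_embeddings[level])
```

With the level hint folded in, downstream layers see a feature that already "knows" the camera height, so a low-angle shot no longer masquerades as a giant plant.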

2. The "Group Hug" (Aggregating Views)

Instead of treating the 24 photos taken from different angles at the same camera height as 24 separate puzzles, the model takes all 24 photos and gives them a "group hug." It averages them out into a single summary.

  • Why? If one photo is blurry or blocked by a leaf, the other 23 photos save the day. The result is one perfect, "angle-proof" summary of what the plant looks like, regardless of where the camera was standing.
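The "group hug" is just mean-pooling over per-view embeddings. A minimal sketch, again with random vectors standing in for real CLIP features; note how losing a few views changes nothing structurally, which is why the model tolerates missing photos.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 512        # embedding width
n_views = 24   # camera angles at one height

# One embedding per camera angle (toy values standing in for CLIP features).
view_embeddings = rng.standard_normal((n_views, D))

# The "group hug": mean-pool across views into one angle-robust descriptor.
plant_embedding = view_embeddings.mean(axis=0)

# Missing views (a blocked lens, a dropped frame) just shrink the pool;
# averaging over whatever remains is still well-defined.
available = view_embeddings[:20]  # pretend 4 photos were lost
robust_embedding = available.mean(axis=0)
```

Because the average is taken over however many views survive, one blurry or occluded photo gets diluted by the other 23 instead of derailing the prediction.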

3. The "Height Hint" (Level-Awareness)

This is the secret sauce. The model knows that a plant looks different if you look at it from the ground versus from a ladder.

  • The Analogy: Imagine looking at a tree. From the ground, you see the trunk. From a ladder, you see the leaves. If you don't know where you are looking from, you might think the tree is two different trees!
  • The Fix: The model asks itself, "What level is this?" If the camera data is missing, the model guesses the level based on the picture and then uses that guess to adjust its final answer. It's like a detective who says, "I think this photo was taken from the second floor, so I'll adjust my age estimate accordingly."
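The "guess the level" fix can be sketched as a small classifier head whose soft prediction weights a set of learned per-level embeddings. Everything here is a toy illustration (random weights, additive fusion); the real model's head and fusion may differ. The useful detail is that a *soft* guess blends neighbouring levels when the model is unsure, instead of hard-committing to one floor.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 512
n_levels = 5

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

image_embedding = rng.standard_normal(D)

# Hypothetical linear head mapping the image feature to level logits.
W_level = rng.standard_normal((n_levels, D)) * 0.01
level_probs = softmax(W_level @ image_embedding)
predicted_level = int(level_probs.argmax())

# Learned per-level embeddings (toy values). The probability-weighted sum
# is the "height hint" injected into the final feature.
level_embeddings = rng.standard_normal((n_levels, D))
level_hint = level_probs @ level_embeddings

conditioned = image_embedding + level_hint
```

If the camera metadata *is* present, `level_probs` can simply be replaced by a one-hot vector for the known level, so the same fusion path handles both cases.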

The Results: A Big Win

The researchers tested this on a famous dataset called GroMo25, which involves mustard, radish, and wheat plants.

  • Old Way: The previous best methods were off by about 7.7 days when guessing age and 5.5 leaves when counting.
  • New Way: Their new single-model detective reduced the error to just 3.9 days and 3.1 leaves.
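To put those numbers in proportion, the relative error reduction is easy to compute from the figures above:

```python
old_age_mae, new_age_mae = 7.7, 3.9     # days
old_leaf_mae, new_leaf_mae = 5.5, 3.1   # leaves

age_reduction = (old_age_mae - new_age_mae) / old_age_mae
leaf_reduction = (old_leaf_mae - new_leaf_mae) / old_leaf_mae

print(f"age error cut by {age_reduction:.0%}")    # ~49%
print(f"leaf error cut by {leaf_reduction:.0%}")  # ~44%
```

So the age error is roughly halved and the leaf-count error drops by well over a third.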

That's a massive improvement: the error is roughly cut in half. It's like going from guessing a person's age to within a week to guessing it to within four days.

Why This Matters

  1. One Tool, Two Jobs: You don't need two separate systems. One model does it all, saving money and computer power.
  2. Forgiving: If a farmer's robot misses a few photos because of a glitch or a leaf blocking the lens, this model doesn't crash. It uses its "bilingual" brain (pictures + text clues) to fill in the gaps.
  3. Future Farming: This helps farmers monitor crops without touching them, leading to better food production with less waste.
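The "one tool, two jobs" point boils down to a shared backbone feeding two lightweight regression heads. A toy sketch with made-up linear heads and bias values (the paper's actual heads may be deeper): the expensive feature extraction is paid for once, then both answers are read off the same embedding.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 512

# Pooled, level-aware plant feature from the shared backbone (toy values).
shared_embedding = rng.standard_normal(D)

# Two small regression heads on one shared feature (hypothetical weights):
# one predicts age in days, the other predicts the leaf count.
w_age, b_age = rng.standard_normal(D) * 0.01, 30.0
w_leaf, b_leaf = rng.standard_normal(D) * 0.01, 10.0

age_days = float(w_age @ shared_embedding + b_age)
leaf_count = float(w_leaf @ shared_embedding + b_leaf)

print(f"age ~ {age_days:.1f} days, leaves ~ {leaf_count:.1f}")
```

Training both heads jointly also lets the tasks help each other, since age and leaf count are correlated signals of the same underlying growth.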

In a nutshell: The authors built a single, smart AI that looks at many photos of a plant, ignores the confusing camera angles, uses "text clues" to understand the perspective, and accurately guesses both the plant's age and leaf count—even if some photos are missing. It's like having a super-observant gardener who never gets confused by where they are standing.