CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping
This paper proposes a CLIP-guided multi-task regression framework that uses level-aware vision-language embeddings to jointly predict plant age and leaf count from multi-view imagery. The framework reports significant accuracy improvements on the GroMo25 benchmark, simplifies the pipeline, and remains robust when some input views are missing.
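
To make the idea concrete, the following is a minimal sketch of one way a shared embedding could feed two regression outputs while tolerating missing views. It is not the authors' architecture: the masked-mean pooling, the concatenation with a level embedding, the linear heads, and every name (`masked_mean`, `predict`, `w_age`, `w_leaf`) are assumptions made for illustration, and random vectors stand in for real CLIP image and text embeddings.

```python
import numpy as np

def masked_mean(view_embeddings, view_mask):
    """Average per-view embeddings, ignoring views flagged as missing."""
    mask = view_mask.astype(view_embeddings.dtype)[:, None]  # shape (V, 1)
    count = max(mask.sum(), 1.0)  # avoid division by zero if all views are missing
    return (view_embeddings * mask).sum(axis=0) / count

def predict(view_embeddings, view_mask, level_embedding, w_age, w_leaf):
    """Two linear heads on one fused feature (hypothetical multi-task setup)."""
    pooled = masked_mean(view_embeddings, view_mask)
    # Assumed fusion: concatenate the pooled image feature with a level embedding.
    fused = np.concatenate([pooled, level_embedding])
    return float(fused @ w_age), float(fused @ w_leaf)

# Placeholder data: 4 views of dimension 8, with the third view missing.
rng = np.random.default_rng(0)
D = 8
views = rng.normal(size=(4, D))
mask = np.array([True, True, False, True])
level = rng.normal(size=D)       # stand-in for a level-aware text embedding
w_age = rng.normal(size=2 * D)   # untrained weights, for illustration only
w_leaf = rng.normal(size=2 * D)

age, leaf = predict(views, mask, level, w_age, w_leaf)
```

Because the missing view is masked out before pooling, its contents have no effect on either prediction, which is the behavior a pipeline needs if it is to accept incomplete inputs without a separate imputation step.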