Optimizing Multi-Modal Models for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning
This paper proposes an approach to image-based shape retrieval that combines pre-aligned multi-modal encoders with a hard contrastive learning loss. It achieves state-of-the-art performance in both zero-shot and supervised settings and eliminates the need for explicit view-based supervision or view synthesis.
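
To make the two ingredients concrete, the following is a minimal sketch, not the paper's implementation. It assumes that images and shapes have already been embedded into a shared space by pre-aligned encoders (the embeddings here are plain lists of floats), that zero-shot retrieval is a cosine-similarity ranking, and that the "hard" contrastive loss is an InfoNCE loss whose negatives are reweighted by their similarity to the query. The summary does not specify the exact loss, so the weighting scheme, the function names `retrieve` and `hard_contrastive_loss`, and the parameters `tau` and `beta` are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(image_emb, shape_embs):
    """Zero-shot retrieval (assumed form): rank shape indices by
    cosine similarity to the image embedding, best match first."""
    scores = [cosine(image_emb, s) for s in shape_embs]
    return sorted(range(len(shape_embs)), key=lambda i: scores[i], reverse=True)

def hard_contrastive_loss(image_embs, shape_embs, tau=0.1, beta=1.0):
    """InfoNCE over matched (image_i, shape_i) pairs with hardness-weighted
    negatives (an assumed formulation). Requires at least two pairs.
    beta = 0 gives every negative weight 1, i.e. plain InfoNCE."""
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        # Temperature-scaled similarities of image i to every shape.
        sims = [cosine(image_embs[i], shape_embs[j]) / tau for j in range(n)]
        pos = math.exp(sims[i])
        neg_idx = [j for j in range(n) if j != i]
        # Negatives that look more like the query get larger weights;
        # weights are normalised to have mean 1.
        w = [math.exp(beta * sims[j]) for j in neg_idx]
        mean_w = sum(w) / len(w)
        neg = sum((wj / mean_w) * math.exp(sims[j]) for wj, j in zip(w, neg_idx))
        total += -math.log(pos / (pos + neg))
    return total / n
```

In this sketch, raising `beta` shifts the denominator toward the shapes most easily confused with the correct one, which is the intuition behind training on hard negatives; the supervised setting would minimise such a loss over matched image-shape pairs, while the zero-shot setting would use `retrieve` directly on the pre-aligned embeddings.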