ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport
This paper introduces ViCLIP-OT, a novel foundation vision-language model for Vietnamese image-text retrieval. ViCLIP-OT combines CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport loss, achieving state-of-the-art retrieval performance in both in-domain and zero-shot settings.