Imagine you are trying to teach a robot to understand the world, but you only speak Vietnamese, and the robot currently only understands English.
Most of the world's smartest image-searching robots (like the famous CLIP) were trained on billions of English photos and descriptions. If you ask them to find a picture of "a girl in an Ao Dai" (a traditional Vietnamese dress), they might get confused because they've mostly seen "a girl in a dress" described in English. Translating the Vietnamese words to English to use these robots often loses the cultural nuance or adds "translation noise."
This paper introduces ViCLIP-OT, a new robot specifically trained to speak Vietnamese and understand Vietnamese images. But it doesn't just learn the words; it learns how to match pictures and sentences in a much smarter way.
Here is the breakdown using simple analogies:
1. The Problem: The "Language Barrier" and the "Mismatched Puzzle"
Imagine you have a giant box of puzzle pieces. Half are pictures of Vietnamese streets, and the other half are descriptions written in Vietnamese.
- Old Robots (CLIP): They try to match the pieces by looking at them one by one. "Does this picture of a street look like this sentence about a street?" It's a bit like a game of "Hot or Cold." It works okay, but it often misses the deeper connection because it doesn't see the big picture of how all the pieces relate to each other.
- The Gap: Because the robot was trained mostly on English, the "picture side" of its brain and the "word side" of its brain live in two different rooms. They don't talk to each other well. This is called the Modality Gap.
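If you'd like to see the "two rooms" idea in code, here is a minimal sketch of how the Modality Gap is often measured: as the distance between the center of the image-embedding cloud and the center of the text-embedding cloud. The embeddings below are random toy stand-ins, not outputs of any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a batch of image and text embeddings (hypothetical
# dimensions; a real CLIP-style model would produce these vectors).
image_emb = rng.normal(size=(8, 64))
text_emb = rng.normal(loc=0.5, size=(8, 64))  # shifted to mimic a gap

def normalize(x):
    """Project embeddings onto the unit sphere, as CLIP-style models do."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def modality_gap(img, txt):
    """Distance between the centroids of the two embedding clouds:
    one common way to quantify the 'two different rooms' effect."""
    img, txt = normalize(img), normalize(txt)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

gap = modality_gap(image_emb, text_emb)
print(gap)  # larger value = the two 'rooms' are further apart
```

A smaller gap after training means the picture side and the word side of the model have moved into the same room.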
2. The Solution: The "Traffic Controller" (Optimal Transport)
The authors added a special new feature to the robot called SIGROT (Similarity-Graph Regularized Optimal Transport).
Think of the robot's training process as a busy airport.
- The Old Way (Contrastive Learning): The robot tries to match a specific passenger (an image) to a specific flight (a text) by checking their IDs. If the IDs match, great. If not, they are sent to different gates. It's a strict, one-on-one check.
- The New Way (SIGROT): Imagine a super-smart Traffic Controller who looks at the entire terminal at once.
- The controller sees that Passenger A (a photo of a busy market) is similar to Passenger B (a photo of a market festival).
- The controller also sees that Flight X (text about "crowds") is similar to Flight Y (text about "festivals").
- Instead of just matching A to X, the controller arranges the whole terminal so that all market photos and all market texts are grouped together in a harmonious circle.
This "Traffic Controller" uses Optimal Transport, a mathematical concept that finds the most efficient way to move things from one place to another with the least amount of "effort" (or error). It ensures that the robot doesn't just match pairs, but understands the relationships between all the images and texts in the batch.
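The "Traffic Controller" can be sketched in a few lines. Below is the standard entropy-regularized Optimal Transport solver (Sinkhorn iterations) on a toy cost matrix; the paper's SIGROT adds a similarity-graph regularizer on top of this, which is not reproduced here.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iter=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost[i, j] is the 'effort' of matching image i to text j. The
    returned plan is a soft assignment whose rows and columns each sum
    to 1/n, so every image and every text gets matched overall rather
    than being checked one-on-one.
    """
    n, m = cost.shape
    K = np.exp(-cost / eps)                 # Gibbs kernel
    r, c = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(n_iter):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan

# Toy cost matrix: low cost on the diagonal = true image/text pairs.
cost = 1.0 - np.eye(4) + 0.1 * np.random.default_rng(1).random((4, 4))
plan = sinkhorn(cost)
print(np.round(plan, 3))  # mass concentrates on the cheap diagonal
```

Note that the plan spreads probability mass across the whole batch instead of forcing hard one-to-one matches, which is exactly the "look at the entire terminal at once" behavior from the analogy.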
3. The Result: A Perfect Matchmaker
By combining the old "Hot or Cold" game with the new "Traffic Controller," ViCLIP-OT became a master matchmaker.
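As a rough illustration of "combining" the two games, here is a hypothetical sketch: a CLIP-style contrastive loss plus an extra term that rewards similarity wherever the transport plan places mass. The weighting (`lam`) and the exact form of the OT term are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def info_nce(sim, temp=0.07):
    """CLIP-style contrastive ('Hot or Cold') loss: each image should
    pick out its own caption (the diagonal) from the batch."""
    logits = sim / temp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def combined_loss(sim, ot_plan, lam=0.5):
    """Hypothetical combination: contrastive term plus an OT alignment
    term that rewards high similarity where the plan puts mass.
    (The paper's exact weighting and OT term are not shown here.)"""
    ot_term = -(ot_plan * sim).sum()
    return info_nce(sim) + lam * ot_term

sim = np.eye(4) * 0.9 + 0.1           # toy image-text similarity matrix
plan = np.full((4, 4), 1 / 16)        # toy uniform transport plan
print(combined_loss(sim, plan))
```

The contrastive term keeps true pairs close; the OT term shapes the geometry of the whole batch at once.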
- Better Memory: When you ask it to find "a man holding apples," it doesn't just look for the word "apple." It understands the scene better because it learned how similar scenes relate to each other.
- Closing the Gap: The "Modality Gap" (the distance between the picture room and the word room) shrank. The robot's brain became more unified.
- Zero-Shot Superpower: Even when the robot saw a new type of Vietnamese image it had never seen before (like a specific local festival), it could still guess the right description because it learned the structure of the language, not just the specific words.
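Zero-shot matching itself is simple once the embeddings are good: embed the new image, embed every candidate Vietnamese caption, and pick the caption with the highest cosine similarity. The vectors below are tiny made-up stand-ins for real model outputs.

```python
import numpy as np

def best_caption(image_vec, caption_vecs):
    """Zero-shot matching: return the index of the caption whose
    embedding has the highest cosine similarity to the image."""
    img = image_vec / np.linalg.norm(image_vec)
    caps = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    return int(np.argmax(caps @ img))

# Toy embeddings standing in for model outputs (hypothetical values):
image_vec = np.array([0.9, 0.1, 0.2])
captions = np.array([
    [0.10, 0.90, 0.30],  # e.g. "a quiet beach at dawn"
    [0.88, 0.12, 0.20],  # e.g. "a crowded local festival"  <- closest
    [0.20, 0.30, 0.90],  # e.g. "a bowl of pho"
])
print(best_caption(image_vec, captions))  # → 1
```

No retraining is needed for new images or new captions: the model's shared embedding space does all the work.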
4. The Evidence
The team tested this new robot on three different Vietnamese datasets (like different neighborhoods in a city):
- UIT-OpenViIC: A general city of images.
- KTVIC: A neighborhood of daily life scenes.
- Crossmodal-3600: A global village with photos from everywhere.
The Score:
- The old English-trained robots (CLIP) scored around 61-62% on the main test.
- The new ViCLIP-OT scored 67-69%.
- In the "Zero-Shot" test (where the robot had to guess on completely new data), ViCLIP-OT beat the old robots by a huge margin (over 11% better).
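Scores like these are typically Recall@K: the fraction of images whose true caption appears among the top K retrieved captions (the summary doesn't state the exact metric, so treat this as the standard retrieval score). A minimal sketch:

```python
import numpy as np

def recall_at_k(sim, k=1):
    """Fraction of images whose true caption (same index as the image)
    appears among the top-k most similar captions."""
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [i in topk[i] for i in range(sim.shape[0])]
    return sum(hits) / len(hits)

# Toy similarity matrix: row i = image i scored against all captions;
# the true caption sits on the diagonal.
sim = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.1, 0.8],   # miss at k=1: top caption is 2, truth is 1
    [0.2, 0.3, 0.7],
])
print(recall_at_k(sim, k=1))  # 2 of 3 images hit
```

A jump from ~61% to ~69% on this kind of score means noticeably fewer wrong matches at the very top of the ranking.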
The Takeaway
This paper is like building a specialized translator for a specific culture. Instead of forcing Vietnamese to fit into an English mold, they built a system that respects the unique structure of Vietnamese images and text. By using a "Traffic Controller" (Optimal Transport) to organize the learning process, they created the first foundation model that truly understands the Vietnamese visual world, making it much easier to search for images, build smart assistants, and organize multimedia for Vietnamese speakers.
In short: They taught a robot to see and read Vietnamese not just by memorizing words, but by understanding the relationships between everything it sees and reads.