Imagine you are the librarian of the world's most chaotic, magical library. In this library, you don't just have books (text); you have paintings, videos, blueprints, and even 3D sculptures. People walk in and ask for things like, "Find me a picture of a cat that looks like it's wearing a tuxedo," or "Show me a video of a dog running that matches this poem."
This is the challenge of Universal Multimodal Retrieval (UMR). It's about building a system that can understand and find anything, no matter what form it takes.
The paper introduces a new system called U-MARVEL (Universal MultimodAl RetrieVal via Embedding Learning). Think of U-MARVEL as a super-smart, highly trained librarian who has learned the secret art of organizing this chaotic library better than anyone else.
Here is how they built this super-librarian, explained through simple analogies:
1. The Problem: The "Last Word" Trap
Before U-MARVEL, most librarians (AI models built on large language models) had a bad habit. When compressing a long sentence or a complex image into a single summary (an embedding), they would keep only the representation of the very last word, the final token, and throw the rest away.
- The Analogy: Imagine reading a whole novel but only remembering the final period. You might know the book ended, but you missed the plot, the characters, and the emotion.
- The Fix: The researchers realized that to understand the whole story, you need to look at the entire input and take an average over all the words and image patches, a strategy known as mean pooling. They taught U-MARVEL to summarize the whole picture and the whole text together, rather than just the end. This made the librarian much smarter at understanding context.
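The difference between the "last word" habit and the averaging fix can be sketched in a few lines. This is a toy illustration, not the paper's actual code; the function names and array shapes are made up for the example:

```python
import numpy as np

def last_token_embedding(hidden_states, attention_mask):
    """The old habit: keep only the final non-padding token's vector."""
    last_idx = int(attention_mask.sum()) - 1
    return hidden_states[last_idx]

def mean_pooled_embedding(hidden_states, attention_mask):
    """The fix described above: average every non-padding token."""
    mask = attention_mask[:, None]                 # shape (seq_len, 1)
    return (hidden_states * mask).sum(axis=0) / mask.sum()

# Toy sequence: 3 real tokens plus 1 padding token, 3-dim hidden states.
hidden = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [9.0, 9.0, 9.0]])  # padding row; both methods ignore it
mask = np.array([1.0, 1.0, 1.0, 0.0])

print(last_token_embedding(hidden, mask))   # sees only the third token
print(mean_pooled_embedding(hidden, mask))  # averages all three tokens
```

Notice that the last-token summary throws away the first two tokens entirely, while the mean-pooled summary blends all of them, which is exactly the "whole novel vs. final period" difference from the analogy.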
2. The Training: "Climbing the Mountain" (Progressive Transition)
Training a giant AI model is like teaching a child to read. You don't start them with Shakespeare; you start with "The Cat in the Hat."
- The Old Way: Some previous methods tried to throw the model into the deep end immediately, asking it to solve complex, mixed-media puzzles right away. The model got confused and gave up.
- The U-MARVEL Way: They used a "Progressive Transition" strategy.
- Step 1: First, they taught the model to match simple text-to-text (like matching a question to an answer).
- Step 2: Next, they added images, teaching it to match text to pictures.
- Step 3: Finally, they gave it the hardest tasks: complex instructions like "Find an image that looks like this, but make the sky blue."
- The Analogy: It's like a video game where you unlock levels one by one. You master the basics before facing the boss fight. This ensured the model didn't get overwhelmed.
3. The "Hard Mode" Practice (Hard Negative Mining)
In retrieval, a "negative" is a wrong answer. A "hard negative" is a wrong answer that looks very similar to the right one.
- The Problem: If you ask, "Find a red apple," and the model sees a red ball, a red car, and a red apple, it needs to learn the difference between the ball and the apple. If the training data is too easy (e.g., finding an apple among a pile of bananas), the model gets lazy and doesn't learn the fine details.
- The U-MARVEL Way: They specifically fed the model "tricky" wrong answers.
- The Analogy: It's like a coach who doesn't just let the player practice against a slow opponent. They bring in a sparring partner who is almost as good as the player, forcing them to sharpen their skills.
- The Twist: They also realized that sometimes the "tricky" answers were actually too tricky (false negatives). So, they built a filter to remove the "unfair" trick questions, ensuring the model learned from the right kind of challenges.
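Here is a toy sketch of how hard-negative mining with a false-negative filter might look. The function, the 0.95 similarity threshold, and the toy vectors are all illustrative assumptions for this example, not the paper's actual method or values:

```python
import numpy as np

def mine_hard_negatives(query_vec, candidate_vecs, positive_idx,
                        top_k=2, false_neg_threshold=0.95):
    """Pick the wrong answers that look most like the query, but drop
    candidates so similar that they are probably correct answers
    mislabeled as negatives (false negatives)."""
    # Cosine similarity of the query to every candidate.
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q
    hard = []
    for idx in np.argsort(-sims):          # most similar first
        if idx == positive_idx:
            continue                       # skip the true answer
        if sims[idx] >= false_neg_threshold:
            continue                       # too similar: likely a false negative
        hard.append(int(idx))
        if len(hard) == top_k:
            break
    return hard

query = np.array([1.0, 0.0])
cands = np.array([[1.0, 0.0],    # 0: the labeled positive ("red apple")
                  [0.99, 0.1],   # 1: nearly identical -> filtered as unfair
                  [0.8, 0.6],    # 2: a genuinely hard negative ("red ball")
                  [0.0, 1.0]])   # 3: an easy negative ("banana")
print(mine_hard_negatives(query, cands, positive_idx=0, top_k=1))
```

Candidate 2 is the "sparring partner" the coach wants: similar enough to force the model to learn fine details, but not so similar that it is really a second correct answer in disguise.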
4. The "Two-Step" Dance vs. The "One-Step" Leap (Distillation)
Usually, to find the perfect answer, systems use a two-step process:
- Recall: Quickly scan the whole library to find 100 potential matches. (Fast, but maybe not perfect).
- Rerank: Take those 100 and look at them very closely to pick the top 1. (Slow, but very accurate).
- The Problem: Doing both steps takes a long time and uses a lot of computer power.
- The U-MARVEL Way: They used a technique called Distillation.
- The Analogy: Imagine a master chef (the Reranker) who tastes 100 dishes to pick the best one. U-MARVEL is a student chef who watches the master chef taste them and learns how to taste. Eventually, the student chef becomes so good that they can pick the best dish immediately, without needing the master to taste it first.
- The Result: U-MARVEL combines the speed of the "Recall" step with the accuracy of the "Rerank" step into a single, lightning-fast model.
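Both the two-step dance and the distillation idea can be sketched together. A standard way to implement this kind of teacher-student training is a KL-divergence loss between score distributions; the code below is an illustrative sketch of that general recipe, not the paper's exact pipeline or objective:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# --- The "two-step dance" ---
def recall_stage(query_vec, doc_vecs, k=3):
    """Fast scan: dot-product the query against every document, keep top-k."""
    scores = doc_vecs @ query_vec
    return np.argsort(-scores)[:k]

def rerank_stage(shortlist, teacher_scores):
    """Slow but accurate: the 'master chef' (reranker) scores the shortlist."""
    return shortlist[int(np.argmax(teacher_scores))]

# --- Distillation: teach the fast model to mimic the slow one ---
def distillation_loss(student_scores, teacher_scores):
    """KL divergence pulling the retriever's score distribution over
    candidates toward the reranker's distribution."""
    p = softmax(teacher_scores)   # master chef's verdict
    q = softmax(student_scores)   # student chef's guess
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy library: 4 one-hot "documents".
doc_vecs = np.eye(4)
query = np.array([0.1, 0.9, 0.2, 0.0])

shortlist = recall_stage(query, doc_vecs, k=3)   # fast step: docs 1, 2, 0
teacher = np.array([8.0, 2.0, 1.0])              # reranker scores, in shortlist order
print(rerank_stage(shortlist, teacher))          # slow step picks doc 1

# A student whose scores already mimic the teacher pays a smaller loss
# than one that cannot tell the candidates apart.
aligned = np.array([7.5, 1.9, 1.1])
confused = np.array([3.0, 3.0, 3.0])
print(distillation_loss(aligned, teacher) < distillation_loss(confused, teacher))
```

Minimizing this loss during training pushes the fast dot-product scores toward the reranker's judgments, which is how the single-step model ends up picking the best dish without the master chef tasting anything at inference time.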
Why Does This Matter?
The result is a system that is:
- Smarter: It understands complex instructions and mixed media (text + image + video).
- Faster: It doesn't need to run two separate processes to find an answer.
- More General: It works well even on tasks it has never seen before (Zero-Shot), like finding a specific video clip just by describing the action, even if it was only trained on images.
In short, U-MARVEL is the ultimate librarian who learned to read the whole book, practiced on the hardest riddles, and learned to pick the perfect answer instantly, making it the new champion of the search world.