Imagine you have a super-smart robot assistant named CREM. This robot is a master at two very different jobs:
- The Librarian: It can look at a picture and a sentence, instantly find the perfect match in a library of millions, and say, "Here is the book you need!" (This is Retrieval).
- The Storyteller: It can look at a picture and write a beautiful, detailed story about what's happening in it (This is Generation).
The Problem: The "Split Personality" Crisis
Before CREM, AI models were like people with split personalities.
- If you trained a model to be a great Librarian, it became excellent at finding things but forgot how to tell stories. It became a "dumb" search engine that couldn't chat.
- If you trained a model to be a great Storyteller, it could write amazing stories but was terrible at finding specific items in a huge database. It was too chatty and unfocused for search.
Scientists tried to fix this by forcing the model to do both at once, but it was like asking a chef to also be a mechanic. The chef got confused, and neither job was done well. The model would lose its "generative" magic just to get better at "retrieval."
The Solution: The "Chorus" and the "Compression" Trick
The authors of this paper realized that both jobs actually rely on the same brain power: understanding the connection between images and words.
They created CREM (Compression-driven Representation Enhanced Model) using a clever two-part strategy:
1. The "Chorus" (The Magic Summarizer)
Imagine you are listening to a choir. Instead of remembering every single note sung by 100 singers, you just remember the Chorus—the catchy, condensed part that holds the main melody.
- Old Way: The robot tried to remember every single pixel of an image and every single word of a sentence. This was too much data, making it slow and confused.
- CREM's Way: The robot creates a special set of "Chorus Tokens." It looks at the whole image and the whole text, then compresses all that information into just 16 tiny, super-smart "Chorus Tokens."
- Think of these tokens as a highly compressed zip file of the combined meaning of the image and the text.
- When the robot needs to search (Retrieval), it just looks at the "Chorus."
- When the robot needs to tell a story (Generation), it uses the "Chorus" as a cheat sheet to remember what the image looked like, so it doesn't have to re-read the whole thing.
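The idea behind these "Chorus Tokens" can be sketched in code. The snippet below is not the paper's actual architecture; it is a minimal, made-up numpy illustration of the general mechanism: a small fixed set of learnable query vectors attends over all image-patch and text-token features and pools them into 16 summary vectors, no matter how long the input is.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_to_chorus(image_feats, text_feats, queries):
    """Pool variable-length image and text features into a fixed set of
    'chorus' tokens via cross-attention (illustrative sketch only).

    image_feats: (num_patches, d)
    text_feats:  (num_words, d)
    queries:     (16, d) learnable query vectors -- the chorus slots
    returns:     (16, d) summary, same size regardless of input length
    """
    feats = np.concatenate([image_feats, text_feats], axis=0)  # (N, d)
    scores = queries @ feats.T / np.sqrt(queries.shape[1])     # (16, N)
    weights = softmax(scores, axis=-1)  # each query attends over all inputs
    return weights @ feats              # (16, d)

# Toy sizes (assumptions, not the paper's): 576 image patches, 30 text tokens.
rng = np.random.default_rng(0)
d = 64
chorus = compress_to_chorus(rng.normal(size=(576, d)),
                            rng.normal(size=(30, d)),
                            rng.normal(size=(16, d)))
print(chorus.shape)  # (16, 64) -- 606 input tokens shrink to 16
```

The key design point is that the output size is fixed by the queries, not the input: whether the image has 100 patches or 1,000, the "Chorus" is always 16 tokens.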
2. The "Compression-Aware" Training
To teach the robot this new way of thinking, they used a special training method:
- They told the robot: "Sometimes, I want you to find a match using only the 'Chorus' summary. Other times, I want you to write a story using that same summary."
- By forcing the robot to do both tasks using the same compressed summary, it learned that the "Chorus" must contain everything important. It couldn't just be a vague summary; it had to be a perfect, dense representation of the truth.
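In training terms, "do both tasks from the same summary" usually means two losses sharing one representation: a contrastive loss that pulls matching image/text summaries together (Retrieval), plus a caption-prediction loss conditioned only on that same summary (Generation). The sketch below is a generic, simplified stand-in, not the paper's exact objective; the decoder is faked as given per-token probabilities.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(img_summaries, txt_summaries, temp=0.07):
    """Retrieval objective (InfoNCE-style): each image summary should be
    most similar to its own text summary within the batch."""
    sims = l2norm(img_summaries) @ l2norm(txt_summaries).T / temp  # (B, B)
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # correct pairs sit on the diagonal

def generation_loss(token_probs):
    """Generation objective (stand-in): the decoder, conditioned ONLY on the
    compressed summary, must still predict the caption tokens. Here we fake
    the decoder's per-token probabilities for the correct words."""
    return -np.mean(np.log(token_probs))

# Assumed setup: a batch of 4 paired image/text chorus summaries of dim 64.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 64))
txt = img + 0.1 * rng.normal(size=(4, 64))  # matching pairs are similar
loss = contrastive_loss(img, txt) + generation_loss(rng.uniform(0.5, 1.0, 20))
```

Because both losses pull on the same summary vectors, a vague summary hurts both tasks at once, which is exactly the pressure that makes the "Chorus" dense and precise.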
The Result: The Best of Both Worlds
The results were amazing:
- Super Search: CREM became the best at finding images and text matches (beating previous "Librarian" specialists).
- Super Storytelling: It didn't lose its ability to tell stories. In fact, because it learned to compress information so well, it could tell stories even faster.
- Memory Saver: Because it uses these tiny "Chorus Tokens" instead of the whole image, it uses way less computer memory. It's like carrying a pocket-sized map instead of a giant atlas.
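A quick back-of-the-envelope calculation shows why this saves memory. The numbers below are illustrative assumptions, not figures reported in the paper:

```python
# Hypothetical sizes, for illustration only (not the paper's numbers).
patch_tokens = 576        # e.g. a 24x24 grid of image patches
chorus_tokens = 16        # the compressed "Chorus" summary
bytes_per_token = 64 * 4  # a 64-dim float32 vector per token

full_cache = patch_tokens * bytes_per_token
chorus_cache = chorus_tokens * bytes_per_token
print(full_cache // chorus_cache)  # 36 -- 36x less memory for the image cache
```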
The Analogy in a Nutshell
Imagine you are trying to describe a movie to a friend.
- The Old Way: You try to describe every single frame, every line of dialogue, and every background detail. Your friend gets bored, and you forget the plot.
- The CREM Way: You create a 30-second "Chorus" trailer that captures the entire essence of the movie.
- If your friend asks, "Did this movie have a car chase?" you check the trailer (Retrieval).
- If your friend asks, "What was the ending?" you use the trailer to recall the story and tell them (Generation).
CREM proves that you don't have to choose between being a good searcher or a good storyteller. By learning to compress information into a powerful, shared summary, you can be both at the same time.