ReCQR: Incorporating Conversational Query Rewriting to Improve Multimodal Image Retrieval

This paper introduces ReCQR, a framework that uses conversational query rewriting to transform ambiguous, multi-turn user queries into concise, semantically complete prompts. These rewrites significantly improve the accuracy of multimodal image retrieval, and the authors support the approach with a newly constructed high-quality dataset and comprehensive benchmarking.

Yuan Hu, ZhiYu Cao, PeiFeng Li, QiaoMing Zhu

Published 2026-03-31

Imagine you are trying to find a specific photo in a massive, chaotic digital photo album using a very smart but slightly literal robot librarian.

The Problem: The "That Thing" Dilemma

In a normal conversation, humans are great at being vague because we share context.

  • You: "Did you see the soccer match yesterday?"
  • Librarian: "Yes!"
  • You: "Can you send me a pic of that scene on a cloudy day?"

To you, "that scene" is obvious. But to the robot librarian, it's a mystery. It doesn't know which scene you mean. If you just type "pic of that scene on a cloudy day" into a search engine, the robot will get confused and show you random cloudy pictures, missing the soccer player you actually wanted.

Existing search tools are like librarians who only read the last thing you said, ignoring the whole conversation history. They struggle with long, messy chats and vague references like "that," "it," or "the one we just saw."

The Solution: The "Translator" (ReCQR)

The authors of this paper built a new system called ReCQR (Retrieval-Oriented Conversational Query Rewriting). Think of ReCQR as a super-smart translator sitting between you and the robot librarian.

Here is how it works:

  1. You speak naturally: You say, "Can you send me a pic of that scene on a cloudy day?"
  2. The Translator listens to the whole chat: It remembers you were talking about a soccer match yesterday.
  3. The Translator rewrites your request: It turns your vague sentence into a perfect, self-contained instruction for the robot: "Send me a picture of a soccer player heading the ball on a cloudy day."
  4. The Robot searches: Now, the robot understands exactly what you want and finds the perfect photo.
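The four steps above can be sketched as a small pipeline. The `rewrite_query` function below is a hypothetical stand-in for the paper's LLM-based rewriter: instead of a model, it resolves a vague reference against a hard-coded mapping, purely to show the shape of the interface (full history in, self-contained query out).

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    """Accumulates the full chat history, not just the last turn."""
    turns: list = field(default_factory=list)

    def add(self, speaker: str, text: str) -> None:
        self.turns.append((speaker, text))

def rewrite_query(conversation: Conversation, query: str, referents: dict) -> str:
    """Hypothetical stand-in for the rewriting model.

    The real system would infer `referents` from `conversation`;
    this toy version just substitutes known vague phrases.
    """
    for vague, concrete in referents.items():
        query = query.replace(vague, concrete)
    return query

chat = Conversation()
chat.add("user", "Did you see the soccer match yesterday?")
chat.add("assistant", "Yes!")

# Hard-coded for illustration; a model would derive this from the history.
referents = {"that scene": "a soccer player heading the ball"}

rewritten = rewrite_query(chat, "a pic of that scene on a cloudy day", referents)
print(rewritten)
# -> a pic of a soccer player heading the ball on a cloudy day
```

The key design point is that the rewriter, not the retriever, carries the burden of the conversation history: the search engine only ever sees one clean, self-contained query.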

Building the Training School (The Dataset)

To teach this translator how to do its job, the researchers had to create a massive "training school." They couldn't just ask people to write these sentences because it takes too long. Instead, they used AI (Large Language Models) to do the heavy lifting:

  • The Factory: They took thousands of images (like pictures of kitchens or soccer fields) and used AI to imagine a fake conversation about them.
  • The Editor: They used a "Judge AI" to look at the fake conversations and say, "This rewrite is good," or "This one is confusing, throw it out."
  • The Human Touch: Finally, real humans reviewed the best ones to make sure they sounded natural and accurate.
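The factory-then-editor workflow above is a generate-and-filter pipeline. The sketch below shows its control flow; `generate_conversation` and `judge_quality` are stubbed stand-ins for the two LLM roles (the paper's actual prompts and models are not described in this summary).

```python
def generate_conversation(image_caption: str) -> dict:
    """Stand-in for the 'factory' LLM that imagines a dialogue and rewrite."""
    return {
        "dialogue": [f"Did you see the {image_caption}?", "Yes!"],
        "query": "send me a pic of that",
        "rewrite": f"send me a pic of the {image_caption}",
    }

def judge_quality(example: dict) -> bool:
    """Stand-in for the 'judge' LLM: reject rewrites that stay vague."""
    vague_words = ("that", "it", "the one")
    rewrite = f" {example['rewrite']} "
    return not any(f" {w} " in rewrite for w in vague_words)

captions = ["red bicycle with the broken wheel", "soccer field"]
dataset = []
for caption in captions:
    example = generate_conversation(caption)
    if judge_quality(example):      # the "editor" filter
        dataset.append(example)     # survivors go on to human review

print(len(dataset))
```

In the real pipeline both roles are LLM calls and the filter catches subtler failures than leftover pronouns, but the structure (generate cheaply, filter automatically, review manually) is the same.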

The result is a dataset called ReCQR, containing 7,000 examples of messy, real-world conversations paired with their "perfectly rewritten" versions. It's like a textbook teaching the AI how to turn "that thing" into "the red bicycle with the broken wheel."

The Results: Does it Work?

The researchers tested this system against the best search tools currently available.

  • Without the Translator: When users asked vague questions, the search engine almost always failed (roughly a 3% success rate).
  • With the Translator: The success rate jumped significantly, reaching nearly 20-30% in the best cases.
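Success rates like these are typically measured with Recall@K: did the target image appear in the top K retrieved results? The exact metric isn't specified in this summary, so the function below is a generic Recall@K sketch on toy data, not the paper's evaluation code.

```python
def recall_at_k(ranked_ids: list, target_id: str, k: int) -> int:
    """1 if the target image is among the top-k retrieved ids, else 0."""
    return int(target_id in ranked_ids[:k])

def mean_recall_at_k(queries: list, k: int) -> float:
    """Average Recall@k over (ranked_ids, target_id) pairs."""
    hits = [recall_at_k(ranked, target, k) for ranked, target in queries]
    return sum(hits) / len(hits)

# Toy rankings: rewriting the query pushes the target image to the top.
vague = [(["img9", "img4", "img1"], "img1"),
         (["img7", "img8", "img2"], "img2")]
rewritten = [(["img1", "img4", "img9"], "img1"),
             (["img2", "img7", "img8"], "img2")]

print(mean_recall_at_k(vague, 1))      # -> 0.0
print(mean_recall_at_k(rewritten, 1))  # -> 1.0
```

The toy numbers are exaggerated for clarity; the point is only that the same retriever scores far better once the query it receives is self-contained.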

The Big Surprise:
They found that while the AI is great at rewriting text-only conversations, it gets a little confused when the conversation involves multiple images (e.g., "Show me the kitchen, but make it look like that living room"). However, even with this difficulty, the system still performed much better than trying to search without rewriting.

The Takeaway

This paper shows that to make image search truly conversational, we can't just feed the search engine the last sentence you typed. We need a "middleman" that understands the whole story, fixes the vague references, and translates your human thoughts into a clear command the computer can understand.

It's the difference between shouting "Find that thing!" at a confused robot and politely saying, "Please find the blue vase on the shelf we discussed earlier." The robot does a much better job with the second approach.
