ReCQR: Incorporating Conversational Query Rewriting to Improve Multimodal Image Retrieval

This paper introduces ReCQR, a framework that uses conversational query rewriting to transform ambiguous, multi-turn user queries into concise, semantically complete prompts. These rewrites significantly improve the accuracy of multimodal image retrieval, and the authors support the approach with a newly constructed high-quality dataset and comprehensive benchmarking.

Yuan Hu, ZhiYu Cao, PeiFeng Li, QiaoMing Zhu

Published 2026-03-31

Imagine you are trying to find a specific photo in a massive, chaotic digital photo album using a very smart but slightly literal robot librarian.

The Problem: The "That Thing" Dilemma

In a normal conversation, humans are great at being vague because we share context.

  • You: "Did you see the soccer match yesterday?"
  • Librarian: "Yes!"
  • You: "Can you send me a pic of that scene on a cloudy day?"

To you, "that scene" is obvious. But to the robot librarian, it's a mystery. It doesn't know which scene you mean. If you just type "pic of that scene on a cloudy day" into a search engine, the robot will get confused and show you random cloudy pictures, missing the soccer player you actually wanted.

Existing search tools are like librarians who only read the last thing you said, ignoring the whole conversation history. They struggle with long, messy chats and vague references like "that," "it," or "the one we just saw."

The Solution: The "Translator" (ReCQR)

The authors of this paper built a new system called ReCQR (Retrieval-Oriented Conversational Query Rewriting). Think of ReCQR as a super-smart translator sitting between you and the robot librarian.

Here is how it works:

  1. You speak naturally: You say, "Can you send me a pic of that scene on a cloudy day?"
  2. The Translator listens to the whole chat: It remembers you were talking about a soccer match yesterday.
  3. The Translator rewrites your request: It turns your vague sentence into a perfect, self-contained instruction for the robot: "Send me a picture of a soccer player heading the ball on a cloudy day."
  4. The Robot searches: Now, the robot understands exactly what you want and finds the perfect photo.
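The four steps above can be sketched as a small pipeline. The `rewrite_query` function below is a hypothetical stand-in for the paper's LLM-based rewriter: instead of a model, it resolves a vague reference against a hard-coded mapping, purely to show the shape of the interface (full history in, self-contained query out).

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    """Accumulates the full chat history, not just the last turn."""
    turns: list = field(default_factory=list)

    def add(self, speaker: str, text: str) -> None:
        self.turns.append((speaker, text))

def rewrite_query(conversation: Conversation, query: str, referents: dict) -> str:
    """Hypothetical stand-in for the rewriting model.

    The real system would infer `referents` from `conversation`;
    this toy version just substitutes known vague phrases.
    """
    for vague, concrete in referents.items():
        query = query.replace(vague, concrete)
    return query

chat = Conversation()
chat.add("user", "Did you see the soccer match yesterday?")
chat.add("assistant", "Yes!")

# Hard-coded for illustration; a model would derive this from the history.
referents = {"that scene": "a soccer player heading the ball"}

rewritten = rewrite_query(chat, "a pic of that scene on a cloudy day", referents)
print(rewritten)
# -> a pic of a soccer player heading the ball on a cloudy day
```

The key design point is that the rewriter, not the retriever, carries the burden of the conversation history: the search engine only ever sees one clean, self-contained query.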

Building the Training School (The Dataset)

To teach this translator how to do its job, the researchers had to create a massive "training school." They couldn't just ask people to write these sentences because it takes too long. Instead, they used AI (Large Language Models) to do the heavy lifting:

  • The Factory: They took thousands of images (like pictures of kitchens or soccer fields) and used AI to imagine a fake conversation about them.
  • The Editor: They used a "Judge AI" to look at the fake conversations and say, "This rewrite is good," or "This one is confusing, throw it out."
  • The Human Touch: Finally, real humans reviewed the best ones to make sure they sounded natural and accurate.
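The factory-then-editor workflow above is a generate-and-filter pipeline. The sketch below shows its control flow; `generate_conversation` and `judge_quality` are stubbed stand-ins for the two LLM roles (the paper's actual prompts and models are not described in this summary).

```python
def generate_conversation(image_caption: str) -> dict:
    """Stand-in for the 'factory' LLM that imagines a dialogue and rewrite."""
    return {
        "dialogue": [f"Did you see the {image_caption}?", "Yes!"],
        "query": "send me a pic of that",
        "rewrite": f"send me a pic of the {image_caption}",
    }

def judge_quality(example: dict) -> bool:
    """Stand-in for the 'judge' LLM: reject rewrites that stay vague."""
    vague_words = ("that", "it", "the one")
    rewrite = f" {example['rewrite']} "
    return not any(f" {w} " in rewrite for w in vague_words)

captions = ["red bicycle with the broken wheel", "soccer field"]
dataset = []
for caption in captions:
    example = generate_conversation(caption)
    if judge_quality(example):      # the "editor" filter
        dataset.append(example)     # survivors go on to human review

print(len(dataset))
```

In the real pipeline both roles are LLM calls and the filter catches subtler failures than leftover pronouns, but the structure (generate cheaply, filter automatically, review manually) is the same.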

The result is a dataset called ReCQR, containing 7,000 examples of messy, real-world conversations paired with their "perfectly rewritten" versions. It's like a textbook teaching the AI how to turn "that thing" into "the red bicycle with the broken wheel."

The Results: Does it Work?

The researchers tested this system against the best search tools currently available.

  • Without the Translator: When users asked vague questions, the search engine almost always failed (roughly a 3% success rate).
  • With the Translator: The success rate jumped significantly, reaching nearly 20-30% in the best cases.
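Success rates like these are typically measured with Recall@K: did the target image appear in the top K retrieved results? The exact metric isn't specified in this summary, so the function below is a generic Recall@K sketch on toy data, not the paper's evaluation code.

```python
def recall_at_k(ranked_ids: list, target_id: str, k: int) -> int:
    """1 if the target image is among the top-k retrieved ids, else 0."""
    return int(target_id in ranked_ids[:k])

def mean_recall_at_k(queries: list, k: int) -> float:
    """Average Recall@k over (ranked_ids, target_id) pairs."""
    hits = [recall_at_k(ranked, target, k) for ranked, target in queries]
    return sum(hits) / len(hits)

# Toy rankings: rewriting the query pushes the target image to the top.
vague = [(["img9", "img4", "img1"], "img1"),
         (["img7", "img8", "img2"], "img2")]
rewritten = [(["img1", "img4", "img9"], "img1"),
             (["img2", "img7", "img8"], "img2")]

print(mean_recall_at_k(vague, 1))      # -> 0.0
print(mean_recall_at_k(rewritten, 1))  # -> 1.0
```

The toy numbers are exaggerated for clarity; the point is only that the same retriever scores far better once the query it receives is self-contained.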

The Big Surprise:
They found that while the AI is great at rewriting text-only conversations, it gets a little confused when the conversation involves multiple images (e.g., "Show me the kitchen, but make it look like that living room"). However, even with this difficulty, the system still performed much better than trying to search without rewriting.

The Takeaway

This paper shows that to make image search truly conversational, we can't just feed the search engine the last sentence you typed. We need a "middleman" that understands the whole story, fixes the vague references, and translates your human thoughts into a clear command the computer can understand.

It's the difference between shouting "Find that thing!" at a confused robot and politely saying, "Please find the blue vase on the shelf we discussed earlier." The robot does a much better job with the second approach.
