Seeing Through Words: Controlling Visual Retrieval Quality with Language Models

This paper proposes a quality-controllable text-to-image retrieval framework that leverages generative language models to enrich short, ambiguous user queries with contextual details and explicit quality constraints, thereby improving retrieval performance and enabling steerable, transparent results without modifying existing vision-language models.

Jianglin Lu, Simon Jenni, Kushal Kafle, Jing Shi, Handong Zhao, Yun Fu

Published 2026-02-25

Imagine you are walking into a massive, endless library of photos. You want to find a picture of a "dog."

The Problem:
You shout, "Dog!" to the librarian.
Because the library has millions of dogs, the librarian pulls out a chaotic pile: a sleeping puppy, a fierce guard dog, a dog in a costume, a blurry photo of a dog, and a high-definition, artistic portrait of a dog.
You wanted something specific (maybe a cute, high-quality photo), but because your request was so short ("dog"), the librarian had to guess. You get a mix of everything, and you have to sift through the bad ones to find the good one.

The Paper's Solution:
This paper proposes a new way to talk to the librarian. Instead of just shouting "Dog," you use a smart assistant (an AI language model) to help you finish your sentence before you ask.

Here is how it works, broken down with simple analogies:

1. The "Magic Translator" (Query Completion)

Think of your short query ("dog") as a rough sketch. The system uses a generative language model (like a very smart writer) to turn that sketch into a detailed description of a finished painting.

  • You say: "Dog."
  • The AI thinks: "Okay, the user wants a dog. But what kind of dog? And how good should the picture look?"
  • The AI expands your request to: "A golden retriever puppy playing in a sunny park, with soft lighting and a professional photography style."
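
The expansion step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the prompt wording, the `complete_query` helper, and the `fake_generate` stand-in are all assumptions made so the example runs without a real language model.

```python
from typing import Callable

# Hypothetical prompt wording; the paper's actual prompt is not given in this summary.
COMPLETION_PROMPT = (
    "Expand the short image-search query below into one detailed sentence "
    "describing the subject, setting, and visual style.\n"
    "Query: {query}\n"
    "Expanded query:"
)


def complete_query(query: str, generate: Callable[[str], str]) -> str:
    """Expand a short query using any text-generation callable."""
    cleaned = query.strip()
    prompt = COMPLETION_PROMPT.format(query=cleaned)
    completed = generate(prompt).strip()
    # Fall back to the original query if the model returns nothing.
    return completed or cleaned


def fake_generate(prompt: str) -> str:
    """Stand-in for a real language model, used only so the sketch runs."""
    return " A golden retriever puppy playing in a sunny park, with soft lighting. "


expanded = complete_query("dog", fake_generate)
print(expanded)
```

Passing the model in as a plain callable keeps the sketch independent of any particular language-model library; in practice `fake_generate` would be replaced by a call to whichever model is available.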

2. The "Quality Dial" (Controllability)

This is the paper's secret sauce. Ordinary query expansion just makes the description longer. This system adds a dial that lets you choose the quality of the images the expanded description asks for.

Imagine the AI has a menu of options for you to pick from:

  • Setting A (Low Quality): The AI expands "dog" to: "A blurry, grainy photo of a dog seen through a fence." (Maybe you just want a quick, rough idea).
  • Setting B (Medium Quality): The AI expands it to: "A dog running in a park, taken with a standard camera."
  • Setting C (High Quality): The AI expands it to: "A stunning, high-definition portrait of a dog with golden hour lighting, sharp focus, and artistic composition."

You can tell the system, "I want Setting C," and it will rewrite your request specifically to find those high-quality images.
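
One way to picture the dial is as a quality setting that changes the instruction given to the language model. The sketch below is an assumption-laden illustration: the three levels mirror Settings A to C above, but the `Quality` enum, the descriptor strings, and the prompt wording are hypothetical and not taken from the paper.

```python
from enum import Enum


class Quality(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Illustrative descriptors borrowed from the example expansions in this summary.
QUALITY_HINTS = {
    Quality.LOW: "blurry, grainy",
    Quality.MEDIUM: "taken with a standard camera",
    Quality.HIGH: "high-definition, sharp focus, artistic composition",
}


def build_prompt(query: str, quality: Quality) -> str:
    """Build a quality-conditioned instruction for a language model (hypothetical wording)."""
    hint = QUALITY_HINTS[quality]
    return (
        f"Expand the image-search query '{query}' into one detailed sentence. "
        f"The described image should look: {hint}."
    )


prompt = build_prompt("dog", Quality.HIGH)
print(prompt)
```

Changing only the `quality` argument changes the instruction, which is the sense in which the result is steerable: the same short query can be rewritten toward different quality levels.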

3. Why This is Better Than Just Filtering

You might ask, "Why not just search for 'dog' and then throw away the bad pictures?"

  • The Old Way (Filtering): Imagine the librarian brings you 1,000 photos of dogs. You have to look at all 1,000 to find the 5 good ones. It's slow and frustrating.
  • The New Way (Quality Control): The librarian only brings you the 5 best photos because you told them exactly what you wanted before they started looking.
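
A toy example can show why a more specific query changes what comes back first. Everything here is a simplification for illustration: the captions are made up, and a word-overlap (Jaccard) score stands in for the vision-language model a real system would use to compare text with images.

```python
def tokens(text: str) -> set:
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def score(query: str, caption: str) -> float:
    """Jaccard overlap between query words and caption words (toy stand-in)."""
    q, c = tokens(query), tokens(caption)
    return len(q & c) / len(q | c)


def retrieve(query: str, captions: list, k: int = 1) -> list:
    """Return the k captions that score highest for the query."""
    return sorted(captions, key=lambda c: score(query, c), reverse=True)[:k]


# Made-up captions standing in for images in the library.
captions = [
    "blurry photo of a dog behind a fence",
    "sharp high-definition portrait of a dog in golden light",
    "dog in a costume",
]

short_query = "dog"
expanded_query = "sharp high-definition portrait of a dog"

print(retrieve(short_query, captions))     # the short caption wins on overlap alone
print(retrieve(expanded_query, captions))  # the portrait caption now ranks first
```

With the one-word query, the ranking is decided by incidental factors such as caption length. Once the query names the desired quality, the matching item rises to the top without a separate filtering pass.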

The Three Superpowers of This System

The authors highlight three main benefits:

  1. Flexibility (The Universal Adapter): It works with existing vision-language search models without modifying them. You don't need to rebuild the whole library; you just add this smart "translator" in front of it.
  2. Transparency (The Honest Assistant): The AI doesn't just magically change the results. It shows you the new sentence it wrote ("A golden retriever..."). You can read it, understand it, and say, "Oh, I didn't want a golden retriever, I wanted a poodle," and change it.
  3. Controllability (The Volume Knob): You aren't stuck with whatever the computer thinks is "best." You can turn the knob to "High Aesthetics" for art projects or "High Relevance" if you just need to find a specific object quickly.
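
The flexibility and transparency points can be sketched as a thin wrapper that sits in front of an unchanged search function and returns the rewritten query alongside the results. The class and function names below are hypothetical, and both `fake_rewrite` and `fake_search` are stand-ins for a real language model and a real retrieval backend.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetrievalResult:
    original_query: str
    rewritten_query: str  # exposed so the user can read or edit it
    images: List[str]


class QueryRewritingRetriever:
    """Wraps any search callable; the search backend itself is not modified."""

    def __init__(
        self,
        rewrite: Callable[[str], str],
        search: Callable[[str, int], List[str]],
    ) -> None:
        self.rewrite = rewrite
        self.search = search

    def __call__(self, query: str, k: int = 5) -> RetrievalResult:
        rewritten = self.rewrite(query)
        return RetrievalResult(query, rewritten, self.search(rewritten, k))


def fake_rewrite(query: str) -> str:
    """Stand-in for the language-model rewriting step."""
    return f"A sharp, well-lit photo of a {query}"


def fake_search(query: str, k: int) -> List[str]:
    """Stand-in for an existing text-to-image search backend."""
    return [f"image_{i}.jpg" for i in range(k)]


retriever = QueryRewritingRetriever(fake_rewrite, fake_search)
result = retriever("dog", k=2)
print(result.rewritten_query)
print(result.images)
```

Because the rewritten query is part of the returned result, a user who disagrees with it can edit the text and call the search backend again directly.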

The Bottom Line

This paper solves the problem of "short, vague searches" by teaching the computer to ask better questions for you.

Instead of leaving you to describe exactly what you want, the AI acts like a creative writing partner. It takes your one-word idea and expands it into a detailed instruction that steers the search toward photos of the quality you are looking for. It bridges the gap between your simple thought and the computer's complex database.
