Imagine you are shopping for a specific outfit. You find a photo of a pink short-sleeved shirt (the reference image), but you don't want that exact one. You want a blue one with white letters on it. You type this into a search bar: "Change the shirt to blue and add white letters."
This is the job of Composed Image Retrieval (CIR). It's like a smart assistant that takes a picture and a text instruction to find a new, slightly different picture.
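In code, the CIR setup sketched above can be pictured as: embed the reference image and the text instruction, fuse them into a single query vector, and rank candidate images by similarity. This is a minimal illustrative sketch, not the paper's actual architecture; in particular, the addition-based fusion and the random embeddings below are stand-ins for what real models learn.

```python
import numpy as np

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarity.
    return v / np.linalg.norm(v)

def compose_query(image_emb, text_emb):
    # Hypothetical fusion: simple addition of the two embeddings.
    # Real CIR models learn this fusion step; addition is just a stand-in.
    return l2_normalize(image_emb + text_emb)

def rank_candidates(query, candidates):
    # Score every candidate image embedding by cosine similarity and
    # return indices sorted from best match to worst.
    scores = candidates @ query
    return np.argsort(-scores)

rng = np.random.default_rng(0)
image_emb = l2_normalize(rng.normal(size=64))  # e.g. the pink shirt photo
text_emb = l2_normalize(rng.normal(size=64))   # e.g. "make it blue, add letters"
candidates = np.stack([l2_normalize(rng.normal(size=64)) for _ in range(5)])

query = compose_query(image_emb, text_emb)
print(rank_candidates(query, candidates))
```

The key point is that a single fused vector has to carry both "what the picture shows" and "what the text wants changed" — which is exactly where the problems described next come from.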
However, current assistants are a bit clumsy. They often get confused or throw away good options because they are too strict. This paper, DQE-CIR, introduces a smarter assistant that tackles two main problems: throwing away "almost right" answers, and mixing up similar requests.
1. The Problem: The "All-or-Nothing" Search
Imagine you ask the assistant for a "blue shirt."
- The Old Way: The assistant looks at a "green shirt" and thinks, "That's not blue! It's a failure!" So, it pushes the green shirt to the very bottom of the list, even though it's still a nice shirt that matches most of your request. This is called Relevance Suppression. It throws away good candidates just because they aren't perfect.
- The Confusion: If you ask for a "blue shirt" and then later ask for a "red shirt," the old assistant might put both answers in the same messy pile in its brain. It can't tell the difference between "blue" and "red" very well. This is Semantic Confusion.
2. The Solution: DQE-CIR (The Smart Assistant)
The authors propose a new method called DQE-CIR. Think of it as upgrading the assistant with two superpowers:
Superpower A: The "Highlighter Pen" (Learnable Attribute Weights)
When you say, "Make it blue," the old assistant treats every part of the shirt equally. The new assistant uses a Learnable Attribute Weight.
- Analogy: Imagine you are a teacher grading a test. The old teacher gives the same weight to every question. The new teacher uses a highlighter pen. When you say "blue," the teacher highlights the word "blue" in your request and says, "Okay, I will pay extra attention to the color part of this shirt and ignore the rest for a moment."
- Result: The assistant knows exactly which part of the image to change (the color) and which part to keep (the shape), making the search much more precise.
Superpower B: The "Goldilocks" Filter (Target Relative Negative Sampling)
In machine learning, the computer learns by looking at "wrong" answers (negatives) to understand what not to pick.
- The Old Way: The computer looks at a "red car" when you want a "blue car." It thinks, "Haha, that's totally wrong!" (Too easy). Then it looks at a "blue truck" and thinks, "That's also totally wrong!" (Too confusing, because it's blue but the wrong shape). It treats both as equally bad, which confuses the learning process.
- The New Way (Goldilocks): The new assistant only picks the "Just Right" wrong answers.
- It ignores the "red car" (too easy).
- It ignores the "blue truck" (too confusing/false negative).
- It picks a "blue sedan" that is very similar to your target but has the wrong wheels. This is the Mid-Zone.
- Analogy: Imagine you are training a dog to fetch a ball.
- If you throw a rock, the dog knows "That's not a ball" (Too easy).
- If you throw a ball that looks almost identical to the target ball, the dog has to think hard: "Is this the one? No, the color is slightly off."
- DQE-CIR only uses these "hard but fair" examples to train the model. This stops the model from getting confused and helps it learn the subtle differences.
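The mid-zone idea above can be sketched as a simple similarity filter: score every candidate negative against the target and keep only those that are neither trivially far away (too easy) nor suspiciously close (likely a false negative). The thresholds below are made-up illustrative values, not the paper's.

```python
import numpy as np

def mid_zone_negatives(target, candidates, low=0.3, high=0.9):
    # Cosine similarity of each candidate embedding to the target embedding.
    t = target / np.linalg.norm(target)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ t
    # Keep the "Goldilocks" band: hard enough to be informative (> low),
    # but not so close to the target that it is probably a false
    # negative (< high). Thresholds are illustrative assumptions.
    keep = (sims > low) & (sims < high)
    return np.flatnonzero(keep)

target = np.array([1.0, 0.0])
candidates = np.array([
    [1.0, 0.0],  # near-duplicate of the target: likely a false negative
    [0.0, 1.0],  # orthogonal: too easy, teaches nothing
    [0.6, 0.8],  # mid-zone: similar but clearly different
])
print(mid_zone_negatives(target, candidates))  # -> [2]
```

Only the mid-zone candidate survives the filter, so training contrasts the target against "hard but fair" examples rather than trivial or misleading ones.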
3. The Result: A Sharper Search
By combining the Highlighter Pen (focusing on specific details like color) and the Goldilocks Filter (learning from the right kind of mistakes), the new system:
- Doesn't throw away good options: It keeps shirts that are "close enough" in the top results.
- Tells the difference: It can clearly separate a "short-sleeved blue shirt" from a "long-sleeved blue shirt."
- Works better in real life: Whether you are looking for fashion items or specific scenes in a video, it finds exactly what you asked for, even if the changes are tiny.
Summary
Think of DQE-CIR as upgrading a search engine from a clumsy librarian who throws away books that are "almost right" and gets confused by similar titles, to a sharp-eyed expert who knows exactly which details matter and learns by studying the "almost right" examples that teach the most. The result? You get the blue shirt with white letters far more reliably.