Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning

Imagine you are shopping online for a dress. You find one you like, but you want to make a few changes: "I want this dress, but in blue, with no stripes, and made of linen."

In the past, asking a computer to find that exact item was like trying to explain a complex dream to a friend who only speaks in vague summaries. If you said, "Blue dress, no stripes," the computer might get confused. It might forget the original shape of the dress, or it might show you ten near-identical blue dresses, leaving you with no real choice.

This paper introduces Pix2Key, a new way to talk to image-search computers that is smarter, more precise, and gives you more variety. Here is how it works, using some everyday analogies.

1. The Problem: The "Blurry Summary" vs. The "Detailed Checklist"

Older systems tried to understand your request by turning your reference image and your text into a single, long sentence (a "caption").

The Analogy: Imagine you are describing a crime scene to a detective. Instead of giving a list of specific details (a red hat, a broken watch, a muddy shoe), you just say, "It was a messy scene with some red and broken things." The detective (the computer) has to guess what matters. They might miss the red hat because they focused on the mud.

Pix2Key takes a different approach. Instead of a blurry summary, it turns every image into a structured visual dictionary, much like a detailed checklist or a recipe card.

The Analogy: Now, when you look at the dress, the computer doesn't just "see" a dress. It reads a card that says:
- Color: Rose-pink
- Pattern: Stripes
- Material: Silk
- Neckline: Halter

When you say, "I want it blue, no stripes," the computer doesn't guess. It crosses out "Stripes" and replaces "Rose-pink" with "Blue" on the checklist. It knows exactly what to keep, what to change, and what to ignore.

2. The Magic Ingredient: The "Self-Taught Art Critic" (V-Dict-AE)

One of the paper's biggest innovations is a component called V-Dict-AE. This is a special training method that helps the computer get better at reading those checklists without needing a human teacher to correct it every time.

The Analogy: Imagine a student artist who wants to learn how to describe paintings perfectly. Instead of having a teacher grade them, the student is given a painting, asked to write a description, and then asked to re-draw the painting based only on that description.
- If the student forgets to mention the "blue sky" in their description, they can't draw the blue sky when they try to recreate the image.
- This forces the student to pay attention to the tiny, important details (like the specific shade of blue or the texture of the clouds) just to get the drawing right.

Pix2Key uses this "re-drawing" trick to teach the computer to spot fine-grained details (like "sleeve length" or "fabric texture") automatically. This means it understands your request much better, even if you haven't taught it specifically for that task before.

3. The "Diversity Filter": Avoiding the "Clone Army"

A common problem with image search is that if you find one good result, the next 10 results look exactly the same. It's like walking into a store where every rack has the exact same shirt.

Pix2Key includes a Diversity-Aware Reranking step.

The Analogy: Imagine you ask a travel agent for "a beach vacation." They could book you 10 trips to the exact same beach. Instead, Pix2Key acts like a smart agent who says, "Okay, here are three great beaches that fit your budget and style, but they all look different from each other."
It balances Relevance (Does it match your request?) with Variety (Is it different from the others?). This ensures you get a list of options that are all correct but offer different styles, colors, or cuts, giving you a better shopping experience.

4. Why This Matters

For Shoppers: You can find exactly what you want without sifting through hundreds of identical items or getting results that ignore your specific requests (like "no stripes").
For Designers: You can quickly find variations of a layout or scene without starting from scratch.
For the Future: This system doesn't need thousands of human-labeled examples to learn. It teaches itself by looking at images and trying to reconstruct them, making it cheaper and faster to build for new types of searches.

In a Nutshell

Pix2Key is like upgrading from a fuzzy guessing game to a precise, checklist-based conversation with a computer. It breaks images down into clear facts, learns to spot tiny details on its own, and makes sure the results you get are not only correct but also interesting and varied. It turns "I want a dress like this but different" into a clear, actionable instruction that a computer can follow.

1. Problem Definition

Composed Image Retrieval (CIR) is a multimodal search task where a query consists of a reference image and a natural-language edit (e.g., "like this dress, but in blue"). The goal is to retrieve images that apply the requested change while preserving other relevant visual content.

Limitations of Existing Approaches:

Supervised Triplet Methods: Rely on expensive, large-scale datasets of (reference, edit, target) triplets. They often learn a single fused representation that implicitly decides which attributes to preserve, leading to a loss of fine-grained cues and lack of transparency.
Zero-Shot Captioning Methods: Recent approaches use Vision-Language Models (VLMs) to caption the reference image and rewrite the caption based on the edit. This creates a "lossy bottleneck" where subtle details (e.g., neckline shape, specific patterns) are lost when collapsing an image into a single sentence. Furthermore, ranking by a single fused embedding often yields homogeneous results (near-duplicates), lacking diversity.
Evaluation Gaps: Existing benchmarks often lack fine-grained attribute labels, making it difficult to evaluate how well a retrieved list satisfies specific constraints beyond just hitting a single labeled target.

2. Methodology: Pix2Key

Pix2Key addresses these issues by representing both queries and database images as Open-Vocabulary Visual Dictionaries rather than fused embeddings or single sentences. The framework consists of four main components:

A. Open-Vocabulary Visual Dictionaries

Instead of generating a free-form caption, the system extracts structured attribute facts from images.

Representation: An image $I$ is converted into a dictionary $D_{img}(I) = \{(k_m, v_m)\}$, where $k_m$ is an attribute key (e.g., "Color") and $v_m$ is its value (e.g., "Blue").
Query Decomposition: A composed query $(I_q, T)$ is processed by:
1. Extracting the reference dictionary $D_{ref}$.
2. Decomposing the edit text $T$ into signed updates $\Delta D(T) = \{(k, v, p)\}$, where the polarity $p \in \{+1, 0, -1\}$ indicates:
  - $+1$ (Add/Strengthen): Desired attributes.
  - $-1$ (Remove/Avoid): Attributes to explicitly suppress.
  - $0$ (Open Anchor): Attributes to preserve from the reference but not explicitly constrained.
Merging: The final query dictionary merges the reference and edits, where edits override conflicting reference entries, and negative entries act as explicit constraints.
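
The decomposition and merge steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `merge_query`, the tuple layout of an update, and the rule that an avoided (key, value) pair is dropped from the preserved anchors are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical layout of one signed update: (key, value, polarity in {+1, 0, -1}).
Update = Tuple[str, str, int]


def merge_query(
    ref: Dict[str, str], updates: List[Update]
) -> Tuple[Dict[str, str], Dict[str, str], List[Tuple[str, str]]]:
    """Merge a reference dictionary with signed edits.

    Returns (desired, anchors, avoid):
      desired: entries with polarity +1 (override the reference)
      anchors: entries to preserve (polarity 0 plus untouched reference entries)
      avoid:   (key, value) pairs with polarity -1
    """
    desired: Dict[str, str] = {}
    anchors: Dict[str, str] = {}
    avoid: List[Tuple[str, str]] = []

    for key, value, polarity in updates:
        if polarity == 1:
            desired[key] = value
        elif polarity == -1:
            avoid.append((key, value))
        else:
            anchors[key] = value

    # Keep reference entries unless an edit overrides the key
    # or the exact (key, value) pair is explicitly avoided.
    for key, value in ref.items():
        if key in desired or (key, value) in avoid:
            continue
        anchors.setdefault(key, value)

    return desired, anchors, avoid


# The dress example from the introduction.
ref = {"color": "rose-pink", "pattern": "stripes", "material": "silk", "neckline": "halter"}
updates = [("color", "blue", 1), ("pattern", "stripes", -1), ("material", "linen", 1)]
desired, anchors, avoid = merge_query(ref, updates)
print(desired, anchors, avoid)
```

In this sketch the edit "blue" replaces the reference color, "stripes" becomes an explicit negative constraint, and only the neckline survives as an open anchor.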

B. Text-Space Indexing and Intent-Aware Scoring

Indexing: Dictionary entries are serialized into short strings (e.g., key:value; key:value) and embedded using a frozen text encoder (OpenCLIP). This allows efficient nearest-neighbor search in a unified text space.
Relevance Scoring: The system computes three similarity scores between the query subsets ($q^{+}, q^{0}, q^{-}$) and a candidate $e_i$:
- $p_i$: Similarity to desired attributes.
- $o_i$: Similarity to open anchors (preservation).
- $n_i$: Similarity to forbidden attributes.
Final Score: A scalar relevance score is calculated as:
$R(i) = \alpha p_i + \beta o_i - (1-\alpha) n_i$
This allows explicit control over enforcing changes, preserving context, and suppressing unwanted features.
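
The indexing and scoring steps can be illustrated end to end with a small Python sketch. Several pieces are assumptions for the sake of a runnable example: `toy_embed` is a bag-of-tokens stand-in for the frozen OpenCLIP text encoder, cosine similarity is assumed as the similarity measure, keys are sorted and lowercased during serialization, and the default values of `alpha` and `beta` are arbitrary.

```python
import math
from typing import Dict, List, Sequence


def serialize(entries: Dict[str, str]) -> str:
    """Serialize as 'key:value; key:value' (sorting and lowercasing are assumptions)."""
    return "; ".join(f"{k.lower()}:{v.lower()}" for k, v in sorted(entries.items()))


def build_vocab(texts: List[str]) -> List[str]:
    """Collect the distinct 'key:value' tokens found in the serialized strings."""
    return sorted({tok for t in texts for tok in t.split("; ") if tok})


def toy_embed(text: str, vocab: List[str]) -> List[float]:
    """Stand-in for a frozen text encoder: a count vector over the vocabulary."""
    tokens = text.split("; ")
    return [float(tokens.count(tok)) for tok in vocab]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def relevance(q_pos, q_open, q_neg, e, alpha: float = 0.6, beta: float = 0.3) -> float:
    """R(i) = alpha * p_i + beta * o_i - (1 - alpha) * n_i."""
    p = cosine(q_pos, e)   # similarity to desired attributes
    o = cosine(q_open, e)  # similarity to open anchors
    n = cosine(q_neg, e)   # similarity to forbidden attributes
    return alpha * p + beta * o - (1 - alpha) * n


# Query subsets and two candidates that differ only in the avoided attribute.
q_pos = serialize({"Color": "Blue", "Material": "Linen"})
q_open = serialize({"Neckline": "Halter"})
q_neg = serialize({"Pattern": "Stripes"})
plain = serialize({"Color": "Blue", "Material": "Linen", "Neckline": "Halter", "Pattern": "Plain"})
striped = serialize({"Color": "Blue", "Material": "Linen", "Neckline": "Halter", "Pattern": "Stripes"})

vocab = build_vocab([q_pos, q_open, q_neg, plain, striped])
embed = lambda s: toy_embed(s, vocab)
score_plain = relevance(embed(q_pos), embed(q_open), embed(q_neg), embed(plain))
score_striped = relevance(embed(q_pos), embed(q_open), embed(q_neg), embed(striped))
print(score_plain > score_striped)
```

The two candidates match the desired and preserved attributes equally well, so the only difference in their scores comes from the negative term, which penalizes the striped item.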

C. Diversity-Aware Reranking

To prevent near-duplicate results, Pix2Key applies a Maximal Marginal Relevance (MMR) reranking step. It selects candidates that maximize a trade-off between relevance ($R(i)$) and distance from already selected items, controlled by a user-facing parameter $\lambda$.
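
A greedy MMR loop can be sketched as follows. This uses the textbook form of MMR, in which each step picks the candidate maximizing $\lambda R(i) - (1 - \lambda) \max_{j \in S} \text{sim}(i, j)$; whether Pix2Key uses exactly this form, and cosine similarity as the redundancy measure, is an assumption. The function name `mmr_rerank` and the default `lam` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_rerank(
    relevance: List[float],
    embeddings: List[Sequence[float]],
    k: int,
    lam: float = 0.7,
) -> List[int]:
    """Greedily pick k candidate indices, trading relevance against redundancy."""
    selected: List[int] = []
    remaining = list(range(len(relevance)))

    while remaining and len(selected) < k:
        def mmr_score(i: int) -> float:
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)

    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is different but less relevant.
scores = [0.9, 0.89, 0.5]
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(mmr_rerank(scores, vectors, k=2, lam=0.5))
print(mmr_rerank(scores, vectors, k=2, lam=1.0))
```

With `lam=1.0` the loop reduces to plain relevance ranking; lowering `lam` pushes the duplicate of an already selected item down the list.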

D. V-Dict-AE (Self-Supervised Pretraining)

To improve the quality of the extracted dictionary tokens without CIR-specific supervision, the authors introduce V-Dict-AE.

Mechanism: A lightweight autoencoder is trained to encode images into compact token sequences (slots) that can reconstruct the image via a frozen diffusion decoder.
Training: It uses only images (no text edits required). The model learns to preserve fine-grained visual evidence necessary for reconstruction.
Integration: The trained slot encoder replaces the standard patch embeddings in the VLM during inference, resulting in more accurate attribute extraction for the dictionary.
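
The training objective is not spelled out in this summary. As an illustration only, assuming a standard noise-prediction diffusion decoder, the reconstruction signal could take the form of a conditional denoising loss. The symbols here are introduced for the sketch and are not taken from the paper: $E_\phi$ is the trainable slot encoder, $\epsilon_\theta$ is the frozen decoder, and $x_t$ is the image $x$ noised with $\epsilon$ at timestep $t$.

$\mathcal{L}(\phi) = \mathbb{E}_{x, \epsilon, t} \left[ \lVert \epsilon - \epsilon_\theta(x_t, t, E_\phi(x)) \rVert_2^2 \right]$

Under this assumption only $\phi$ is updated, so the slots $E_\phi(x)$ must carry whatever fine-grained evidence the frozen decoder needs to reconstruct the image.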

3. Key Contributions

Pix2Key Framework: A training-free CIR system that uses structured visual dictionaries with explicit intent polarity ($+, -, 0$), enabling controllable and interpretable constraint matching.
Diversity-Aware Reranking: A mechanism integrated with the dictionary representation to balance strict constraint satisfaction with result variety.
V-Dict-AE: A self-supervised visual-dictionary autoencoder that refines dictionary representations using only image reconstruction, eliminating the need for costly CIR triplets.
DFMM-Compose Benchmark: A new evaluation benchmark derived from DeepFashion-MM that includes fine-grained attribute labels. It enables quantitative evaluation of Intent Satisfaction (Attribute Consistency) and List Diversity (Intra-List Diversity), addressing gaps in existing benchmarks.

4. Experimental Results

The method was evaluated on FashionIQ, CIRR, and the new DFMM-Compose benchmark.

Retrieval Accuracy:
- On FashionIQ, Pix2Key outperforms unimodal baselines and training-free caption-rewrite methods (like CIReVL).
- Pix2Key + V-Dict-AE achieves the best performance across all categories, improving Recall@10 by up to 3.2 points over baselines on FashionIQ and 2.3 points on CIRR.
Intent Consistency (AC@50):
- On DFMM-Compose, Pix2Key significantly outperforms baselines in Attribute Consistency (AC@50), demonstrating that the dictionary approach better captures fine-grained edits than single-sentence captions.
Diversity (ILD@50):
- Pix2Key achieves the highest Intra-List Diversity (ILD@50), indicating that the dictionary representation combined with MMR reranking yields diverse results without sacrificing relevance.
Ablation Studies:
- Using negative constraints (avoiding attributes) is crucial for separating intent-violating candidates.
- V-Dict-AE pretraining significantly boosts performance, confirming that self-supervised reconstruction helps preserve fine-grained visual details.
- The system is robust to hyperparameter changes (learning rate, pooler depth) but benefits from higher input resolutions and LoRA adapters.
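
For readers unfamiliar with the metrics, the sketch below shows common definitions of Recall@K and Intra-List Diversity. These are assumptions: this summary does not give the exact formulas used in the paper, and ILD is taken here as the mean pairwise cosine distance among the top-K items. Attribute Consistency is omitted because its definition is not given.

```python
import math
from itertools import combinations
from typing import Hashable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall_at_k(ranked_ids: List[Hashable], target_id: Hashable, k: int) -> float:
    """1.0 if the labeled target appears in the top-k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def ild_at_k(ranked_embeddings: List[Sequence[float]], k: int) -> float:
    """Mean pairwise cosine distance (1 - cosine) among the top-k results."""
    pairs = list(combinations(ranked_embeddings[:k], 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs)


print(recall_at_k(["a", "b", "c"], "b", k=2))
print(ild_at_k([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], k=3))
```

Averaging `recall_at_k` over all queries gives the benchmark-level Recall@K; a list of exact duplicates yields an ILD of zero under this definition.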

5. Significance and Impact

Controllability: By decomposing edits into explicit "add," "avoid," and "keep" constraints, Pix2Key offers a transparent and user-controllable interface for retrieval, unlike "black box" fused embeddings.
Scalability: The approach is training-free with respect to the CIR task itself (no triplet supervision needed) and leverages self-supervised pretraining, making it scalable to new domains without expensive annotation.
Evaluation Standard: The introduction of DFMM-Compose shifts the evaluation paradigm from simple "hit rate" to a more holistic assessment of intent satisfaction and result diversity, which is critical for real-world applications like e-commerce and creative design.
Practical Application: The system is particularly well-suited for scenarios requiring precise, localized modifications (e.g., "same dress, different fabric, no stripes") where existing zero-shot methods often fail due to information loss.