IntRec: Intent-based Retrieval with Contrastive Refinement

Imagine you are playing a game of "20 Questions" with a very smart, but slightly literal, robot assistant. You want it to find a specific object in a messy room, but your instructions are a bit vague.

The Problem: The "One-Shot" Robot

Most current AI systems are like a robot that only gets one guess.

You say: "Find the red umbrella."
The Robot: Looks around, sees three red umbrellas, and immediately picks one.
The Problem: If you wanted the smaller one with the floral pattern, the robot gets it wrong. It can't go back and say, "Oh, I misunderstood, let me try again." It just gives you the wrong answer and moves on.

The Solution: IntRec (The "Memory-Keeping" Robot)

The paper introduces a new system called IntRec. Think of IntRec not as a robot that guesses once, but as a detective with a whiteboard.

Here is how it works, using simple analogies:

1. The "Intent State" (The Detective's Whiteboard)

Instead of just remembering your original question, IntRec keeps a running list on a whiteboard called the Intent State. This board has two columns:

The "Yes" Column (Positive Anchors): Things you want. (e.g., "Red," "Umbrella," "Floral pattern").
The "No" Column (Negative Constraints): Things you don't want. (e.g., "Not the big one," "Not the plain one").

2. The Interaction Loop (The Game of Refinement)

Here is the step-by-step process of how IntRec solves the problem:

Round 1 (The First Guess): You say, "Find the red umbrella." IntRec looks at the room and picks the best match. Let's say it picks the big plain one.
The Feedback: You say, "No, that's the wrong one. I want the small one with flowers."
The Update (The Magic Step):
- IntRec takes the "Big Plain Umbrella" and writes it in the "No" Column. It now knows to avoid anything that looks like that.
- It takes your new clues ("Small," "Flowers") and adds them to the "Yes" Column.
Round 2 (The Correction): IntRec looks at the room again. It sees the three umbrellas.
- It looks at the big plain one: "Oh, that's in the 'No' column! I must ignore it."
- It looks at the small floral one: "That matches the 'Yes' column and doesn't match the 'No' column!"
- Result: It points to the correct umbrella.

Why is this special? (The "Contrastive" Secret)

The paper uses a fancy term called "Contrastive Refinement." In plain English, this means learning by elimination.

Imagine you are looking for a specific person in a crowded stadium.

Old AI: Points to the first person who looks somewhat like the description.
IntRec: Points to a person. You say, "No, that's not him." IntRec doesn't just forget that person; it actively penalizes that look. It effectively says, "Okay, I will now lower the score for anyone who looks like that person."

This allows the AI to distinguish between two things that look almost identical (like two similar red umbrellas) by using your "No" feedback to push the wrong one down the list and the right one up.

The Results: Fast and Accurate

The researchers tested this on large-scale benchmarks (LVIS and Objects365), plus a subset of LVIS built around scenes with multiple similar objects.

Accuracy: It got noticeably better at finding the exact object you wanted, especially when there were many confusing, similar objects around. On the ambiguous-scene benchmark, a single round of "no" feedback raised the score from 14.8 to 22.7 AP (average precision).
Speed: Each round of feedback adds less than 30 milliseconds on an NVIDIA RTX 3090, less than the time it takes to blink. It's like having a super-fast assistant who can think, "Wait, that's not it," and correct itself instantly.

The Bottom Line

IntRec changes how AI talks to us. Instead of being a rigid machine that gives one answer and stops, it becomes a collaborative partner. It listens to your corrections, remembers what you rejected, and uses that memory to find exactly what you are looking for, even in the messiest, most confusing scenes.

In short: It turns "Guess and Check" into "Learn and Refine."

1. Problem Statement

Current open-vocabulary object detectors and visual grounding models typically operate in a one-shot, stateless manner. They take a single text or visual query and return the top-scoring candidate region based on semantic similarity.

The Core Limitation: These systems struggle with ambiguity, particularly in cluttered scenes containing multiple visually similar objects (distractors). For example, a query like "the smaller red car" may yield multiple candidates with nearly identical high similarity scores.
Lack of Feedback: Existing models cannot incorporate user feedback to refine predictions. Once a wrong object is selected, the system has no mechanism to learn from the rejection or update its understanding of the user's specific intent.
Goal: To develop an interactive framework that allows users to iteratively refine retrieval results through positive confirmations and negative rejections, enabling fine-grained disambiguation without additional supervised training.

2. Methodology: The IntRec Framework

The authors propose IntRec, an interactive object retrieval framework centered around a dynamic Intent State (IS) and a Contrastive Alignment mechanism.

A. Intent State (IS) Representation

Unlike traditional models that compress intent into a single embedding vector, IntRec maintains a memory structure that evolves over interaction turns ($t$). The IS is defined as a tuple of two exemplar sets:

Positive Anchors ($Z_{pos}$): A set of embeddings representing confirmed cues (e.g., the initial query, reference images, or user-confirmed target regions).
Negative Constraints ($Z_{neg}$): A set of embeddings representing rejected hypotheses (e.g., regions explicitly marked as "not this" by the user).

The state is initialized at $t=0$ by fusing text and image embeddings from the initial prompt. As the user provides feedback, the IS is updated by adding the feature vectors of rejected or confirmed regions to their respective sets.
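The two-set structure described above can be sketched as a small container. This is a minimal illustration, not the authors' implementation: the names `IntentState`, `confirm`, `reject`, and `init_state` are hypothetical, embeddings are plain lists of floats, and because the fusion operator is not specified here, the sketch simply stores the text and image embeddings as separate positive anchors.

```python
from dataclasses import dataclass, field
from typing import List, Optional

Vector = List[float]


@dataclass
class IntentState:
    """Hypothetical container for the two exemplar sets of the Intent State."""
    z_pos: List[Vector] = field(default_factory=list)  # positive anchors
    z_neg: List[Vector] = field(default_factory=list)  # negative constraints

    def confirm(self, region: Vector) -> None:
        """Record a user-confirmed region as a positive anchor."""
        self.z_pos.append(region)

    def reject(self, region: Vector) -> None:
        """Record a user-rejected region as a negative constraint."""
        self.z_neg.append(region)


def init_state(text_emb: Vector, image_emb: Optional[Vector] = None) -> IntentState:
    """Build the t=0 state from the initial prompt.

    Assumption: instead of fusing, keep each embedding as its own anchor.
    """
    anchors = [text_emb]
    if image_emb is not None:
        anchors.append(image_emb)
    return IntentState(z_pos=anchors)
```

Keeping both sets as growing lists, rather than folding feedback into one vector, is what lets later turns still see every individual rejection.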

B. Contrastive Ranking Function

To rank candidate regions ($R = \{r_1, \dots, r_M\}$) at any turn $t$, IntRec uses a contrastive scoring function that maximizes similarity to positive anchors while penalizing similarity to negative constraints:

$S(r_j | IS_t) = \max_{z^+ \in Z_{pos}^{(t)}} \cos(r_j, z^+) - \lambda \cdot \max_{z^- \in Z_{neg}^{(t)}} \cos(r_j, z^-)$

First Term: Promotes regions that align with any positive exemplar.
Second Term: Penalizes regions that are too similar to rejected exemplars, creating "low-scoring valleys" around distractors in the embedding space.
$\lambda$: A hyperparameter controlling the weight of the negative penalty.
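The scoring function above translates almost line for line into code. The sketch below is illustrative only: `cos` and `score` are hypothetical helper names, the default `lam=0.5` is an arbitrary placeholder rather than a value from the paper, and treating an empty $Z_{neg}$ as a zero penalty is an assumption made so the function is defined at turn 0.

```python
import math


def cos(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def score(region, z_pos, z_neg, lam=0.5):
    """S(r | IS) = max cos to positives - lam * max cos to negatives."""
    best_pos = max(cos(region, z) for z in z_pos)
    # Assumption: with no rejections yet, the penalty term is zero.
    worst_neg = max((cos(region, z) for z in z_neg), default=0.0)
    return best_pos - lam * worst_neg
```

With no negatives, the score reduces to the best positive similarity; once a region is rejected, any candidate close to it in embedding space loses up to `lam` from its score.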

C. Interactive Loop

Initialization: The model generates candidate regions and scores them using the initial IS.
User Feedback: The user reviews the top-$k$ candidates.
- If a candidate is the target, the user confirms (Positive).
- If a candidate is a distractor, the user rejects it (Negative).
State Update: The feedback updates the IS (adding the rejected region to $Z_{neg}$ or the confirmed region to $Z_{pos}$).
Refinement: The model re-ranks all candidates using the updated IS and the contrastive function. This loop continues until the target is localized.
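The loop above can be simulated end to end on toy data. This is a sketch under several assumptions rather than the paper's procedure: the user is replaced by an oracle that knows the target index, only the top-1 candidate is reviewed per turn ($k = 1$), feedback is limited to rejections, and the function name `retrieve`, the 3-dimensional vectors, and `lam=0.5` are all made up for illustration.

```python
import math


def cos(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


def score(region, z_pos, z_neg, lam):
    """Contrastive score: best positive match minus weighted worst negative match."""
    best_pos = max(cos(region, z) for z in z_pos)
    worst_neg = max((cos(region, z) for z in z_neg), default=0.0)
    return best_pos - lam * worst_neg


def retrieve(regions, query, target_idx, lam=0.5, max_turns=5):
    """Return (index found, turn at which it was found), or (None, max_turns)."""
    z_pos, z_neg = [query], []  # Intent State at t=0
    for turn in range(max_turns):
        # Re-rank every candidate under the current Intent State.
        top = max(range(len(regions)),
                  key=lambda j: score(regions[j], z_pos, z_neg, lam))
        if top == target_idx:   # simulated user confirms
            return top, turn
        z_neg.append(regions[top])  # simulated user rejects the top candidate
    return None, max_turns


# Toy scene: a distractor that matches the query slightly better than the target.
regions = [[0.9, 0.3, 0.0],   # distractor
           [0.85, 0.0, 0.4],  # target
           [0.0, 1.0, 0.0]]   # unrelated object
print(retrieve(regions, [1.0, 0.0, 0.0], target_idx=1))
```

In this toy scene the distractor wins at turn 0, is rejected, and the target moves to the top at turn 1, mirroring the single-feedback recovery described above.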

3. Key Contributions

Interactive Intent Refinement: The paper reframes object retrieval as a stateful, multi-turn learning problem, addressing the ambiguity limitations of one-shot open-vocabulary detectors.
Intent State Module: Introduction of a dual-memory system ($Z_{pos}$ and $Z_{neg}$) that accumulates user feedback, allowing the model to learn not just what the user wants, but also what they do not want.
Contrastive Disambiguation: A novel ranking function that leverages negative constraints to suppress distractors, enabling the model to distinguish between highly similar objects after a single round of corrective feedback.
Theoretical Guarantee: The authors provide a theoretical analysis showing that the contrastive mechanism can resolve ambiguous cases in which a distractor initially scores higher than the true target, provided a suitable penalty weight $\lambda$ is used.
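
A condition on $\lambda$ of this kind can be sketched directly from the scoring function in the Contrastive Ranking Function subsection. This is a simplified one-distractor derivation, not the paper's full analysis. The symbols $r_\ast$ (target), $r_d$ (distractor), $p_\ast$, $p_d$, and $c$ are introduced here for illustration, and the sketch assumes the only feedback is a single rejection of $r_d$, so $Z_{pos}$ is unchanged and $Z_{neg}^{(1)} = \{r_d\}$.

$p_\ast = \max_{z^+ \in Z_{pos}} \cos(r_\ast, z^+), \quad p_d = \max_{z^+ \in Z_{pos}} \cos(r_d, z^+), \quad p_d > p_\ast, \quad c = \cos(r_\ast, r_d)$

$S(r_d | IS_1) = p_d - \lambda \cdot \cos(r_d, r_d) = p_d - \lambda$

$S(r_\ast | IS_1) = p_\ast - \lambda \cdot c$

$S(r_\ast | IS_1) > S(r_d | IS_1) \iff \lambda (1 - c) > p_d - p_\ast \iff \lambda > \frac{p_d - p_\ast}{1 - c} \quad (c < 1)$

Under these assumptions, the closer the target and distractor sit in embedding space ($c \to 1$), the larger the penalty weight must be for one rejection to flip the ranking.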

4. Experimental Results

The framework was evaluated on large-scale benchmarks, including LVIS, Objects365, and a custom LVIS-Ambiguous subset.

Overall Performance (LVIS):
- IntRec achieved 35.4 AP, outperforming state-of-the-art methods like OVMR (+2.3), CoDet (+3.7), and CAKE (+0.5).
- It showed significant improvements in detecting rare categories (AP(r)).
Ambiguity Resolution (LVIS-Ambiguous):
- This benchmark specifically targets scenes with multiple similar objects.
- While the baseline (Turn-0) scored 14.8 AP, IntRec improved to 22.7 AP after a single round of negative feedback (Turn-1), a +7.9 AP gain.
- This demonstrates a substantial recovery capability compared to the baselines (CoDet and OVMR), which showed minimal improvement.
Transfer Learning:
- In zero-shot transfer settings (trained on LVIS/ImageNet-21k, tested on Objects365 and COCO), IntRec showed significant performance boosts after the first feedback turn, particularly for rare categories.
Efficiency:
- The added latency per interaction is minimal (< 30 ms on an NVIDIA RTX 3090), representing less than 15% of total inference time.
Ablation Studies:
- Removing the Intent State (making the model stateless) caused a 10.8 AP drop, indicating that the memory mechanism is critical.
- Removing negative feedback caused a 5.9 AP drop, supporting the value of contrastive refinement from rejections.
- Using a "max" operator for scoring outperformed an "averaging" strategy, suggesting that focusing on the strongest positive/negative signals is more effective.
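
One plausible intuition for the max-versus-average result, offered here as an illustration rather than the paper's explanation, is that averaging dilutes a strong rejection signal when the negative set also contains unrelated items. The helper names `cos` and `neg_penalty` and the 2-dimensional vectors below are hypothetical.

```python
import math
from statistics import mean


def cos(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


def neg_penalty(region, z_neg, aggregate):
    """Aggregate a candidate's similarities to all rejected exemplars."""
    return aggregate([cos(region, z) for z in z_neg])


candidate = [1.0, 0.0]                # identical to the first rejected exemplar
z_neg = [[1.0, 0.0], [0.0, 1.0]]      # one matching rejection, one unrelated

max_penalty = neg_penalty(candidate, z_neg, max)    # 1.0: the match dominates
mean_penalty = neg_penalty(candidate, z_neg, mean)  # 0.5: diluted by the unrelated item
print(max_penalty, mean_penalty)
```

With the max operator, a candidate that matches any single rejection is penalized in full, however many other items have been rejected.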

5. Significance

Bridging the Gap: IntRec bridges the gap between static open-vocabulary detection and dynamic human-in-the-loop interaction. It addresses a critical failure mode of current systems: the inability to handle "one-of-many" scenarios where several objects look nearly identical.
No Extra Supervision: The system achieves these gains without requiring additional labeled training data or fine-tuning; it adapts at inference time purely through the interaction loop and contrastive refinement.
Practical Application: The low latency and high accuracy make this framework highly suitable for real-world applications such as human-robot collaboration, AR/VR assistance, and advanced visual search, where users often need to specify objects with vague or ambiguous descriptions.
Future Direction: The paper highlights that the current limitation lies in the initial candidate generation (if the detector misses the object entirely, refinement cannot recover it), pointing toward future work on feedback-driven proposal generation.