Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs
The paper proposes SGREC, an interpretable zero-shot referring expression comprehension method that leverages query-driven scene graphs to bridge the gap between low-level visual features and high-level semantic reasoning. This enables Large Language Models to accurately locate target objects and provide detailed explanations without task-specific training.
Imagine you are playing a game of "I Spy" with a friend, but your friend is blindfolded and can only see the world through a very literal, robotic camera. Your job is to describe a specific object in the room so your friend can point to it.
The Problem: If you say, "Find the red cup," a simple robot might get confused if there are three red cups. If you say, "Find the cup next to the sad-looking dog," the robot might not understand what "sad-looking" means or how "next to" works in 3D space.
The Old Way: Previous AI models tried to solve this by just matching words to pictures. They'd look for the word "cup" and the word "red" and guess. But they often missed the context (like the sad dog) or got confused by complex descriptions.
The New Way (SGREC): This paper introduces a new method called SGREC. Think of it as hiring a super-smart translator who doesn't just translate words, but first draws a detailed map of the room before telling the blindfolded friend where to look.
Here is how SGREC works, broken down into three simple steps:
Step 1: The Detective (Finding the Clues)
First, the AI looks at the photo and the sentence you gave it (e.g., "The tall vase with flowers").
Instead of guessing, it acts like a detective. It reads your sentence, picks out the important words ("vase," "flowers," "tall"), and then scans the photo to find every object that might fit that description.
It doesn't just look for the word "vase"; it asks, "Is there a vase? Is it tall? Is it holding flowers?" It gathers a list of suspects.
Step 2: The Cartographer (Drawing the Scene Graph)
This is the magic part. Instead of just showing the AI the raw photo, it builds a Scene Graph.
The Analogy: Imagine taking a photo of a messy living room and turning it into a structured cheat sheet.
This cheat sheet lists every object found in Step 1. For each object, it writes down:
Where it is: Exact coordinates (like GPS for the object).
What it looks like: A detailed description generated by an AI artist (e.g., "A twisted red vase with blue birds painted on it").
How it relates to others: It draws lines between objects, like "The vase is holding the flowers" or "The vase is to the left of the lamp."
This transforms a chaotic image into a neat, organized story. It bridges the gap between "seeing pixels" and "understanding a story."
Step 3: The Judge (The Big Brain)
Now, the AI hands this "cheat sheet" (the Scene Graph) to a Large Language Model (LLM)—basically a super-smart chatbot that is great at reading and reasoning.
You ask the chatbot: "Based on this list of objects and their relationships, which one matches 'the tall vase with flowers'?"
Because the chatbot is reading a clear, structured story (the cheat sheet) rather than staring at a confusing picture, it can easily reason: "Ah, Object #3 is the only one described as 'tall' and 'holding flowers.' Object #1 is short, and Object #2 has no flowers."
The chatbot points to the correct object and explains why it chose it.
Why is this a big deal?
It's "Zero-Shot": This means the AI didn't need to practice on thousands of "vase" or "dog" examples to learn. It uses its general knowledge (like a human who has seen many vases) to figure out new, weird situations instantly.
It's Explainable: Unlike other AI models that just give you a "black box" answer, SGREC can tell you why it picked that object. It's like a teacher showing their work on a math test.
It Handles Complexity: It is much better at understanding tricky queries like "the second lamp from the left" or "the dog that is looking at the cat," because it explicitly maps out those relationships.
In a nutshell: Old AI tried to guess the answer by squinting at the picture. SGREC takes a step back, draws a detailed map of the scene with all the relationships written down, and then asks a super-smart reader to solve the puzzle based on that map. This makes it incredibly good at finding things in pictures, even when it has never seen that specific picture before.
1. Problem Definition
Referring Expression Comprehension (REC) is the task of identifying a specific object in an image based on a natural language query (e.g., "the red vase on the left").
Zero-shot Setting: The challenge addressed is Zero-shot REC, where the model must perform this task without any task-specific training data (fine-tuning). It must generalize to unseen object categories and novel query structures.
Limitations of Existing Methods:
Vision-Language Models (VLMs) like CLIP: While they align text and image features, they often struggle with fine-grained visual details, complex spatial relationships, and logical reasoning about object interactions. They treat images as "bags of words" rather than structured scenes.
Large Language Models (LLMs): While excellent at semantic reasoning, they cannot perceive raw pixels directly; they need an intermediate structured representation that abstracts visual features into textual semantics.
Existing Scene Graph Approaches: Previous methods often rely on fixed predicate classifiers or fuse embeddings, failing to leverage the full reasoning power of LLMs on structured textual representations.
2. Methodology: SGREC Framework
The authors propose SGREC, a framework that bridges the gap between low-level visual perception and high-level semantic reasoning by using Query-driven Scene Graphs as structured intermediaries. The pipeline consists of three main steps (illustrated in Figure 3 of the paper):
Step 1: Query-driven Object Selection
Instead of processing all detected objects, the system filters for objects relevant to the query.
Noun Extraction: Uses SpaCy to extract nouns from the input query.
Category Prediction: Maps extracted nouns to known object categories (e.g., COCO classes).
Subject Inference: Uses a VLM (LLaVA) to infer the "subject" of the query from the image context (e.g., resolving "left thing" to "giraffe").
Selection: Detected objects are retained if their class labels have a high cosine similarity (using Word2Vec) with the extracted nouns, categories, or inferred subjects.
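The selection step above can be sketched as a cosine-similarity filter over word embeddings. This is a minimal illustration, not the paper's implementation: the hand-made toy vectors stand in for Word2Vec embeddings, and the 0.8 threshold is an assumed value.

```python
import math

# Toy word vectors standing in for Word2Vec embeddings (illustrative only).
TOY_VECS = {
    "vase":    [0.9, 0.1, 0.0],
    "jar":     [0.8, 0.2, 0.1],
    "flowers": [0.1, 0.9, 0.2],
    "dog":     [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_objects(detections, query_terms, threshold=0.8):
    """Keep detections whose class label is similar to any extracted query term."""
    kept = []
    for det in detections:
        vec = TOY_VECS.get(det["label"])
        if vec is None:
            continue
        score = max(cosine(vec, TOY_VECS[t]) for t in query_terms if t in TOY_VECS)
        if score >= threshold:
            kept.append(det)
    return kept

detections = [
    {"id": 1, "label": "vase"},
    {"id": 2, "label": "jar"},   # near-synonym: passes the similarity filter
    {"id": 3, "label": "dog"},   # unrelated: filtered out
]
# Query: "the tall vase with flowers" -> nouns from the extraction step.
selected = select_objects(detections, ["vase", "flowers"])
```

Because the match is by embedding similarity rather than exact string equality, near-synonyms like "jar" survive the filter while unrelated classes like "dog" are dropped.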
Step 2: Query-driven Scene Graph Generation
For the selected objects, a structured scene graph is generated to serve as the input for the LLM. This graph includes:
Spatial Information: Bounding box coordinates (x1,y1,x2,y2) are included directly, allowing the LLM to perform numerical reasoning to deduce spatial relationships (e.g., "left," "top," "taller").
Object Captions: Instead of relying on limited attribute lists, the system crops each object region and uses a VLM (LLaVA) to generate a rich, descriptive caption (e.g., "A rustic blue and brown ceramic bowl filled with..."). This captures fine-grained details like texture, state, and specific actions.
Interaction Prediction: The system identifies potential object pairs (based on spatial overlap) and uses the VLM to predict relationship triplets (e.g., [woman, wearing, hat]). To reduce confusion in crowded scenes, the VLM is prompted with visual highlighting (red/blue boxes) for the specific pair.
Output Format: The scene graph is serialized into a JSON format containing IDs, labels, coordinates, attributes, captions, and relationships.
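The scene-graph construction and serialization can be sketched as follows. The exact JSON schema and field names here are assumptions (the paper specifies that IDs, labels, coordinates, attributes, captions, and relationships are included); the captions would come from a VLM such as LLaVA rather than being hard-coded.

```python
import json

def spatial_relation(box_a, box_b):
    """Derive a coarse left/right relation from (x1, y1, x2, y2) box centers,
    the kind of numerical reasoning the LLM can also do from raw coordinates."""
    cx_a = (box_a[0] + box_a[2]) / 2
    cx_b = (box_b[0] + box_b[2]) / 2
    return "left of" if cx_a < cx_b else "right of"

def build_scene_graph(objects, relationships):
    """Serialize selected objects into a JSON scene graph for the LLM."""
    return json.dumps(
        {
            "objects": [
                {
                    "id": o["id"],
                    "label": o["label"],
                    "bbox": o["bbox"],          # (x1, y1, x2, y2)
                    "caption": o["caption"],    # generated by a VLM in the real pipeline
                }
                for o in objects
            ],
            "relationships": relationships,     # triplets: [subject_id, predicate, object_id]
        },
        indent=2,
    )

objects = [
    {"id": 1, "label": "vase", "bbox": [40, 80, 90, 200],
     "caption": "a short white porcelain vase"},
    {"id": 3, "label": "vase", "bbox": [220, 30, 280, 210],
     "caption": "a tall red vase holding flowers"},
]
graph_json = build_scene_graph(objects, [[3, "holding", 4]])
rel = spatial_relation(objects[0]["bbox"], objects[1]["bbox"])
```

Because the coordinates are passed through verbatim, relations like "left of" can either be precomputed as above or deduced by the LLM itself from the numbers.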
Step 3: LLM Inference
The generated JSON scene graph and the original natural language query are fed into a Large Language Model (LLM).
Prompting: The LLM is instructed to analyze the structured data and select the object ID that best matches the query.
Interpretability: The LLM is required to output not just the ID, but a detailed explanation of its reasoning (e.g., "Object 3 is selected because it is the only open silver laptop...").
Final Output: The ID of the selected object is used to retrieve the corresponding bounding box as the final prediction.
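The inference step can be sketched as a prompt builder plus an answer parser. The prompt wording and the `id=...; reason=...` answer format are illustrative assumptions, and the canned reply stands in for a real LLM call.

```python
def build_prompt(scene_graph_json, query):
    """Compose the instruction given to the LLM (wording is illustrative)."""
    return (
        "You are given a scene graph in JSON describing objects in an image.\n"
        f"Scene graph:\n{scene_graph_json}\n\n"
        f'Query: "{query}"\n'
        "Answer with the matching object id and an explanation, "
        "formatted as: id=<number>; reason=<text>"
    )

def parse_answer(llm_output):
    """Extract the predicted object id and the free-text explanation."""
    id_part, reason_part = llm_output.split(";", 1)
    obj_id = int(id_part.split("=")[1])
    reason = reason_part.split("=", 1)[1].strip()
    return obj_id, reason

# A canned response standing in for the actual LLM output.
reply = "id=3; reason=Object 3 is the only tall vase holding flowers."
obj_id, reason = parse_answer(reply)
# obj_id is then used to look up the bounding box of object 3 as the prediction.
```

The parsed ID drives the final bounding-box lookup, while the reason string is what makes the prediction human-auditable.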
3. Key Contributions
Novel Framework: Introduction of SGREC, the first zero-shot REC method to explicitly integrate query-driven scene graphs with LLMs for interpretable object localization.
Structured Intermediary: Development of a scene graph generation module that captures spatial coordinates, rich object captions, and semantic interactions, effectively bridging the visual-textual gap without fine-tuning.
State-of-the-Art Performance: The method achieves leading performance on standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) in a zero-shot setting, outperforming both zero-shot VLM baselines and some weakly-supervised methods.
Interpretability: The framework provides human-readable explanations for its decisions, enhancing trust and debuggability.
4. Experimental Results
The method was evaluated on RefCOCO, RefCOCO+, and RefCOCOg datasets.
Performance:
RefCOCO val: 66.78% (Top-1 Accuracy)
RefCOCO+ testB: 53.43%
RefCOCOg val: 73.28%
SGREC outperforms strong zero-shot baselines like ZeroshotREC(VRCLIP) and MCCE-REC by significant margins (e.g., >10% on RefCOCOg).
It even surpasses some fully-supervised and weakly-supervised methods (e.g., outperforming CPL by >13% on RefCOCOg), demonstrating the power of the structured reasoning approach.
Ablation Studies:
Object Selection: Combining Nouns, Categories, and Subject Inference yields the best results.
Graph Components: Adding Object Captions and Interaction relationships significantly boosts performance, especially on datasets requiring fine-grained appearance or relational reasoning (RefCOCO+ and RefCOCOg).
LLM Size: Larger models (Qwen-72B, LLaMA-70B) perform better, but even smaller models (Qwen-7B) show substantial gains over baselines, proving the efficacy of the scene graph structure itself.
Robustness: The model remains robust across different decoding parameters and handles dense scenes (20 or more objects) effectively, though performance slightly degrades as scene density increases.
5. Significance and Conclusion
Paradigm Shift: SGREC shifts the paradigm from direct feature alignment (pixel-to-text) to structured reasoning. By converting visual scenes into structured text (JSON scene graphs), it unlocks the full reasoning capabilities of pre-trained LLMs for visual tasks.
Zero-Shot Viability: It demonstrates that high-performance REC is achievable without task-specific training data, relying instead on the generalization capabilities of foundation models (VLMs for perception, LLMs for reasoning).
Interpretability: Unlike "black box" deep learning models, SGREC provides explicit reasoning chains, making it suitable for applications where understanding why a decision was made is crucial.
Limitations: The primary trade-off is computational cost, as the pipeline requires running both a VLM (for graph generation) and an LLM (for inference) per image. Future work aims to unify these architectures for better efficiency.
In summary, SGREC effectively solves the zero-shot REC problem by decomposing the task into visual grounding, structured scene graph construction, and textual reasoning, achieving state-of-the-art results while maintaining high interpretability.