Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs

The paper proposes SGREC, an interpretable zero-shot referring expression comprehension method built on query-driven scene graphs. By bridging the gap between low-level visual features and high-level semantic reasoning, SGREC enables large language models to locate target objects accurately and explain their choices, all without task-specific training.

Yike Wu, Necva Bolucu, Stephen Wan, Dadong Wang, Jiahao Xia, Jian Zhang

Published 2026-03-27

Imagine you are playing a game of "I Spy" with a friend, but your friend is blindfolded and can only see the world through a very literal, robotic camera. Your job is to describe a specific object in the room so your friend can point to it.

  • The Problem: If you say, "Find the red cup," a simple robot might get confused if there are three red cups. If you say, "Find the cup next to the sad-looking dog," the robot might not understand what "sad-looking" means or how "next to" works in 3D space.
  • The Old Way: Previous AI models tried to solve this by just matching words to pictures. They'd look for the word "cup" and the word "red" and guess. But they often missed the context (like the sad dog) or got confused by complex descriptions.
  • The New Way (SGREC): This paper introduces a new method called SGREC. Think of it as hiring a super-smart translator who doesn't just translate words, but first draws a detailed map of the room before telling the blindfolded friend where to look.

Here is how SGREC works, broken down into three simple steps:

Step 1: The Detective (Finding the Clues)

First, the AI looks at the photo and the sentence you gave it (e.g., "The tall vase with flowers").

  • Instead of guessing, it acts like a detective. It reads your sentence, picks out the important words ("vase," "flowers," "tall"), and then scans the photo to find every object that might fit that description.
  • It doesn't just look for the word "vase"; it asks, "Is there a vase? Is it tall? Is it holding flowers?" It gathers a list of suspects.
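The detective step can be sketched as a toy parser that splits a referring expression into a target noun, its attributes, and related objects. A caveat: the real system presumably uses an LLM or a proper linguistic parser for this; the hand-rolled word lists below are purely illustrative assumptions, not part of SGREC.

```python
# Toy sketch of Step 1: pull the target noun, attributes, and related
# objects out of a referring expression. The word lists are illustrative
# assumptions; a real system would use an LLM or dependency parser.

ATTRIBUTES = {"tall", "short", "red", "twisted", "sad-looking"}
RELATION_WORDS = {"with", "holding", "next", "to", "left", "right", "of"}
STOP_WORDS = {"the", "a", "an"}

def parse_query(query: str) -> dict:
    words = [w.strip(".,").lower() for w in query.split()]
    attrs = [w for w in words if w in ATTRIBUTES]
    # Anything that is not an attribute, relation word, or article is
    # treated as a noun; the first noun is the target, the rest are clues.
    skip = STOP_WORDS | ATTRIBUTES | RELATION_WORDS
    nouns = [w for w in words if w not in skip]
    return {"target": nouns[0], "attributes": attrs, "related": nouns[1:]}

parsed = parse_query("The tall vase with flowers")
# parsed == {"target": "vase", "attributes": ["tall"], "related": ["flowers"]}
```

The output of this step is the "list of suspects": every detected object whose label matches the target noun or one of the related nouns.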

Step 2: The Cartographer (Drawing the Scene Graph)

This is the magic part. Instead of just showing the AI the raw photo, it builds a Scene Graph.

  • The Analogy: Imagine taking a photo of a messy living room and turning it into a structured cheat sheet.
  • This cheat sheet lists every object found in Step 1. For each object, it writes down:
    • Where it is: Exact coordinates (like GPS for the object).
    • What it looks like: A detailed description generated by an AI artist (e.g., "A twisted red vase with blue birds painted on it").
    • How it relates to others: It draws lines between objects, like "The vase is holding the flowers" or "The vase is to the left of the lamp."
  • This transforms a chaotic image into a neat, organized story. It bridges the gap between "seeing pixels" and "understanding a story."
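The "cheat sheet" above is easiest to picture as a small data structure: a list of objects (id, box, description) plus a list of typed relations between them, flattened into text for the next step. The field names and serialization format below are illustrative assumptions, not the paper's actual schema.

```python
# Minimal sketch of the Step 2 scene graph: objects carry an id, a
# bounding box, and a generated description; relations link object ids.
# Field names and format are assumptions for illustration only.

scene_graph = {
    "objects": [
        {"id": 1, "label": "vase", "bbox": [40, 200, 90, 260],
         "description": "a short white vase, empty"},
        {"id": 2, "label": "vase", "bbox": [300, 60, 360, 180],
         "description": "a tall twisted red vase with blue birds painted on it"},
        {"id": 3, "label": "flowers", "bbox": [310, 20, 355, 70],
         "description": "a bunch of yellow flowers"},
        {"id": 4, "label": "lamp", "bbox": [420, 50, 470, 190],
         "description": "a metal floor lamp"},
    ],
    "relations": [
        {"subject": 2, "predicate": "holding", "object": 3},
        {"subject": 2, "predicate": "left of", "object": 4},
    ],
}

def serialize(graph: dict) -> str:
    """Flatten the graph into the structured text a language model reads."""
    lines = [f"[{o['id']}] {o['label']}: {o['description']} at {o['bbox']}"
             for o in graph["objects"]]
    label = {o["id"]: o["label"] for o in graph["objects"]}
    lines += [f"{label[r['subject']]} [{r['subject']}] {r['predicate']} "
              f"{label[r['object']]} [{r['object']}]"
              for r in graph["relations"]]
    return "\n".join(lines)
```

Serializing the graph yields one line per object and one per relation, which is exactly the "neat, organized story" handed to the next step.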

Step 3: The Judge (The Big Brain)

Now, the AI hands this "cheat sheet" (the Scene Graph) to a Large Language Model (LLM)—basically a super-smart chatbot that is great at reading and reasoning.

  • You ask the chatbot: "Based on this list of objects and their relationships, which one matches 'the tall vase with flowers'?"
  • Because the chatbot is reading a clear, structured story (the cheat sheet) rather than staring at a confusing picture, it can easily reason: "Ah, Object #3 is the only one described as 'tall' and 'holding flowers.' Object #1 is short, and Object #2 has no flowers."
  • The chatbot points to the correct object and explains why it chose it.
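The hand-off to the judge amounts to packing the serialized scene graph and the original query into a single prompt. The template below is an illustrative assumption (the paper's actual prompt will differ), and the resulting string would be sent to any chat-completion API.

```python
# Sketch of Step 3: combine the serialized scene graph with the query
# into an LLM prompt. The wording of the template is an assumption, not
# SGREC's actual prompt.

def build_prompt(scene_text: str, query: str) -> str:
    return (
        "You are given a list of objects and their relationships:\n"
        f"{scene_text}\n\n"
        f"Which object id best matches: '{query}'?\n"
        "Answer with the id and a one-sentence explanation."
    )
```

Because the model answers with both an id and a justification, the explanation falls out of the same step that produces the prediction.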

Why is this a big deal?

  1. It's "Zero-Shot": This means the AI didn't need to practice on thousands of "vase" or "dog" examples to learn. It uses its general knowledge (like a human who has seen many vases) to figure out new, weird situations instantly.
  2. It's Explainable: Unlike other AI models that just give you a "black box" answer, SGREC can tell you why it picked that object. It's like a teacher showing their work on a math test.
  3. It Handles Complexity: It is much better at understanding tricky queries like "the second lamp from the left" or "the dog that is looking at the cat," because it explicitly maps out those relationships.

In a nutshell:
Old AI tried to guess the answer by squinting at the picture. SGREC takes a step back, draws a detailed map of the scene with all the relationships written down, and then asks a super-smart reader to solve the puzzle based on that map. This makes it incredibly good at finding things in pictures, even when it has never seen that specific picture before.
