Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

This paper introduces Ref-Adv, a challenging benchmark for Referring Expression Comprehension designed to eliminate shortcut solutions and expose significant gaps in visual reasoning and grounding capabilities of current multimodal LLMs.

Qihua Dong, Kuo Yang, Lin Ju, Handong Zhao, Yitian Zhang, Yizhou Wang, Huimin Zeng, Jianglin Lu, Yun Fu

Published 2026-03-02

Imagine you are playing a game of "Find the Hidden Object" with a very smart, but slightly lazy, robot friend.

The Old Game: "The Easy Mode"

For years, researchers tested these robots using a standard game called Referring Expression Comprehension (REC). The rules were simple: You show the robot a picture and say, "Find the red ball." The robot has to point to the red ball.
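For readers curious how "pointing" is actually scored: REC benchmarks typically have the model output a bounding box, and the answer counts as correct when the predicted box overlaps the ground-truth box enough, conventionally an intersection-over-union (IoU) of at least 0.5. A minimal sketch of that scoring rule (helper names are illustrative, not from the paper):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def is_correct(predicted_box, ground_truth_box, threshold=0.5):
    """The usual REC convention: a prediction is correct if IoU >= 0.5."""
    return iou(predicted_box, ground_truth_box) >= threshold

# A prediction that mostly overlaps the true box passes:
print(is_correct((10, 10, 50, 50), (12, 12, 52, 52)))  # → True
```

Note that this criterion only checks *where* the model pointed, not *why* — which is exactly how shortcut-taking models can still score well.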

The problem? The old game was too easy. It was like playing hide-and-seek in an empty room with only one person hiding.

  • Too Short: The clues were tiny, like just saying "Dog" instead of "The fluffy dog wearing a hat."
  • No Competition: Usually, there was only one dog in the picture. The robot didn't need to think; it just looked for "dog" and pointed.
  • The Cheat Code: Because the clues were so simple and there were no other dogs to confuse things, the robot could guess the answer without actually understanding the sentence. It was like acing a multiple-choice test where only one answer option is ever printed.

Even though the robots were getting 90%+ scores, they were actually just "cheating" by taking shortcuts. They weren't really seeing or reasoning; they were just pattern matching.

The New Game: "Ref-Adv" (The Hard Mode)

The authors of this paper, a team from Northeastern University, decided to build a new, tougher version of the game called Ref-Adv. They wanted to see if the robots could actually think or if they would crash when the game got real.

Here is how they made the game harder, using some fun analogies:

1. The "Crowded Room" Analogy (Hard Distractors)
In the old game, if you said "Find the man," there was usually only one man. In Ref-Adv, the picture is a crowded party with 10 men.

  • The Twist: They don't just ask for "a man." They ask for "The man in the blue shirt who is NOT holding a drink, standing next to the woman with the red hat."
  • The Challenge: The robot has to ignore the other 9 men. If it just looks for "man," it fails. It has to process every detail of the sentence to find the one specific guy.

2. The "Minimalist Clue" Analogy (No Redundancy)
In the old game, clues were often over-the-top, like "The big, red, shiny, round, delicious apple on the table." Even if you ignored the words "big," "shiny," and "delicious," you could still find the apple because there was only one apple.

  • The Twist: In Ref-Adv, every word in the clue is essential. If you remove the word "blue" from "The blue car," the robot might pick the wrong car. The clues are "minimally sufficient"—just enough information to solve the puzzle, but no extra fluff to hide behind.
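The "minimally sufficient" idea can be made concrete with a toy check: treat each object in the scene, and the clue itself, as a set of attributes, then ask whether the clue picks out exactly one object *and* whether dropping any single attribute would make it ambiguous. This is an illustrative sketch, not the authors' actual data-construction pipeline:

```python
def matches(expression, obj):
    """An object matches if it has every attribute the expression requires."""
    return expression <= obj

def is_minimally_sufficient(expression, scene):
    """True if the expression singles out exactly one object in the scene,
    and removing ANY one attribute would make it ambiguous."""
    hits = [obj for obj in scene if matches(expression, obj)]
    if len(hits) != 1:
        return False
    for attr in expression:
        reduced = expression - {attr}
        if sum(1 for obj in scene if matches(reduced, obj)) == 1:
            return False  # this attribute was redundant fluff
    return True

scene = [
    {"car", "blue"},
    {"car", "red"},
    {"truck", "blue"},
]
print(is_minimally_sufficient({"car", "blue"}, scene))  # → True
print(is_minimally_sufficient({"car", "red"}, scene))   # → False: "red" alone already suffices
```

In the second call, "red" by itself uniquely identifies the object, so "the red car" carries redundant information — exactly the kind of slack the old datasets left for shortcut-taking models.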

3. The "Negation" Trap
They added tricky clues like "Find the person who is NOT wearing a tie."

  • The Challenge: This forces the robot to look at everyone in the picture, check who is wearing a tie, and then mentally cross them out to find the one who isn't. It's a logic puzzle, not just a search.
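The logic puzzle above can be sketched as a simple filter over candidate objects: keep everything that has the required attributes and *lacks* the negated one. Again, a toy illustration with hypothetical names and structure, not the benchmark's code:

```python
def find_referent(scene, required=None, forbidden=None):
    """Keep objects that have all required attributes and NONE of the
    forbidden ones (the negated part of the clue)."""
    required = set(required or [])
    forbidden = set(forbidden or [])
    return [obj for obj in scene
            if required <= obj["attrs"] and not (forbidden & obj["attrs"])]

party = [
    {"name": "person_1", "attrs": {"person", "tie"}},
    {"name": "person_2", "attrs": {"person", "tie"}},
    {"name": "person_3", "attrs": {"person", "glasses"}},
]

# "Find the person who is NOT wearing a tie"
hits = find_referent(party, required={"person"}, forbidden={"tie"})
print([h["name"] for h in hits])  # → ['person_3']
```

A keyword-matching model that only looks for "person" and "tie" would happily point at someone wearing a tie; the negation only works if every candidate is checked and eliminated.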

The Results: The Robots Got Stumped

The researchers tested 13 of the smartest AI robots in the world (like GPT-4o, Gemini, and Qwen) on this new game.

  • On the old game (RefCOCO): The robots were champions, scoring over 90%. They looked like geniuses.
  • On the new game (Ref-Adv): Their scores plummeted. Many dropped to around 50% or lower.

What does this mean?
It's like a student who memorized the answers to a practice test with easy questions. When they took the real exam with tricky, multi-step logic problems, they failed. The robots were relying on "shortcuts" (guessing based on simple keywords) rather than genuine visual reasoning.

The "Thinking" Tool (Chain of Thought)

The researchers also tried giving the robots a "thinking tool" (called Chain of Thought), which forces them to talk through their steps out loud before answering.

  • Result: It helped a little, but not enough to fix the problem. The robots still struggled to connect the complex sentence to the crowded image perfectly.
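In practice, Chain of Thought mostly means wrapping the question in a prompt that asks the model to spell out intermediate steps before committing to an answer. A sketch of what such a prompt *might* look like (this exact wording is illustrative; the paper's actual prompts may differ):

```python
def build_cot_prompt(expression):
    """Wrap a referring expression in a chain-of-thought style prompt
    that asks the model to reason step by step before answering.
    The wording here is a hypothetical example, not the paper's prompt."""
    return (
        "Look at the image and find: " + expression + "\n"
        "Step 1: List every candidate object that partially matches.\n"
        "Step 2: Check each clue in the expression against each candidate.\n"
        "Step 3: Output the bounding box (x1, y1, x2, y2) of the single "
        "object that matches ALL clues."
    )

print(build_cot_prompt("the man in the blue shirt who is NOT holding a drink"))
```

The prompt nudges the model toward the eliminate-the-distractors procedure a human would use, but as the results show, writing out steps doesn't guarantee the model actually grounds each step in the image.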

The Big Takeaway

This paper is a wake-up call. It tells us that a high score on a standard test doesn't mean an AI truly understands the world.

Ref-Adv is the new "stress test" for AI. It's the difference between a robot that can say "I see a cat" and a robot that can say, "I see a cat, but that's not the one you want; you want the cat sleeping on the windowsill, not the one chasing the laser pointer."

The authors hope this new benchmark will force AI developers to build smarter, more reasoning-capable robots that can handle the messy, crowded, and complex reality of the real world, rather than just the clean, simple world of old datasets.
