MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions

The paper introduces MIRROR, a framework that enhances Vision-Language Models' reasoning and reduces hallucinations by implementing a closed-loop iterative process of drafting, critiquing, and revising answers through explicit region-based visual verification, supported by the newly constructed ReflectV dataset.

Haoyu Zhang, Yuwei Wu, Pengxiang Li, Xintong Zhang, Zhi Gao, Rui Gao, Mingyang Gao, Che Sun, Yunde Jia

Published 2026-02-25
📖 4 min read☕ Coffee break read

Imagine you are looking at a complex picture, like a busy street scene or a tricky math diagram, and someone asks you a question about it.

The Problem: The "Confident but Wrong" AI
Current AI models (Vision-Language Models) are like students who are very good at reading but sometimes terrible at looking. If you ask them, "How many planes are in this picture?", they might confidently say "Four" because that's what their training data suggests is a common number. But if you look closely, there are actually five planes, with one hiding behind another.

The AI doesn't "see" the hidden plane. It just guesses based on language patterns. This is called a hallucination. Even if you tell the AI, "Wait, think again," it usually just rewrites its answer using more words, still missing the visual clue. It's like a student who, when told their answer is wrong, just changes the sentence structure without actually checking the textbook again.

The Solution: MIRROR (The "Look-Back" AI)
The paper introduces MIRROR, a new way to teach AI how to "look again."

Think of MIRROR as a detective who doesn't just guess; they have a magnifying glass and a highlighter.

Here is how MIRROR works, step-by-step, using a simple analogy:

1. The First Guess (The Draft)

The AI looks at the image and gives an answer.

  • Analogy: You take a quick glance at a messy room and guess, "There are 5 toys on the floor."

2. The Self-Check (The Reflection)

Instead of just moving on, MIRROR pauses and asks itself: "Am I sure? Did I miss anything?"

  • Analogy: You stop and think, "Wait, I might have missed the toy under the rug. I shouldn't just guess; I need to be sure."

3. The "Look-Back" (The Visual Tool)

This is the magic part. MIRROR doesn't just think about the missing toy; it actively points a digital highlighter at the specific spot on the image where it thinks it might be wrong. It draws a circle or a dot on the screen to force itself to look exactly there.

  • Analogy: You physically point your finger at the rug and say, "Let me check this specific spot." The AI uses a tool to draw a circle around the hidden plane or the specific letter it missed.

4. The Correction (The Revision)

Now that the AI has "zoomed in" on the highlighted area, it sees the evidence clearly. It updates its answer.

  • Analogy: You see the toy under the rug. You say, "Ah! There's a sixth toy! My answer was wrong. The correct number is 6."

Why is this different?

Previous methods were like a student rewriting an essay to sound smarter without checking the facts. MIRROR is like a student who, when unsure, opens the textbook, finds the exact page, and reads the evidence before writing the final answer.

The "Training School" (ReflectV)

To teach the AI this skill, the researchers built a special dataset called ReflectV.

  • Analogy: Imagine a teacher who doesn't just grade a student's test. Instead, the teacher creates a "replay" of the student's mistakes. The teacher says, "You missed this part. Here is a red circle around the mistake. Now, look at the red circle and tell me what you see."
  • The AI practices this thousands of times, learning that the only way to get a good grade is to point at the image and verify the details, not just guess.

The Result

When tested, MIRROR is much better at:

  • Counting things (not missing the hidden objects).
  • Reading text in images (not hallucinating words that aren't there).
  • Solving logic puzzles (checking the visual evidence before concluding).

In short: MIRROR teaches AI to stop guessing and start verifying. It turns the AI from a confident guesser into a careful investigator who uses a digital highlighter to ensure every answer is grounded in what is actually visible in the picture.

Get papers like this in your inbox

Personalized daily or weekly digests matching your interests. Gists or technical summaries, in your language.

Try Digest →