See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs

This paper proposes "See It, Say It, Sorted," a lightweight, training-free, plug-and-play framework that mitigates visual hallucination in large vision-language models. It iteratively supervises each reasoning step with dynamically extracted visual evidence, significantly improving reasoning accuracy without any additional model training.

Yongchang Zhang, Oliver Ma, Tianyi Liu, Guangquan Zhou, Yang Chen

Published 2026-03-10
📖 4 min read · ☕ Coffee break read

Imagine you are trying to solve a complex puzzle while looking at a picture, but you have to describe the solution out loud, step-by-step. This is what Large Vision-Language Models (LVLMs) do: they look at an image and "think" through a problem using words.

The paper introduces a new method called ECRD (Evidence-Constrained Reweighted Decoding). Think of it as a "See It, Say It, Sorted" system. Here is how it works, explained through simple analogies:

The Problem: The "Whispering Gallery" Effect

Imagine you are in a long hallway (the reasoning chain) trying to describe a painting at the very end. As you walk down the hall, you start repeating what you think you saw, rather than what is actually there.

  • The Issue: If you make one small mistake early on (e.g., "I think that's a red car"), your brain gets convinced by your own voice. Even if you look at the picture again later, your brain ignores the truth because you've already built a story around the "red car."
  • The Result: The model gets confident but wrong. It hallucinates (makes things up) because it lost track of the visual evidence.

The Old Solution: Hiring a Detective (Too Expensive)

Previously, researchers tried to fix this by training the model to be a detective. They taught the AI to stop, zoom in on specific parts of the image, crop them out, and re-examine them.

  • The Downside: This is like hiring a full-time detective for every single step of your puzzle. It's slow, expensive, and requires special training for every new type of puzzle.

The New Solution: The "Fact-Checker" and the "Flashlight"

The authors propose a lightweight, training-free framework. They don't retrain the AI; they just give it a better way to think while it's answering.

1. The "Evidence Pool" (The Fact-Checker)

Imagine the AI is writing a story. Next to it sits a Fact-Checker holding a notepad called the Evidence Pool.

  • Every time the AI writes a sentence, the Fact-Checker looks at the notepad.
  • If the AI says, "The car is red," the Fact-Checker checks the notepad. If the notepad says, "I saw a blue car," the Fact-Checker gently nudges the AI: "Hey, are you sure? The evidence says blue."
  • The AI then adjusts its confidence. It doesn't stop talking; it just shifts its probability to make the "blue" answer more likely.
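The Fact-Checker's "gentle nudge" can be sketched as logit reweighting at decode time. This is an illustrative sketch, not the paper's implementation: the names `reweight` and `evidence_pool`, the word-overlap check, and the `bonus` value are all assumptions standing in for ECRD's actual evidence-matching step.

```python
# Hypothetical sketch of evidence-constrained reweighted decoding.
# The word-overlap match below is a toy stand-in for real evidence scoring.
import math

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def reweight(logits, evidence_pool, bonus=2.0):
    """Boost candidate tokens supported by the evidence pool.

    The model keeps generating; we only shift probability mass
    toward tokens that agree with what was actually seen.
    """
    supported = {tok for note in evidence_pool for tok in note.lower().split()}
    adjusted = {
        tok: logit + (bonus if tok.lower() in supported else 0.0)
        for tok, logit in logits.items()
    }
    return softmax(adjusted)

# The model is tempted to say "red", but the evidence pool says "blue".
logits = {"red": 2.0, "blue": 1.5, "green": 0.1}
evidence = ["I saw a blue car parked outside"]
probs = reweight(logits, evidence)
assert probs["blue"] > probs["red"]  # the evidence shifts the odds
```

Note that the output is still a probability distribution: the model is nudged, not overruled, which is exactly the "it doesn't stop talking" behavior described above.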

2. The "Visual Decider" (The Flashlight)

Sometimes, the Fact-Checker isn't sure either. The AI is stuck between two answers (e.g., "Is it 3 or 5?"), and the notepad doesn't have enough info.

  • The Trigger: This is where the Visual Decider steps in. Think of it as a Flashlight.
  • Instead of re-reading the whole picture, the Flashlight shines only on the specific confusing spot.
  • It takes a quick snapshot, writes a tiny note (e.g., "The number behind the box is 3"), and adds that note to the Evidence Pool.
  • Now, the AI has a new fact. It uses that fact to finish the rest of the story correctly.
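The on-demand trigger can be sketched as an entropy check on the next-token distribution. Again a hedged sketch: `decode_step`, `query_region`, and the threshold value are illustrative placeholders for the paper's actual Visual Decider, not its API.

```python
# Hedged sketch of an uncertainty-triggered "flashlight" lookup.
# query_region is a stand-in for a real focused visual-grounding call.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def decode_step(probs, evidence_pool, query_region, threshold=0.6):
    """If the model is torn between candidates, query the image once,
    append the new note to the pool, and let later steps use it."""
    if entropy(probs) > threshold:          # model is confused
        note = query_region()               # flashlight: one focused look
        evidence_pool.append(note)          # grow the evidence pool
    return max(probs, key=probs.get)

pool = []
uncertain = {"3": 0.48, "5": 0.47, "7": 0.05}   # "Is it 3 or 5?"
decode_step(uncertain, pool, query_region=lambda: "the number behind the box is 3")
assert pool == ["the number behind the box is 3"]  # flashlight fired once
```

The key design point is the gate: a confident, low-entropy step skips `query_region` entirely, which is why the method stays cheap compared with cropping and re-examining the image at every step.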

Why is this special?

  1. It's "Plug-and-Play": You don't need to retrain the AI. You just attach this "Fact-Checker" system to any existing model, like adding a new app to your phone.
  2. It's Efficient: The Flashlight only turns on when the AI is confused. If the AI is confident, it keeps talking without interruption. This saves time and money.
  3. It Stops the "Domino Effect": By catching small errors early with the Flashlight, it prevents the whole chain of reasoning from collapsing.

The Result

When they tested this on various benchmarks (like counting objects, reading charts, or spotting hidden details), performance improved across the board.

  • It made fewer mistakes (hallucinations).
  • It got more correct answers.
  • It did all this without needing a massive, expensive retraining session.

In short: The paper teaches AI to stop and check its facts against the image while it thinks, using a smart, on-demand "Flashlight" to clear up confusion, ensuring the final answer is grounded in reality, not just a confident guess.