Selective Training for Large Vision Language Models via Visual Information Gain

This paper introduces Visual Information Gain (VIG), a perplexity-based metric that quantifies the contribution of visual input to prediction uncertainty, and leverages it to develop a selective training scheme that prioritizes visually informative samples and tokens to effectively mitigate language bias in Large Vision Language Models with reduced supervision.

Seulbi Lee, Sangheum Hwang

Published 2026-02-20

Imagine you have a brilliant student named Visionary. Visionary is incredibly smart with words; they can write poetry, tell jokes, and answer trivia questions just by reading a book. However, Visionary has a bad habit: when you show them a picture and ask a question about it, they often ignore the picture entirely. Instead, they guess the answer based on what they think the picture should contain, relying on their memory of words rather than looking at the actual image.

If you show Visionary a picture of a cat and ask, "What animal is this?", they might confidently say "Dog" because they've read a million stories about dogs, even though the picture clearly shows a cat. This is called Language Bias.

This paper introduces a new teaching method to fix this. Here is the breakdown, using simple analogies:

1. The Problem: The "Lazy Student"

Currently, when training these AI models (called LVLMs, for Large Vision Language Models), we feed them millions of examples. Some examples are easy (e.g., "What color is the sky?" -> "Blue," which you can guess without looking). Some are hard (e.g., "What color is the specific bird in the corner?" -> "Red," which you must look at to know).

The problem is that the AI treats all these examples the same. It learns to take "shortcuts." It realizes it can often get the right answer just by reading the question and guessing, without actually looking at the image. It becomes a "text-only" model pretending to be a "vision" model.

2. The Solution: The "Visual Information Gain" (VIG) Score

The authors created a new tool called Visual Information Gain (VIG). Think of VIG as a "Surprise Meter" or a "Need-to-Look" Score.

  • How it works: The system asks the AI two questions:

    1. "What is the answer if I don't show you the picture?" (The AI guesses based on text).
    2. "What is the answer if I do show you the picture?" (The AI looks at the image).
  • The Score:

    • If the AI's answer changes significantly and becomes much more accurate when it sees the picture, the VIG score is high. This means the picture was essential to get the right answer.
    • If the AI gives the same answer (or a confident wrong one) whether it sees the picture or not, the VIG score is low. This means the picture didn't help; the AI was just guessing based on words.
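The two-pass idea above can be sketched in a few lines. The paper defines VIG via perplexity; here is a minimal, illustrative version that scores VIG as the drop in log-perplexity of the correct answer when the image is provided. The function name and the toy log-probabilities are assumptions for illustration, not the paper's exact formulation:

```python
import math

def vig_score(logp_with_image, logp_text_only):
    """Visual Information Gain as a log-perplexity gap (sketch).

    Both arguments are per-token log-probabilities that the model assigns
    to the *correct* answer tokens. A high positive score means the image
    made the model much more confident in the right answer, i.e. the
    picture was essential; a score near zero means the text alone sufficed.
    """
    assert len(logp_with_image) == len(logp_text_only)
    n = len(logp_with_image)
    # Average negative log-likelihood per token = log-perplexity.
    nll_with = -sum(logp_with_image) / n
    nll_without = -sum(logp_text_only) / n
    # How much seeing the image reduced the model's perplexity (in nats).
    return nll_without - nll_with

# Toy example: the model is far more confident once it sees the image.
with_img = [math.log(0.9), math.log(0.8)]
without_img = [math.log(0.2), math.log(0.1)]
print(round(vig_score(with_img, without_img), 3))
```

In a real setup, the two log-probability lists would come from two forward passes of the same LVLM: one with the image tokens present and one with them removed or masked.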

3. The New Teaching Strategy: "Selective Training"

Instead of making Visionary study every single page of a textbook, the authors use the VIG score to create a customized study guide.

  • Filtering the Samples (The "What"): They throw away the "boring" examples where the AI didn't need to look at the picture to answer. They keep only the "juicy" examples where the picture was crucial.
  • Filtering the Words (The "How"): Even within a good picture, some words don't need the picture to be understood (like the word "the" or "and"). The system identifies the specific words that do need the picture (like "red," "left," "flying") and tells the AI: "Focus your brain power only on learning these specific words in relation to the image."
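Put together, the two filters above amount to ranking samples by VIG and masking low-VIG tokens out of the training loss. A minimal sketch, where the dictionary keys, the 70% keep ratio, and the token threshold are illustrative assumptions rather than the paper's exact hyperparameters:

```python
def select_training_signal(samples, sample_keep_ratio=0.7, token_threshold=0.5):
    """Keep the most visually informative samples, then mask out
    uninformative tokens within them (illustrative sketch).

    Each sample is a dict with a sample-level 'vig' score and a list of
    (token, token_vig) pairs under 'tokens'.
    """
    # Sample-level filtering: keep only the top fraction by VIG.
    ranked = sorted(samples, key=lambda s: s["vig"], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * sample_keep_ratio))]
    # Token-level filtering: compute a 0/1 loss mask so training
    # focuses only on tokens that actually needed the image.
    for s in kept:
        s["loss_mask"] = [1 if token_vig >= token_threshold else 0
                          for _, token_vig in s["tokens"]]
    return kept

# Toy usage: "red" and "left" need the image; "the" does not.
samples = [
    {"vig": 2.0, "tokens": [("red", 1.3), ("the", 0.0)]},
    {"vig": 0.1, "tokens": [("blue", 0.2)]},
    {"vig": 1.5, "tokens": [("left", 0.9)]},
]
kept = select_training_signal(samples)
print(len(kept), kept[0]["loss_mask"])
```

During fine-tuning, the `loss_mask` would simply zero out the cross-entropy terms for the masked tokens, so gradient updates flow only through the visually grounded words.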

4. The Result: A Smarter, Faster Learner

By using this method, the AI learns much faster and better.

  • Less is More: The paper shows that by training on only 70% of the data (and focusing on even fewer specific words), the AI actually performs better than if it had studied 100% of the data.
  • No More Hallucinations: The AI stops making things up. If it sees a picture of a dog, it stops saying "There is a cat in the background" just because it's used to seeing cats in stories. It learns to trust its eyes.
  • Better Attention: The AI starts actually "looking" at the image parts that matter, rather than staring blankly while its brain processes text.

The Big Picture Metaphor

Imagine you are teaching a child to identify fruits.

  • Old Way: You show them 1,000 flashcards. Some say "What is this?" with a picture of an apple, but the child just memorizes the word "Apple" without looking. You waste time on cards where the answer is obvious from the text alone.
  • New Way (VIG): You use a "Surprise Meter." You only keep the cards where the child had to look at the picture to get it right (e.g., distinguishing a green apple from a red one). You also tell the child, "Don't worry about the word 'a' or 'the'; focus only on the word 'red' and the shape."

In short: This paper teaches AI to stop guessing based on words and start actually seeing by identifying and prioritizing the moments where the image truly matters. It's like turning a model that "reads the room" into one that actually "sees the room."
