Selective Training for Large Vision Language Models via Visual Information Gain

This paper introduces Visual Information Gain (VIG), a perplexity-based metric that quantifies the contribution of visual input to prediction uncertainty, and leverages it to develop a selective training scheme that prioritizes visually informative samples and tokens to effectively mitigate language bias in Large Vision Language Models with reduced supervision.

Seulbi Lee, Sangheum Hwang

Published 2026-02-20

Imagine you have a brilliant student named Visionary. Visionary is incredibly smart with words; they can write poetry, tell jokes, and answer trivia questions just by reading a book. However, Visionary has a bad habit: when you show them a picture and ask a question about it, they often ignore the picture entirely. Instead, they guess the answer based on what they think the picture should contain, relying on their memory of words rather than looking at the actual image.

If you show Visionary a picture of a cat and ask, "What animal is this?", they might confidently say "Dog" because they've read a million stories about dogs, even though the picture clearly shows a cat. This is called Language Bias.

This paper introduces a new teaching method to fix this. Here is the breakdown, using simple analogies:

1. The Problem: The "Lazy Student"

Currently, when training these AI models (called LVLMs, for Large Vision Language Models), we feed them millions of examples. Some examples are easy (e.g., "What color is the sky?" -> "Blue," which you can guess without looking). Some are hard (e.g., "What color is the specific bird in the corner?" -> "Red," which you must look at to know).

The problem is that the AI treats all these examples the same. It learns to take "shortcuts." It realizes it can often get the right answer just by reading the question and guessing, without actually looking at the image. It becomes a "text-only" model pretending to be a "vision" model.

2. The Solution: The "Visual Information Gain" (VIG) Score

The authors created a new tool called Visual Information Gain (VIG). Think of VIG as a "Surprise Meter" or a "Need-to-Look" Score.

  • How it works: The system asks the AI two questions:

    1. "What is the answer if I don't show you the picture?" (The AI guesses based on text).
    2. "What is the answer if I do show you the picture?" (The AI looks at the image).
  • The Score:

    • If the AI's answer changes significantly and becomes much more accurate when it sees the picture, the VIG score is high. This means the picture was essential to get the right answer.
    • If the AI gives the same answer (or a confident wrong one) whether it sees the picture or not, the VIG score is low. This means the picture didn't help; the AI was just guessing based on words.
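The two-pass idea above can be sketched in a few lines. The paper defines VIG via perplexity; here is a minimal, illustrative version that scores VIG as the drop in log-perplexity of the correct answer when the image is provided. The function name and the toy log-probabilities are assumptions for illustration, not the paper's exact formulation:

```python
import math

def vig_score(logp_with_image, logp_text_only):
    """Visual Information Gain as a log-perplexity gap (sketch).

    Both arguments are per-token log-probabilities that the model assigns
    to the *correct* answer tokens. A high positive score means the image
    made the model much more confident in the right answer, i.e. the
    picture was essential; a score near zero means the text alone sufficed.
    """
    assert len(logp_with_image) == len(logp_text_only)
    n = len(logp_with_image)
    # Average negative log-likelihood per token = log-perplexity.
    nll_with = -sum(logp_with_image) / n
    nll_without = -sum(logp_text_only) / n
    # How much seeing the image reduced the model's perplexity (in nats).
    return nll_without - nll_with

# Toy example: the model is far more confident once it sees the image.
with_img = [math.log(0.9), math.log(0.8)]
without_img = [math.log(0.2), math.log(0.1)]
print(round(vig_score(with_img, without_img), 3))
```

In a real setup, the two log-probability lists would come from two forward passes of the same LVLM: one with the image tokens present and one with them removed or masked.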

3. The New Teaching Strategy: "Selective Training"

Instead of making Visionary study every single page of a textbook, the authors use the VIG score to create a customized study guide.

  • Filtering the Samples (The "What"): They throw away the "boring" examples where the AI didn't need to look at the picture to answer. They keep only the "juicy" examples where the picture was crucial.
  • Filtering the Words (The "How"): Even within a good picture, some words don't need the picture to be understood (like the word "the" or "and"). The system identifies the specific words that do need the picture (like "red," "left," "flying") and tells the AI: "Focus your brain power only on learning these specific words in relation to the image."
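Put together, the two filters above amount to ranking samples by VIG and masking low-VIG tokens out of the training loss. A minimal sketch, where the dictionary keys, the 70% keep ratio, and the token threshold are illustrative assumptions rather than the paper's exact hyperparameters:

```python
def select_training_signal(samples, sample_keep_ratio=0.7, token_threshold=0.5):
    """Keep the most visually informative samples, then mask out
    uninformative tokens within them (illustrative sketch).

    Each sample is a dict with a sample-level 'vig' score and a list of
    (token, token_vig) pairs under 'tokens'.
    """
    # Sample-level filtering: keep only the top fraction by VIG.
    ranked = sorted(samples, key=lambda s: s["vig"], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * sample_keep_ratio))]
    # Token-level filtering: compute a 0/1 loss mask so training
    # focuses only on tokens that actually needed the image.
    for s in kept:
        s["loss_mask"] = [1 if token_vig >= token_threshold else 0
                          for _, token_vig in s["tokens"]]
    return kept

# Toy usage: "red" and "left" need the image; "the" does not.
samples = [
    {"vig": 2.0, "tokens": [("red", 1.3), ("the", 0.0)]},
    {"vig": 0.1, "tokens": [("blue", 0.2)]},
    {"vig": 1.5, "tokens": [("left", 0.9)]},
]
kept = select_training_signal(samples)
print(len(kept), kept[0]["loss_mask"])
```

During fine-tuning, the `loss_mask` would simply zero out the cross-entropy terms for the masked tokens, so gradient updates flow only through the visually grounded words.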

4. The Result: A Smarter, Faster Learner

By using this method, the AI learns much faster and better.

  • Less is More: The paper shows that by training on only 70% of the data (and focusing on even fewer specific words), the AI actually performs better than if it had studied 100% of the data.
  • No More Hallucinations: The AI stops making things up. If it sees a picture of a dog, it stops saying "There is a cat in the background" just because it's used to seeing cats in stories. It learns to trust its eyes.
  • Better Attention: The AI starts actually "looking" at the image parts that matter, rather than staring blankly while its brain processes text.

The Big Picture Metaphor

Imagine you are teaching a child to identify fruits.

  • Old Way: You show them 1,000 flashcards. Some say "What is this?" with a picture of an apple, but the child just memorizes the word "Apple" without looking. You waste time on cards where the answer is obvious from the text alone.
  • New Way (VIG): You use a "Surprise Meter." You only keep the cards where the child had to look at the picture to get it right (e.g., distinguishing a green apple from a red one). You also tell the child, "Don't worry about the word 'a' or 'the'; focus only on the word 'red' and the shape."

In short: This paper teaches AI to stop guessing based on words and start actually seeing by identifying and prioritizing the moments where the image truly matters. It's like turning a model that "reads the room" into one that actually "sees the room."
