Imagine you have a super-smart robot assistant that is great at looking at one picture and telling you what's happening. You ask, "What is in this photo?" and it answers perfectly.
But now, you show the robot two photos side-by-side and ask, "How are these two pictures different?" or "Do these two people know each other?"
Suddenly, the robot starts to hallucinate. It might say, "They are definitely friends!" even though they are in completely different countries, or it might invent details that aren't in either picture. It's like the robot is guessing based on what it thinks should happen, rather than actually looking at the evidence in front of it.
This paper introduces a new training method called CAPL (Cross-Image Attention Calibration and Preference Learning) to fix this. Here is how it works, using simple analogies:
1. The Problem: The "One-Way Street" Traffic Jam
Currently, most AI models look at multiple images like a line of people waiting in a single-file queue.
- Image 1 is at the front.
- Image 2 is behind it.
- Image 3 is behind Image 2.
The rule is: Image 2 can look back at Image 1, but Image 1 cannot look forward at Image 2.
This creates a "one-way street" of information. When the robot tries to compare the two, it's like trying to compare two people when one of them is blindfolded and can only see what's behind them. The robot gets confused, relies on its own guesses (language habits), and makes up facts.
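The "one-way street" is the standard causal attention mask used by autoregressive models. Here is a minimal sketch (the token positions assigned to each image are made up for illustration):

```python
# Minimal sketch of a standard causal attention mask.
# Assumption (illustrative only): image 1 occupies token positions 0-2,
# image 2 occupies positions 3-5.

def causal_mask(n_tokens):
    """mask[q][k] is True when query token q may attend to key token k."""
    return [[k <= q for k in range(n_tokens)] for q in range(n_tokens)]

mask = causal_mask(6)
image1, image2 = range(0, 3), range(3, 6)

# Image 2 tokens can look back at every image 1 token...
assert all(mask[q][k] for q in image2 for k in image1)
# ...but no image 1 token can ever see image 2: the "one-way street".
assert not any(mask[q][k] for q in image1 for k in image2)
```

The second assertion is the whole problem: when the model generates a comparison, Image 1's representation was computed blind to Image 2.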
2. The Solution Part 1: Opening the "Two-Way Street" (Attention Calibration)
The first part of CAPL fixes the traffic flow.
- The Fix: They open the street in both directions. Now, Image 1 can look at Image 2, and Image 2 can look at Image 1.
- The "Key Token" Filter: However, if they look at every single pixel of both images, it gets too noisy and confusing (like trying to listen to a whole stadium shouting at once).
- The Analogy: Imagine you are at a party with two groups of people. Instead of listening to everyone, you only pay attention to the loudest, most important speakers (the "key tokens") in each group. CAPL teaches the robot to focus only on these important details when comparing the images, ignoring the background noise.
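In code, the idea is to unmask attention in both directions, but only between each image's most salient tokens. This sketch is illustrative: the function names and the saliency scores are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of "key token" attention calibration: open two-way
# attention only between the most salient tokens of each image.
# Names and saliency values here are illustrative, not the paper's method.

def top_k_indices(saliency, k):
    """Indices of the k highest-saliency tokens."""
    return sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)[:k]

def calibrate_mask(mask, saliency1, saliency2, offset, k=2):
    """Unmask attention in BOTH directions between key tokens of two images.
    `offset` is the position where image 2's tokens start in the sequence."""
    keys1 = top_k_indices(saliency1, k)                        # key tokens, image 1
    keys2 = [offset + i for i in top_k_indices(saliency2, k)]  # key tokens, image 2
    for a in keys1:
        for b in keys2:
            mask[a][b] = True   # image 1 key tokens may now see image 2
            mask[b][a] = True   # already allowed by causality; kept explicit
    return mask

n = 6
mask = [[k <= q for k in range(n)] for q in range(n)]  # standard causal mask
saliency1 = [0.9, 0.1, 0.4]   # made-up scores for image 1 tokens (positions 0-2)
saliency2 = [0.2, 0.8, 0.5]   # made-up scores for image 2 tokens (positions 3-5)
mask = calibrate_mask(mask, saliency1, saliency2, offset=3)

# A key token of image 1 can now attend forward to a key token of image 2.
assert mask[0][4]
# A non-key image 1 token still cannot see image 2: the noise stays filtered out.
assert not mask[1][3]
```

Restricting the new two-way edges to key tokens is what keeps the "stadium noise" out: most of the mask stays causal, and only the loudest speakers from each group get connected.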
3. The Solution Part 2: The "Good Cop, Bad Cop" Training (Preference Learning)
Just fixing the traffic isn't enough; the robot needs to learn to use the new two-way street. The authors use a clever training trick called DPO (Direct Preference Optimization).
Think of this as training a student with two types of exams:
The "Good Cop" (Positive Sample):
The robot is shown the two images with the two-way street open. It sees all the connections, finds the real differences, and gives the correct answer. This is the "good" answer we want.
The "Bad Cop" (Negative Sample):
Here is the creative part. To teach the robot what not to do, they deliberately break the connection between the images for this specific test. They force the robot to look at the images as if they are totally separate, with no way to compare them.
- Because the robot can't compare them, it gets confused and starts guessing wildly based on its old habits. It gives a "hallucinated" answer (e.g., "They are definitely friends!").
- The Lesson: The robot is then shown the "Good Answer" and the "Bad Answer" side-by-side. It is told: "You were wrong when you couldn't compare the images. You were right when you could. Next time, always compare them!"
By repeatedly showing the robot the difference between "guessing blindly" and "looking carefully at both," it learns to rely on the visual evidence rather than its imagination.
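The "show both answers side-by-side" lesson is the standard DPO objective: push the model to prefer the careful answer over the hallucinated one, relative to a frozen reference model. The log-probabilities below are toy numbers; in practice they come from scoring the answer produced with two-way attention on (winner) and with cross-image attention broken (loser).

```python
# Hedged sketch of the DPO loss for one preference pair.
# Toy log-probabilities; real values come from the model and a frozen
# reference copy scoring the "careful" vs. "hallucinated" answers.
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Direct Preference Optimization loss for a single preference pair."""
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# Toy values: the policy already prefers the careful answer a bit more than
# the reference model does, so the loss dips below log(2) ≈ 0.693 (the
# value at zero margin) and shrinks further as the preference strengthens.
loss = dpo_loss(logp_win=-2.0, logp_lose=-5.0, ref_logp_win=-3.0, ref_logp_lose=-4.0)
assert loss < math.log(2)
```

Minimizing this loss widens the gap between the two answers' likelihoods, which is exactly the "always choose the careful path" lesson in the analogy.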
4. The Result: A Better Detective
After this training, the robot becomes a much better detective.
- Multi-Image Tasks: It stops making up stories when comparing photos. It actually looks at the evidence.
- Single-Image Tasks: Surprisingly, it doesn't get worse at looking at just one photo. In fact, because it learned to be more careful and precise, it sometimes gets better at single photos too.
Summary
The paper is about teaching AI to stop guessing when looking at multiple pictures.
- Fix the view: Let the images "talk" to each other (Two-way street).
- Focus the view: Only listen to the important parts (Key tokens).
- Train the brain: Show the AI the difference between a lazy guess (Bad Cop) and a careful observation (Good Cop) so it learns to always choose the careful path.
The result is an AI that is less likely to lie to you and more likely to tell you the truth about what it sees.