Imagine you have a very smart, well-read friend who loves looking at pictures. This friend is great at describing what they see, but they have a funny quirk: they sometimes get the relationships between things wrong.
For example, if you show them a photo of a man sitting on a surfboard, they might confidently say, "Yes, the man is standing on the surfboard!" They see the man, they see the board, but they mix up the action connecting them. In the world of AI, this is called a "relation hallucination."
The paper you shared introduces a new method called ChainMPQ to fix this. Think of ChainMPQ not as a magic spell, but as a detective's checklist that forces the AI to slow down and think step-by-step, just like a human would.
Here is how ChainMPQ works, broken down into simple analogies:
1. The Problem: The "Jumping to Conclusions" AI
Current AI models are like students who are great at memorizing facts but bad at paying attention to details. If they see a man and a surfboard, their brain immediately screams, "Surfing!" because that's what usually happens. They skip the part where they actually look at the specific pose in the photo. They rely on "language priors" (what they expect to happen) rather than "visual evidence" (what is actually there).
2. The Solution: ChainMPQ (The "Interrogation" Method)
ChainMPQ stops the AI from guessing the final answer immediately. Instead, it forces the AI to go through a three-step detective process before it's allowed to give the final verdict.
Step A: The "Spotlight" (Text-Guided Attention)
Imagine the AI is in a dark room looking at a messy photo. ChainMPQ shines a flashlight specifically on the important characters.
- If the question is about a "man" and a "surfboard," ChainMPQ tells the AI: "Hey, ignore the ocean and the sky for a second. Focus your eyes ONLY on the man and the board."
- This ensures the AI doesn't get distracted by the background.
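To make the "spotlight" idea concrete, here is a minimal sketch in Python. It only illustrates the prompt-side intuition: pull the subject and object out of a relation question, then build an instruction telling the model what to focus on. The function names (`extract_entities`, `spotlight_prompt`) and the naive regex are my own illustrative assumptions; the actual ChainMPQ method guides the model's internal attention, which a toy script cannot reproduce.

```python
import re

# Hypothetical sketch: pull the subject and object out of a simple
# yes/no relation question such as "Is the man standing on the surfboard?"
def extract_entities(question: str) -> tuple[str, str]:
    # Very naive pattern; a real system would use proper parsing.
    m = re.match(r"Is the (\w+) \w+ on the (\w+)\?", question)
    if not m:
        raise ValueError("unrecognized question form")
    return m.group(1), m.group(2)

# Build a "spotlight" instruction that narrows the model's focus
# to the two entities before the relation question is asked.
def spotlight_prompt(question: str) -> str:
    subject, obj = extract_entities(question)
    return (f"Focus only on the {subject} and the {obj}; "
            f"ignore the background. {question}")
```

For example, `spotlight_prompt("Is the man standing on the surfboard?")` yields a prompt that explicitly tells the model to ignore the ocean and sky before answering.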
Step B: The "Five-Question Interrogation" (Multi-Perspective Questions)
Instead of asking the AI, "Is the man standing on the board?" (which is too easy to guess), ChainMPQ breaks the question down into five smaller, simpler questions. It's like a lawyer cross-examining a witness:
1. Where is the man? (Locate the subject.)
2. Where is the board? (Locate the object.)
3. What is the man doing? (Ignore the board; just look at the man.)
4. What is happening to the board? (Ignore the man; just look at the board.)
5. What is the relationship between them? (Now combine the answers from questions 1–4.)
By answering these one by one, the AI is forced to build a logical story. It can't just guess "standing" because it has to first admit, "The man is sitting," in question #3.
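The five-question decomposition can be sketched as a simple template function. The exact wording the paper uses may differ; these templates just mirror the order described above (subject, object, subject's action, object's state, relation), and the function name is my own.

```python
# Hypothetical sketch of the five-question decomposition.
# Given the subject and object of a relation question, produce the
# progressive sub-questions in the order: locate subject, locate
# object, subject's action, object's state, and finally the relation.
def multi_perspective_questions(subject: str, obj: str) -> list[str]:
    return [
        f"Where is the {subject} in the image?",
        f"Where is the {obj} in the image?",
        f"What is the {subject} doing?",
        f"What is happening to the {obj}?",
        f"What is the relationship between the {subject} and the {obj}?",
    ]
```

Calling `multi_perspective_questions("man", "surfboard")` produces the five sub-questions from the example above, ready to be asked one at a time.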
Step C: The "Memory Chain" (Interleaved Reasoning)
This is the secret sauce. When the AI answers question #3, it doesn't just forget that answer. It writes it down in a notebook and keeps the notebook open while answering questions #4 and #5.
- It also keeps a visual map of where it looked for the previous answers.
- So, when it finally answers the big question ("Is he standing?"), it looks at its notebook: "Wait, I already wrote down in step #3 that he is sitting. Therefore, he cannot be standing."
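The "memory chain" amounts to prepending every earlier question-and-answer pair to the next prompt, so the final verdict has to stay consistent with what the model already committed to. Here is a minimal sketch under that assumption; `ask_model` is a hypothetical stand-in for a real vision-language model call, and the plain-text `Q:/A:` format is illustrative, not the paper's actual prompt layout.

```python
# Hypothetical sketch of interleaved reasoning: each sub-question is
# asked together with all earlier (question, answer) pairs, so later
# answers are grounded in earlier ones.
def interleaved_chain(questions, ask_model):
    memory = []  # the open "notebook" of earlier Q&A pairs
    for q in questions:
        # Prepend everything answered so far to the new question.
        context = "\n".join(f"Q: {pq}\nA: {pa}" for pq, pa in memory)
        prompt = (context + "\n" if context else "") + f"Q: {q}\nA:"
        answer = ask_model(prompt)
        memory.append((q, answer))  # remember the answer for later steps
    return memory
```

By the time the final relation question is asked, the prompt already contains "Q: What is the man doing? A: He is sitting", which makes it much harder for the model to blurt out "standing."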
3. The Result: A More Honest AI
In the paper's examples:
- Without ChainMPQ: The AI sees a man on a board and says, "Yes, he is standing!" (Hallucination).
- With ChainMPQ: The AI goes through the checklist. It realizes the man is actually sitting or kneeling. It corrects itself and says, "No, he is riding/sitting."
Why is this cool?
- No Retraining: You don't need to teach the AI a new language or feed it millions of new pictures. You just change how you ask the questions. It's like giving a smart student a better study guide rather than making them go to a different school.
- Works Everywhere: It works on different types of AI models, just like a good checklist works for any detective.
- Human-Like: It mimics how humans think. We don't usually guess the whole story at once; we look at the pieces, figure out the parts, and then put the puzzle together.
In a nutshell: ChainMPQ stops the AI from being a "fast guesser" and turns it into a "slow thinker," ensuring that when it describes a relationship in a picture, it's actually looking at the picture, not just guessing based on what it thinks should be there.