Enhancing Multi-Image Understanding through Delimiter Token Scaling

The Problem: The "Cafeteria Chaos"

Imagine you are a chef (the AI) trying to cook three different recipes at the same time. You have three separate recipe cards on your counter: one for a cake, one for soup, and one for a salad.

In a perfect world, you would read the cake card, cook the cake, put it aside, then read the soup card, and so on.

But in the current version of these AI models, it’s as if the recipe cards have all slid together into one messy pile. When you try to read the "cake" instructions, you accidentally pick up part of the "soup" instructions. You end up putting salt in the cake or whipped cream on the soup.

In technical terms, this is called Cross-Image Information Leakage. The AI looks at multiple images at once, but it gets confused about which details belong to which picture. It mixes them up, leading to wrong answers.

The Current Fix: The "Sticky Note" (That Doesn't Stick)

To stop this mess, engineers already put special "sticky notes" (called Delimiter Tokens) between the images.

Image 1 -> [STICKY NOTE] -> Image 2 -> [STICKY NOTE] -> Image 3

The idea is that the sticky note acts as a wall, telling the AI, "Stop! What you just read was Image 1. What comes next is Image 2."

However, the researchers in this paper found that these sticky notes are made of weak paper. The AI can still see through them. The "soup" instructions are still leaking into the "cake" section. The AI knows there is a wall, but it’s not strong enough to stop the confusion.

The Solution: The "Super-Strong Wall"

The researchers asked: What if we made those sticky notes super strong?

They discovered that these "sticky notes" (delimiter tokens) act like magnets or anchors for the AI's attention.

The Anchor Effect: When the AI looks at a specific image, it naturally looks at the sticky note next to it to say, "Okay, I'm in this zone."
The Scaling Trick: The researchers found a way to make these sticky notes "louder" or "heavier" without retraining the whole AI. They simply multiplied the hidden states of these tokens (the internal vectors the model computes for them) by a constant greater than one.

Think of it like making a conductor’s baton impossible to miss.

Before: The conductor is trying to lead three different orchestras (images) playing at the same time, but the baton (the delimiter token) is too easy to overlook. The musicians get confused and start playing each other's music.
After: The researchers make the baton glow brightly. Now the violinists (Image 1) clearly follow their own cue and stop looking at the drummers (Image 2). The separation becomes much clearer.

Why This is a Big Deal

It’s Free: Usually, to fix an AI, you have to feed it millions of new pictures and spend weeks training it (which costs a fortune in electricity and time). This method requires zero training. You just turn up the volume on the existing tokens.
It’s Instant: It doesn't slow the AI down. It works in real-time, just like the original.
It Works Beyond Images: It’s not just for pictures. The researchers showed it also helps when the AI reads multiple documents or multiple tables. It helps the AI keep different "files" separate in its mind.

The Result

By making these invisible "walls" between images stronger, the AI stops mixing up its facts.

Old AI: "Is there a man on a bike in both photos?" -> "Yes!" (Even though the bike is only in one).
New AI: "Is there a man on a bike in both photos?" -> "No, only in the second one."

In short, the researchers didn't build a smarter AI; they just gave the existing AI better glasses so it can clearly see where one picture ends and the next one begins.

1. Problem Statement

Large Vision-Language Models (LVLMs) perform well on single-image tasks but suffer significant performance degradation when processing multiple images simultaneously. The primary cause identified is cross-image information leakage, where the model fails to distinguish between different input images, leading to the intermixing of visual features and context in the generated output.

While existing LVLMs utilize special delimiter tokens (e.g., <|vision_start|>, <|vision_end|>) to separate image sequences, the authors' analysis reveals that these tokens are insufficient. They do not fully isolate visual contexts, allowing unwanted attention to flow between tokens of different images, which results in reasoning errors.
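To make the role of delimiter tokens concrete, the sketch below locates them in a flat token sequence and labels each position with the image block it belongs to. The token strings, the function names `delimiter_indices` and `image_ids`, and the toy sequence are assumptions made for illustration; real models work with tokenizer-specific integer IDs rather than strings.

```python
from typing import List

# Assumed delimiter strings, written Qwen-style; other models use different tokens.
VISION_START = "<|vision_start|>"
VISION_END = "<|vision_end|>"


def delimiter_indices(tokens: List[str]) -> List[int]:
    """Return the positions of all delimiter tokens in the sequence."""
    delimiters = {VISION_START, VISION_END}
    return [i for i, tok in enumerate(tokens) if tok in delimiters]


def image_ids(tokens: List[str]) -> List[int]:
    """Label each position with its 0-based image number, or -1 for text
    outside any image block. Delimiters are counted as part of their image."""
    ids = []
    current = -1
    inside = False
    for tok in tokens:
        if tok == VISION_START:
            current += 1
            inside = True
            ids.append(current)
        elif tok == VISION_END:
            ids.append(current)
            inside = False
        else:
            ids.append(current if inside else -1)
    return ids


# Toy sequence: one text token, two image blocks, one text token.
tokens = ["Compare", VISION_START, "img", "img", VISION_END,
          VISION_START, "img", VISION_END, "?"]

D = delimiter_indices(tokens)
ids = image_ids(tokens)
print(D)    # [1, 4, 5, 7]
print(ids)  # [-1, 0, 0, 0, 0, 1, 1, 1, -1]
```

The list `D` plays the role of the delimiter index set used by the scaling method, and `ids` is the kind of per-token image label needed to tell intra-image attention from cross-image attention.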

2. Methodology

The authors propose a training-free, inference-time method called Delimiter Token Scaling. The approach is based on a deep analysis of how delimiter tokens function within the attention mechanism of LVLMs.

Key Insights from Analysis

The authors identified two critical properties of delimiter tokens in multi-image settings:

Localized Attention Absorption: Delimiter tokens receive strong attention from tokens within their corresponding image, acting as a "sink" for that specific image block.
Image Tagging Effect: This strong attention creates a "tagging" mechanism. In the attention output for a query token $q$, $\mathrm{Attention}(q) = \sum_i p_{q,i} v_i$, the term contributed by the delimiter token $d$, $p_{q,d} v_d$ (abbreviated $p_d v_d$), acts as a shared additive bias for all tokens within the same image, reinforcing intra-image interactions.

However, the authors observed that the magnitude of this effect is often too weak to completely block cross-image leakage.
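The "shared additive bias" reading of the tagging effect can be checked numerically. The sketch below is a toy single-head attention in NumPy with random weights, no causal mask, and position 0 arbitrarily treated as the delimiter; all sizes and names are assumptions for illustration, not the authors' setup. It splits the attention output into the delimiter term and the remainder.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
n, dim = 5, 4                      # toy sizes: 5 tokens, 4-dim head
Q = rng.normal(size=(n, dim))
K = rng.normal(size=(n, dim))
V = rng.normal(size=(n, dim))
d = 0                              # assumed delimiter position

P = softmax(Q @ K.T / np.sqrt(dim))    # P[q, i] = p_{q,i}
out = P @ V                            # out[q] = sum_i p_{q,i} v_i

# Delimiter contribution: row q is p_{q,d} * v_d.
delim_term = np.outer(P[:, d], V[d])

# Contribution of every other token.
keep = np.ones(n, dtype=bool)
keep[d] = False
rest_term = P[:, keep] @ V[keep]

# The output is exactly the sum of the two parts.
print(np.allclose(out, delim_term + rest_term))
```

Every row of `delim_term` points in the same direction `V[d]` and differs only in its weight `P[q, d]`, so tokens that attend strongly to the same delimiter receive nearly the same added vector. That is the sense in which the term acts as a shared tag.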

The Proposed Solution: Hidden State Scaling

To amplify these properties without retraining, the authors propose scaling the hidden states of the delimiter tokens before they are projected into Query ($Q$), Key ($K$), and Value ($V$) vectors.

Mechanism: For a set of delimiter token indices $D$ and a scaling factor $\lambda > 1$, the hidden state $h_t^{(l)}$ of token $t$ at layer $l$ is modified as:
$h_t^{(l)*} = \begin{cases} \lambda \cdot h_t^{(l)} & \text{if } t \in D \\ h_t^{(l)} & \text{otherwise} \end{cases}$
Effect:
- Enhanced Property 1: Scaling increases the activation of delimiter tokens, causing them to attract even more attention from their corresponding image tokens (acting as stronger sinks).
- Enhanced Property 2: Because the Value vectors ($V$) of the delimiter tokens are also scaled, the "shared bias" term ($p_d v_d$) in the attention output becomes significantly larger. This reinforces intra-image interactions while the softmax normalization naturally suppresses attention to tokens from other images.
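The scaling rule above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name `scale_delimiters`, the delimiter positions, the value `lam = 2.0`, and the bias-free random projection matrices are all hypothetical.

```python
import numpy as np


def scale_delimiters(h, delimiter_idx, lam):
    """Return a copy of hidden states h (seq_len, dim) in which the rows
    listed in delimiter_idx are multiplied by lam; other rows are unchanged."""
    h_scaled = h.copy()
    h_scaled[delimiter_idx] *= lam
    return h_scaled


rng = np.random.default_rng(0)
seq_len, dim = 8, 16
h = rng.normal(size=(seq_len, dim))          # toy hidden states at one layer
W_q = rng.normal(size=(dim, dim))            # toy bias-free projections
W_k = rng.normal(size=(dim, dim))
W_v = rng.normal(size=(dim, dim))

D = [1, 4]     # assumed delimiter positions
lam = 2.0      # assumed scaling factor; the method only requires lam > 1

# Scale first, then project, so Q, K, and V of the delimiters are all affected.
h_star = scale_delimiters(h, D, lam)
Q, K, V = h_star @ W_q, h_star @ W_k, h_star @ W_v

# With bias-free linear projections, the delimiter values are exactly lam times larger.
print(np.allclose(V[D], lam * (h @ W_v)[D]))
```

Because the change is a single elementwise multiply applied before the projections, the attention computation itself is untouched. In this toy setting the delimiter rows of `Q`, `K`, and `V` scale exactly by `lam`; in a model whose projections include bias terms the relationship would only be approximate.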

3. Key Contributions

Novel Analysis: The paper provides the first detailed analysis of delimiter tokens in LVLMs, distinguishing their "localized sink" behavior from the global sink behavior seen in text-only LLMs.
Training-Free Efficiency: The method requires no additional training and introduces zero inference overhead (no increase in memory or latency). It is compatible with optimized attention kernels like FlashAttention.
Generalizability: The method is not limited to images; it is shown to improve performance on text-only multi-instance tasks (multi-document and multi-table understanding) where clear separation of input units is required.

4. Experimental Results

The method was evaluated across various benchmarks and model families (Qwen2.5-VL, InternVL3, LLaVA-OneVision).

Multi-Image Benchmarks:
- Mantis, MuirBench, MIRB, QBench2: Consistent performance improvements were observed across all models and sizes (from 0.5B to 78B parameters).
- Example: On the MuirBench benchmark, Qwen2.5-VL-3B improved from 37.31 to 42.42.
Text-Only Benchmarks:
- Multi-Document (MultiNews, WCEP-10) & Multi-Table (TQABench): The method improved ROUGE scores and accuracy, indicating that it also helps separate distinct text documents or tables.
Qualitative Analysis:
- Attention maps show clear "triangular" block patterns after scaling, indicating strict image boundaries.
- Cross-image interaction scores dropped by approximately 50%, while intra-image interactions were preserved.
Ablation Studies:
- Scaling only Query, Key, or Value individually yielded improvements, but scaling the full hidden state (affecting all three) was most effective.
- Replacing delimiter tokens with generic special tokens or scaling the first token (BOS) did not yield the same results, confirming the specific role of image delimiters.
Cost: Table 8 confirms that average and peak VRAM usage, as well as inference time, remained identical to the baseline.
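This summary does not define the cross-image interaction score mentioned in the qualitative analysis. One plausible way to quantify leakage, offered here only as an assumption and not necessarily the paper's metric, is the average share of attention that image tokens place on tokens of a different image. The function name `cross_image_mass` and the toy matrices are hypothetical.

```python
import numpy as np


def cross_image_mass(P, image_ids):
    """Mean attention mass that image tokens place on tokens of another image.

    P: (n, n) attention matrix whose rows sum to 1.
    image_ids: length-n labels, with -1 marking tokens outside any image.
    """
    ids = np.asarray(image_ids)
    is_img = ids >= 0
    # other[q, k] is True when q and k are image tokens from different images.
    other = is_img[:, None] & is_img[None, :] & (ids[:, None] != ids[None, :])
    per_query = (P * other).sum(axis=1)
    return per_query[is_img].mean()


image_ids = [0, 0, 1, 1]

# Uniform attention: half of every token's attention lands on the other image.
P_uniform = np.full((4, 4), 0.25)

# Block-diagonal attention: each image attends only to itself.
P_block = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
])

print(cross_image_mass(P_uniform, image_ids))  # 0.5
print(cross_image_mass(P_block, image_ids))    # 0.0
```

Under this definition, a drop in the score corresponds to attention maps moving from the uniform pattern toward the block pattern, which matches the sharper image boundaries described above.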

5. Significance

This work addresses a key bottleneck in multi-modal AI: the inability of current models to cleanly separate multiple inputs. By applying a simple scaling to the hidden states of delimiter tokens, the authors achieve consistent improvements in multi-image reasoning without the computational cost of fine-tuning or architectural changes. This offers a practical, "plug-and-play" option for deploying LVLMs in real-world scenarios involving complex, multi-source visual data.
