Enhancing Multi-Image Understanding through Delimiter Token Scaling

This paper proposes a training-free method that scales the hidden states of delimiter tokens in Large Vision-Language Models to effectively mitigate cross-image information leakage, thereby significantly improving performance on multi-image and multi-document understanding benchmarks without incurring additional computational costs.

Minyoung Lee, Yeji Park, Dongjun Hwang, Yejin Kim, Seong Joon Oh, Junsuk Choe

Published 2026-02-26

The Problem: The "Kitchen Chaos"

Imagine you are a chef (the AI) trying to cook three different recipes at the same time. You have three separate recipe cards on your counter: one for a cake, one for soup, and one for a salad.

In a perfect world, you would read the cake card, cook the cake, put it aside, then read the soup card, and so on.

But in current AI models, it’s as if the recipe cards have all slid together into one giant, messy pile. When you try to read the "cake" instructions, your brain accidentally grabs a piece of the "soup" instructions. You end up putting salt in the cake or whipped cream on the soup.

In technical terms, this is called Cross-Image Information Leakage. The AI looks at multiple images at once, but it gets confused about which details belong to which picture. It mixes them up, leading to wrong answers.
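One rough way to put a number on this mixing is to ask how much attention the tokens of one image pay to the tokens of the other images. The sketch below does that on a toy attention matrix. It is only an illustration: the function name, the `segment_ids` convention, and the metric itself are assumptions made here, not the measurement used in the paper.

```python
import numpy as np

def cross_image_attention_fraction(attn, segment_ids):
    """Average share of attention that image tokens give to tokens of OTHER images.

    attn: (n, n) array, row i holds the attention weights of query token i
    segment_ids: length-n sequence, image index per token, -1 for non-image tokens
    (Hypothetical metric for illustration only.)
    """
    attn = np.asarray(attn, dtype=float)
    segment_ids = np.asarray(segment_ids)
    is_image = segment_ids >= 0
    # other[i, j] is True when tokens i and j are image tokens from different images
    different = segment_ids[:, None] != segment_ids[None, :]
    other = is_image[:, None] & is_image[None, :] & different
    leaked_per_token = (attn * other).sum(axis=1)
    return float(leaked_per_token[is_image].mean())

# Toy case: 4 tokens, two per image, attention spread evenly over all tokens.
uniform_attn = np.full((4, 4), 0.25)
leak_uniform = cross_image_attention_fraction(uniform_attn, [0, 0, 1, 1])

# Toy case: every token attends only to itself, so nothing crosses images.
leak_none = cross_image_attention_fraction(np.eye(4), [0, 0, 1, 1])
```

In the uniform case half of each token's attention lands on the other image, while in the self-attending case none does. A lower value means the images are being kept more separate.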

The Current Fix: The "Sticky Note" (That Doesn't Stick)

To stop this mess, engineers already put special "sticky notes" (called Delimiter Tokens) between the images.

  • Image 1 -> [STICKY NOTE] -> Image 2 -> [STICKY NOTE] -> Image 3

The idea is that the sticky note acts as a wall, telling the AI, "Stop! What you just read was Image 1. What comes next is Image 2."
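The layout above can be sketched in a few lines. This is a minimal illustration, assuming each image has already been turned into a list of tokens. The `<delim>` string and the function name are placeholders invented here; real models use their own special tokens.

```python
from typing import List, Tuple

DELIM = "<delim>"  # hypothetical placeholder for a model's delimiter token

def interleave_with_delimiters(
    image_tokens: List[List[str]],
) -> Tuple[List[str], List[int]]:
    """Join per-image token lists, placing DELIM between consecutive images.

    Returns the flat sequence and the positions of the inserted delimiters.
    """
    sequence: List[str] = []
    delimiter_positions: List[int] = []
    for i, tokens in enumerate(image_tokens):
        if i > 0:
            # Record where the delimiter lands before appending it.
            delimiter_positions.append(len(sequence))
            sequence.append(DELIM)
        sequence.extend(tokens)
    return sequence, delimiter_positions

# Three toy "images" with 2, 3, and 1 tokens.
images = [["a1", "a2"], ["b1", "b2", "b3"], ["c1"]]
sequence, delimiter_positions = interleave_with_delimiters(images)
```

Keeping track of `delimiter_positions` matters because any later step that wants to treat the sticky notes differently needs to know exactly where they sit in the sequence.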

However, the researchers in this paper found that these sticky notes are made of weak paper. The AI can still see through them. The "soup" instructions are still leaking into the "cake" section. The AI knows there is a wall, but it’s not strong enough to stop the confusion.

The Solution: The "Super-Strong Wall"

The researchers asked: What if we made those sticky notes super strong?

They discovered that these "sticky notes" (delimiter tokens) act like magnets or anchors for the AI's attention.

  1. The Anchor Effect: When the AI looks at a specific image, it naturally looks at the sticky note next to it to say, "Okay, I'm in this zone."
  2. The Scaling Trick: The researchers found a way to make these sticky notes "louder" or "heavier" without retraining the whole AI. They simply scaled up (multiplied) the hidden states of these tokens, the internal vectors the model uses to represent them.
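The scaling trick itself is simple enough to sketch with NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the function name and the scale value of 2.0 are made up here, and this summary does not say which layers the scaling is applied to or how large the factor should be.

```python
import numpy as np

def scale_delimiter_states(hidden_states, delimiter_positions, scale):
    """Return a copy of hidden_states with the delimiter rows multiplied by scale.

    hidden_states: array of shape (sequence_length, hidden_dim)
    delimiter_positions: indices of the delimiter tokens in the sequence
    scale: multiplier (hypothetical value; the summary does not state one)
    """
    scaled = np.array(hidden_states, dtype=float, copy=True)
    # Only the delimiter rows change; image and text tokens are left as they are.
    scaled[list(delimiter_positions)] *= scale
    return scaled

# Toy sequence: 8 tokens, 4 dimensions each, delimiters at positions 2 and 6.
hidden_states = np.ones((8, 4))
scaled_states = scale_delimiter_states(hidden_states, [2, 6], scale=2.0)
```

Nothing is retrained here. The model's weights stay untouched, and the only change is one multiplication on a handful of token vectors, which is why the approach can be described as training-free.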

Think of it like making a conductor’s baton impossible to miss.

  • Before: A conductor is leading three different orchestras (the images) playing at the same time, but the baton (the delimiter token) is too easy to overlook. The musicians get confused and start playing each other's music.
  • After: The researchers make the baton glow. Now the violinists (Image 1) clearly follow their cue and stop watching the drummers (Image 2). The separation becomes much clearer.

Why This is a Big Deal

  1. It’s Free: Usually, to fix an AI, you have to feed it millions of new pictures and spend weeks training it (which costs a fortune in electricity and time). This method requires zero training. You just turn up the volume on the existing tokens.
  2. It’s Instant: It doesn't slow the AI down. The paper reports no additional computational cost, so the model runs as fast as the original.
  3. It Works Beyond Images: It’s not just for pictures. The researchers showed it also helps when reading multiple documents or looking at multiple spreadsheets. It helps the AI keep different "files" separate in its mind.

The Result

By making these invisible "walls" between images stronger, the AI stops mixing up its facts.

  • Old AI: "Is there a man on a bike in both photos?" -> "Yes!" (Even though the bike is only in one).
  • New AI: "Is there a man on a bike in both photos?" -> "No, only in the second one."

In short, the researchers didn't build a smarter AI; they just gave the existing AI better glasses so it can clearly see where one picture ends and the next one begins.
