Imagine you have a very smart, helpful robot assistant (a Large Language Model) that knows a lot about the world. But sometimes, this robot makes things up or "hallucinates" because it doesn't have the right facts. To fix this, we give it a Library of Truth (a Knowledge Base) made of PDF documents, charts, and manuals. When you ask a question, the robot looks in this library, finds the right page, and reads it to give you a perfect answer. This is called Visual Document RAG (Retrieval-Augmented Generation).
Recently, researchers found a way to break this system using just one single picture.
Here is the story of their discovery, explained simply:
1. The Setup: The Robot and the Library
Think of the system like a librarian robot.
- The User: Asks a question (e.g., "How do I fix a leaky faucet?").
- The Librarian: Scans thousands of pages in the library to find the one page that matches your question.
- The Writer: Reads that page and writes the answer for you.
Usually, this works great. But the researchers asked: What if someone sneaks a fake, poisonous page into the library?
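The librarian-and-writer loop above can be sketched in a few lines of code. This is a toy illustration, not the paper's system: the embeddings are made-up three-number vectors, and the page names are hypothetical; a real system would embed query and page images with a neural encoder before comparing them.

```python
# Minimal sketch of the retrieve-then-generate loop: the "librarian"
# ranks pages by similarity to the question, then hands the best page
# to the "writer". Embeddings and page names here are illustrative.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Toy "library": each page image is stored with a precomputed embedding.
library = {
    "faucet_repair.pdf#p3": [0.9, 0.1, 0.0],
    "cake_recipes.pdf#p7":  [0.1, 0.8, 0.2],
}

def retrieve(query_embedding, k=1):
    """The librarian: return the k pages most similar to the query."""
    ranked = sorted(library.items(),
                    key=lambda kv: cosine(query_embedding, kv[1]),
                    reverse=True)
    return [page for page, _ in ranked[:k]]

# Pretend embedding of "How do I fix a leaky faucet?" — the writer
# would then read whichever page this returns.
query = [0.88, 0.15, 0.05]
print(retrieve(query))  # the faucet page should rank first
```

The attack in the next section works precisely because this ranking step trusts similarity scores: whatever page scores highest gets read, no questions asked.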
2. The Attack: The "Magic Trick" Image
The researchers showed that an attacker doesn't need to hack the whole library. They just need to slip one single, specially crafted image into the collection.
This isn't just a normal picture. It's a digital magic trick. To the naked eye, it might look like a harmless photo of a cat or a chart. But to the robot's "eyes" (its AI brain), it looks like something completely different.
The researchers created two types of "magic tricks":
A. The Targeted Trick (The "Whisper")
Imagine you want to spread a specific lie about a politician or a product.
- The Goal: You want the robot to give a wrong answer only when someone asks about that specific topic.
- How it works: The attacker creates an image that looks like a normal document to humans, but the robot's brain thinks, "Oh! This image is the perfect answer to the question 'Who is the mayor?'"
- The Result: When you ask about the mayor, the robot grabs this fake image and reads it. Because the image is "poisoned," the robot then says something false, like "The mayor is an alien." But if you ask about something else, like "How to bake a cake," the robot ignores the fake image and works normally.
B. The Universal Trick (The "Silence")
Imagine you want to shut the robot down completely.
- The Goal: You want the robot to fail at answering any question.
- How it works: The attacker creates an image that the robot thinks is the answer to everything. It's like a universal key that fits every lock.
- The Result: No matter what you ask, the robot grabs this fake image and says, "I will not reply to you!" or gives a nonsense answer. It's a Denial of Service attack—the robot is so confused by this one image that it stops working for everyone.
3. How Did They Do It? (The Recipe)
The researchers used a clever mathematical recipe (called MO-PGD, a multi-objective variant of Projected Gradient Descent) to bake this "poisoned" image.
- They started with a normal image.
- They made tiny, invisible changes to the pixels (like adding a few grains of salt to a soup: you can't taste any single grain, but the flavor shifts).
- They did this until the image satisfied two conditions:
- Retrieval: The robot's search engine must pick this image as the best match for the question.
- Generation: The robot's writing engine must read this image and produce the specific lie or silence they wanted.
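The recipe above can be sketched as a toy optimization loop. This is not the paper's MO-PGD implementation: real attacks backpropagate through the actual retriever and generator, while here both objectives are simple quadratic losses so the gradient can be written by hand. All the names (`retrieval_target`, `generation_target`, `EPS`) are illustrative.

```python
# Toy sketch of multi-objective projected gradient descent:
# nudge the "image" toward two targets at once (retrieval + generation),
# while projecting each change back into a small EPS-ball around the
# original so the edit stays visually invisible.

EPS = 0.1    # maximum allowed change per "pixel"
LR = 0.05    # step size
STEPS = 200

original = [0.5, 0.5, 0.5]           # the clean starting "image"
retrieval_target = [0.9, 0.2, 0.1]   # embedding the search engine should match
generation_target = [0.8, 0.3, 0.0]  # features that steer the writer's output

def pgd_step(x):
    new = []
    for xi, oi, ri, gi in zip(x, original, retrieval_target, generation_target):
        # Gradient of 0.5*(xi - ri)**2 + 0.5*(xi - gi)**2 w.r.t. xi:
        grad = (xi - ri) + (xi - gi)
        xi = xi - LR * grad
        # Projection: clamp back into [oi - EPS, oi + EPS].
        xi = max(oi - EPS, min(oi + EPS, xi))
        new.append(xi)
    return new

x = list(original)
for _ in range(STEPS):
    x = pgd_step(x)
# x now presses against the EPS boundary, pulled toward both targets at once
```

The two losses correspond to the two conditions above: one pulls the image toward being retrieved, the other toward producing the attacker's chosen output, and the projection step is what keeps the changes imperceptible.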
4. The Results: Who is Safe?
The researchers tested this on different types of robots and libraries:
- Older Robots: Some older AI models (like the famous CLIP) were very easy to trick. The "magic image" worked perfectly, and the robot fell for it every time.
- Newer Robots: The newest, smartest models (like ColPali and GME) were much harder to fool. They were like a librarian who double-checks the books. They often realized, "Wait, this image doesn't actually belong here," and ignored it.
- The "Black Box" Problem: If the attacker doesn't know exactly which robot they are attacking (a "black-box" attack), it's much harder to make the magic trick work. The researchers found that while they could trick the system when they had full access to the robot's brain (a "white-box" attack), they struggled when they had to guess.
5. Can We Stop It? (The Defenses)
The researchers tried common safety measures to see if they could stop the attack:
- Reading More Books: They told the robot to retrieve the top 5 pages instead of just the top 1, hoping the fake page would get lost in the crowd. Result: The attacker just made the fake page even stronger, and it still won.
- Asking a Judge: They asked a second robot to check if the answer made sense. Result: The attacker figured out how to fool the second robot too.
- Rewording the Question: They tried changing how users asked questions. Result: The attack still worked.
The Big Takeaway
This paper is a wake-up call. It shows that Visual Document RAG systems are vulnerable. Just like a physical library can be sabotaged by swapping out one book, a digital library can be poisoned by injecting one image.
While the newest AI models are tougher, the fact that a single image can cause a robot to lie or shut down means we need to build better "security guards" for these libraries before we trust them with important tasks like medical advice or legal documents.
In short: One bad apple (or in this case, one bad picture) can spoil the whole bunch, and we need to figure out how to spot that bad picture before it tricks our smartest robots.