Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction

This paper proposes Semantic-Augmented Dynamic Contrastive Attack (SADCA), a novel method that enhances the transferability of adversarial attacks on vision-language models by employing progressive dynamic contrastive interactions to disrupt cross-modal alignment and a semantic augmentation module to increase example diversity.

Yuanbo Li, Tianyang Xu, Cong Hu, Tao Zhou, Xiao-Jun Wu, Josef Kittler

Published 2026-03-06

Imagine you have a very smart, multilingual robot friend who is great at matching pictures with words. If you show it a photo of a cat, it instantly says, "That's a cat!" If you show it a picture of a beach, it says, "Beach day!" This robot is built on something called a Vision-Language Pre-trained (VLP) model. It's the brain behind many AI tools we use today.

However, just like a human can be tricked by an optical illusion, this robot can be tricked by "adversarial examples"—tiny, almost invisible changes to an image or a sentence that make the robot see something completely wrong.

The problem is that most current tricks only work on one specific robot. If you trick Robot A, Robot B (which was built slightly differently) might still see the cat correctly. This paper introduces a new, super-powerful trick called SADCA that works on almost any robot, no matter how it was built.

Here is how SADCA works, explained through simple analogies:

1. The Problem: The "Static" Trick

Imagine you are trying to confuse a security guard (the AI) who is checking if a photo matches a description.

  • Old Method: You show the guard a photo of a dog and whisper, "That's a cat." You do this once, in a fixed way. The guard might get confused, but if you try the same trick on a different guard with a different training style, they might just laugh it off.
  • The Flaw: The old tricks are "static." They only look at the correct pair (Dog + "Dog") and try to push them apart in one straight line. They don't explore other possibilities.

2. The Solution: SADCA (The "Dynamic" Trick)

The authors created SADCA (Semantic-Augmented Dynamic Contrastive Attack). Think of it as a master of disguise who uses three clever strategies to confuse any guard.

Strategy A: The "Dynamic Dance" (Dynamic Contrastive Interaction)

Instead of just pushing the "Dog" photo away from the word "Dog" once, SADCA makes them dance around each other.

  • The Analogy: Imagine you are trying to make two magnets repel each other. Instead of pushing them apart once along a single line, you keep rotating and repositioning them, pushing from a different angle at every step.
  • How it works: SADCA keeps updating both the image and the text while attacking. It asks, "If I change the image a tiny bit, how does the text react? If I change the text, how does the image react?" This back-and-forth dance finds a "sweet spot" of confusion that works on almost any AI model, not just the one the attack was crafted against.
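The back-and-forth update above can be sketched in a few lines. This is a toy, embedding-space illustration only: real attacks perturb pixels and tokens through the model, and the function names, step size `alpha`, and budget `eps` here are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def alternating_contrastive_attack(img_emb, txt_emb, eps=0.1, steps=10, alpha=0.02):
    """Toy sketch: alternately perturb the image and text embeddings so the
    matched pair is pushed apart. Each side reacts to the other's latest
    perturbation, giving the "dynamic dance" described above."""
    d_img = np.zeros_like(img_emb)
    d_txt = np.zeros_like(txt_emb)
    for _ in range(steps):
        # Image step: gradient of dot(img + d_img, txt + d_txt) w.r.t. d_img
        grad_img = txt_emb + d_txt
        d_img = np.clip(d_img - alpha * np.sign(grad_img), -eps, eps)
        # Text step reacts to the *updated* image perturbation
        grad_txt = img_emb + d_img
        d_txt = np.clip(d_txt - alpha * np.sign(grad_txt), -eps, eps)
    return img_emb + d_img, txt_emb + d_txt

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

After the loop, the perturbed pair should score lower similarity than the clean pair, which is exactly the "push the magnets apart" effect.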

Strategy B: The "Negative Reinforcement" (Using the Wrong Answers)

Most old tricks only focus on the correct answer (e.g., "Don't let the AI think this is a dog"). SADCA also actively uses wrong answers.

  • The Analogy: Imagine you are teaching a child to identify a cat.
    • Old way: You say, "This is a cat. Don't call it a dog."
    • SADCA way: You say, "This is a cat. AND make sure it doesn't look like a dog, a car, or a toaster."
  • How it works: SADCA grabs random, unrelated images and texts (like a picture of a toaster and the word "car") and nudges the adversarial "Dog" photo until the AI scores it as closer to "toaster" than to "dog." By pulling the image toward the wrong answers while pushing it away from the right one, it creates a much stronger, more confusing signal that breaks the AI's matching logic.
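The push-and-pull idea above can be written as an InfoNCE-style score that the attacker drives down: low similarity to the true caption, high similarity to the unrelated ones. The function below is a hypothetical sketch; the temperature `tau` and the names are assumptions, not taken from the paper.

```python
import numpy as np

def attack_objective(adv_img, true_txt, negative_txts, tau=0.07):
    """InfoNCE-style score of an adversarial image embedding.
    A clean model scores HIGH here (image matches its true caption);
    the attacker perturbs the image to make this score as LOW as possible,
    i.e. closer to the negatives than to the true caption."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    pos = np.exp(cos(adv_img, true_txt) / tau)
    negs = np.array([np.exp(cos(adv_img, n) / tau) for n in negative_txts])
    return float(np.log(pos / (pos + negs.sum())))
```

An image aligned with its true caption yields a near-zero score, while one dragged toward a negative caption yields a strongly negative one, so minimizing this objective realizes the "pull toward wrong answers" strategy.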

Strategy C: The "Semantic Augmentation" (The Chameleon Effect)

This is about making the trick look different every time so the AI can't memorize the pattern.

  • The Analogy: Imagine you are trying to sneak a message past a guard.
    • Old way: You wear the same disguise every time. The guard learns, "Ah, that's the guy in the red hat."
    • SADCA way: You wear a red hat today, a blue hat tomorrow, and a green hat the next. You also change your walk and your voice.
  • How it works: SADCA takes the image and crops it, flips it, or brightens it slightly. For text, it mixes and matches sentences. This creates a "chameleon" effect. The AI sees so many different versions of the same "trick" that it can't learn to ignore it. It forces the AI to fail on the concept of the image, not just a specific pixel pattern.
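A minimal sketch of the chameleon idea, assuming simple flip/brightness/crop transforms and gradient averaging over augmented copies; the specific transforms and function names here are illustrative, not the paper's exact augmentation recipe.

```python
import numpy as np

def augment_image(img, rng):
    """Toy semantic augmentations on a 2-D grayscale image in [0, 1]:
    random horizontal flip, brightness jitter, and a shifted crop
    padded back to the original size."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                                   # horizontal flip
    img = np.clip(img * rng.uniform(0.9, 1.1), 0.0, 1.0)     # brightness jitter
    h, w = img.shape
    dy, dx = rng.integers(0, 4, size=2)
    crop = img[dy:h - (3 - dy), dx:w - (3 - dx)]             # shifted crop
    out = np.zeros_like(img)
    out[:crop.shape[0], :crop.shape[1]] = crop
    return out

def averaged_gradient(img, grad_fn, n_copies=8, seed=0):
    """Attack the *concept*, not one pixel pattern: estimate the update
    direction on several augmented copies and average them, so the
    perturbation cannot latch onto a single fixed view of the image."""
    rng = np.random.default_rng(seed)
    grads = [grad_fn(augment_image(img, rng)) for _ in range(n_copies)]
    return np.mean(grads, axis=0)
```

In a real attack `grad_fn` would backpropagate the contrastive loss through the model; averaging over augmented views is what makes the perturbation survive changes the defender (or a different model) might introduce.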

The Result: A Universal Key

The paper tested SADCA on many different AI models (some built by Google, some by Microsoft, some open-source).

  • The Outcome: While other methods failed when switching from one AI to another, SADCA worked like a universal key. It successfully confused the AI 80-90% of the time, regardless of which model was being attacked.

Why Does This Matter?

You might ask, "Why do we want to trick AI?"
It sounds like a bad thing, but it's actually crucial for safety.

  • The Locksmith Analogy: You can't know if a lock is secure unless you try to pick it. By creating the best "pick" (SADCA), researchers can find the weak spots in AI systems before bad actors do.
  • The Goal: This helps engineers build stronger, more robust AI that can't be easily fooled by hackers or malicious users.

In summary: SADCA is a new, highly adaptable method for confusing AI vision systems. Instead of using a single, rigid trick, it uses a dynamic dance, learns from "wrong" answers, and constantly changes its disguise to ensure it works on any AI model it encounters.