V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs

This paper introduces V-Attack, a novel adversarial attack method for Large Vision-Language Models that achieves precise local semantic manipulation by targeting disentangled value features within transformer attention blocks, thereby overcoming the controllability limitations of existing approaches that rely on entangled patch-token representations.

Sen Nie, Jie Zhang, Jianxin Yan, Shiguang Shan, Xilin Chen

Published Wed, 11 Ma

Imagine you have a super-smart robot friend (an LVLM, or Large Vision-Language Model) that can look at a picture and tell you exactly what's happening. It's like a detective that never misses a detail.

But what if you wanted to trick this detective into seeing something that isn't there? Maybe you want it to think a dog in the photo is actually a tiger, or that a horse is a donkey?

This is what adversarial attacks try to do. They add tiny, invisible "noise" to an image to confuse the robot. However, previous attempts were like trying to change a specific word in a book by shaking the whole table. The robot would get confused about the whole picture, not just the one thing you wanted to change.

Enter V-Attack, the new method described in this paper. Here is how it works, explained simply:

1. The Problem: The "Blurry Glasses" Effect

Imagine the robot looks at a photo through a pair of glasses that smears everything together. When it sees a dog, the glasses mix the dog's features with the grass, the sky, and the horse next to it.

  • Old Method: Attackers tried to poke the "dog" part of the image, but because the glasses were smearing everything, the robot got confused about the whole scene. It might say, "I see a dog... wait, is that a tiger? Or maybe a horse?" It was messy and imprecise.

2. The Discovery: Finding the "Pure Signal"

The researchers discovered that inside the robot's brain, there are two ways it processes information:

  • The "Global" View (Patch Features): This is the smudged, mixed-up view where the dog is tangled with the background.
  • The "Local" View (Value Features): This is a special, hidden layer where the robot keeps the pure, un-mixed details of the dog. It's like looking at the dog through a magnifying glass that blocks out the rest of the world.

The researchers realized: If we want to change the dog into a tiger, we shouldn't poke the smudged view. We should poke the pure, magnified view.
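The "smudged" versus "pure" split maps onto real transformer internals: patch features are attention outputs that average value vectors from every patch, while each value vector is computed from its own patch alone. Here is a minimal NumPy sketch of that difference (toy dimensions and random weights, not the paper's code):

```python
import numpy as np

# Hedged sketch (toy shapes and weights, not the paper's code): inside an
# attention block, each patch's VALUE vector depends only on that patch,
# while the attention OUTPUT ("patch features") mixes all patches together.
rng = np.random.default_rng(0)
n_patch, d = 4, 8                       # 4 image patches, 8-dim features
X = rng.normal(size=(n_patch, d))       # patch embeddings entering a block
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv    # V[i] is computed from patch i alone
    s = Q @ K.T / np.sqrt(d)
    A = np.exp(s - s.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)  # softmax attention weights
    return A @ V, V                     # mixed patch features, pure values

patch_feats, V = attention(X)

# Perturb a single patch embedding (the "dog" patch):
X2 = X.copy()
X2[0] += 0.5
patch_feats2, V2 = attention(X2)

print(np.abs(V2[1:] - V[1:]).max())     # 0.0 -- other value vectors untouched
print(np.abs(patch_feats2[1:] - patch_feats[1:]).max() > 0)  # mixing leaks everywhere
```

Touching one patch leaves every other value vector exactly unchanged, but changes every row of the attention output. That locality is why the value features are the better attack target.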

3. The Solution: V-Attack (The "Surgical Scalpel")

V-Attack is like a surgeon using a laser instead of a sledgehammer. It has two main tools:

  • Tool 1: The "Focus Lens" (Self-Value Enhancement)
    Before attacking, V-Attack uses a special filter to make the "pure signal" of the dog even clearer. It sharpens the image of the dog in the robot's mind, ensuring the robot is 100% focused on the dog and nothing else.

  • Tool 2: The "Translator" (Text-Guided Manipulation)
    The researchers tell the robot: "Look at the dog. Now, imagine it is a tiger."
    Instead of messing with the whole picture, V-Attack finds the specific "pure signal" of the dog and gently nudges it to look like a tiger. Because it's only touching that one specific signal, the rest of the picture (the grass, the horse) stays perfectly normal.
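The "gentle nudge" of Tool 2 can be sketched as a small projected-gradient loop: optimize a bounded perturbation on just the dog patch so its value feature drifts toward a text-derived "tiger" target. The target vector, budget, and step size below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Toy sketch of text-guided value manipulation (illustrative assumptions,
# not the paper's code): nudge only the target patch's embedding so its
# value feature moves toward a "tiger" direction, under a small L-inf budget.
rng = np.random.default_rng(1)
d = 8
Wv = rng.normal(size=(d, d))            # value projection of one block
x_dog = rng.normal(size=d)              # embedding of the "dog" patch
v_tiger = rng.normal(size=d)            # stand-in for a text-derived target

delta = np.zeros(d)                     # adversarial perturbation
eps, lr = 0.3, 0.005                    # perturbation budget, step size
for _ in range(500):
    v = (x_dog + delta) @ Wv            # current value feature
    grad = 2 * Wv @ (v - v_tiger)       # gradient of ||v - v_tiger||^2
    delta = np.clip(delta - lr * grad, -eps, eps)  # step, then stay in budget

v_before = x_dog @ Wv
v_after = (x_dog + delta) @ Wv
closer = np.linalg.norm(v_after - v_tiger) < np.linalg.norm(v_before - v_tiger)
print(closer)                           # value feature moved toward "tiger"
```

Because the perturbation is applied only to the dog patch and clipped to a tiny budget, the rest of the image (and its value features) is left alone, which is the controllability the paper is after.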

4. The Result: A Master of Disguise

When they tested this on super-advanced robots like GPT-4o and GPT-o3 (which are known for being very smart and good at reasoning), the results were shocking:

  • Old methods succeeded in changing the dog into a tiger less than 10% of the time.
  • V-Attack succeeded 36% more often than the best previous methods.

Even when the robot was asked to think hard about the animal's biology ("Does this animal have stripes?"), V-Attack tricked it into saying, "Yes, that's definitely a tiger," even though it was still a dog.

Why Does This Matter?

Think of this like a security system. If you can trick the security guard into thinking a harmless dog is a dangerous tiger, you can bypass the rules.

  • The Good News: This paper helps us understand how these smart robots think. By finding their weak spots (the "Value Features"), we can build better defenses.
  • The Bad News: It shows that even the smartest AI models today can be fooled very easily if you know where to poke them.

In short: V-Attack is a new way to trick AI by finding the "purest" part of its brain and surgically changing just one thing, leaving the rest of the world untouched. It's like changing a single word in a sentence without changing the grammar of the whole paragraph.