V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs

This paper introduces V-Attack, a novel adversarial attack method for Large Vision-Language Models that achieves precise local semantic manipulation by targeting disentangled value features within transformer attention blocks, thereby overcoming the controllability limitations of existing approaches that rely on entangled patch-token representations.

Sen Nie, Jie Zhang, Jianxin Yan, Shiguang Shan, Xilin Chen

Published Wed, 11 Ma

Imagine you have a super-smart robot friend (an LVLM, or Large Vision-Language Model) that can look at a picture and tell you exactly what's happening. It's like a detective that never misses a detail.

But what if you wanted to trick this detective into seeing something that isn't there? Maybe you want it to think a dog in the photo is actually a tiger, or that a horse is a donkey?

This is what adversarial attacks try to do. They add tiny, invisible "noise" to an image to confuse the robot. However, previous attempts were like trying to change a specific word in a book by shaking the whole table. The robot would get confused about the whole picture, not just the one thing you wanted to change.

Enter V-Attack, the new method described in this paper. Here is how it works, explained simply:

1. The Problem: The "Blurry Glasses" Effect

Imagine the robot looks at a photo through a pair of glasses that smears everything together. When it sees a dog, the glasses mix the dog's features with the grass, the sky, and the horse next to it.

  • Old Method: Attackers tried to poke the "dog" part of the image, but because the glasses were smearing everything, the robot got confused about the whole scene. It might say, "I see a dog... wait, is that a tiger? Or maybe a horse?" It was messy and imprecise.

2. The Discovery: Finding the "Pure Signal"

The researchers discovered that inside the robot's brain, there are two ways it processes information:

  • The "Global" View (Patch Features): This is the smudged, mixed-up view where the dog is tangled with the background.
  • The "Local" View (Value Features): This is a special, hidden layer where the robot keeps the pure, un-mixed details of the dog. It's like looking at the dog through a magnifying glass that blocks out the rest of the world.

The researchers realized: If we want to change the dog into a tiger, we shouldn't poke the smudged view. We should poke the pure, magnified view.
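The "smudged" versus "pure" split maps onto real transformer internals: patch features are attention outputs that average value vectors from every patch, while each value vector is computed from its own patch alone. Here is a minimal NumPy sketch of that difference (toy dimensions and random weights, not the paper's code):

```python
import numpy as np

# Hedged sketch (toy shapes and weights, not the paper's code): inside an
# attention block, each patch's VALUE vector depends only on that patch,
# while the attention OUTPUT ("patch features") mixes all patches together.
rng = np.random.default_rng(0)
n_patch, d = 4, 8                       # 4 image patches, 8-dim features
X = rng.normal(size=(n_patch, d))       # patch embeddings entering a block
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv    # V[i] is computed from patch i alone
    s = Q @ K.T / np.sqrt(d)
    A = np.exp(s - s.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)  # softmax attention weights
    return A @ V, V                     # mixed patch features, pure values

patch_feats, V = attention(X)

# Perturb a single patch embedding (the "dog" patch):
X2 = X.copy()
X2[0] += 0.5
patch_feats2, V2 = attention(X2)

print(np.abs(V2[1:] - V[1:]).max())     # 0.0 -- other value vectors untouched
print(np.abs(patch_feats2[1:] - patch_feats[1:]).max() > 0)  # mixing leaks everywhere
```

Touching one patch leaves every other value vector exactly unchanged, but changes every row of the attention output. That locality is why the value features are the better attack target.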

3. The Solution: V-Attack (The "Surgical Scalpel")

V-Attack is like a surgeon using a laser instead of a sledgehammer. It has two main tools:

  • Tool 1: The "Focus Lens" (Self-Value Enhancement)
    Before attacking, V-Attack uses a special filter to make the "pure signal" of the dog even clearer. It sharpens the image of the dog in the robot's mind, ensuring the robot is 100% focused on the dog and nothing else.

  • Tool 2: The "Translator" (Text-Guided Manipulation)
    The researchers tell the robot: "Look at the dog. Now, imagine it is a tiger."
    Instead of messing with the whole picture, V-Attack finds the specific "pure signal" of the dog and gently nudges it to look like a tiger. Because it's only touching that one specific signal, the rest of the picture (the grass, the horse) stays perfectly normal.
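The "gentle nudge" of Tool 2 can be sketched as a small projected-gradient loop: optimize a bounded perturbation on just the dog patch so its value feature drifts toward a text-derived "tiger" target. The target vector, budget, and step size below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Toy sketch of text-guided value manipulation (illustrative assumptions,
# not the paper's code): nudge only the target patch's embedding so its
# value feature moves toward a "tiger" direction, under a small L-inf budget.
rng = np.random.default_rng(1)
d = 8
Wv = rng.normal(size=(d, d))            # value projection of one block
x_dog = rng.normal(size=d)              # embedding of the "dog" patch
v_tiger = rng.normal(size=d)            # stand-in for a text-derived target

delta = np.zeros(d)                     # adversarial perturbation
eps, lr = 0.3, 0.005                    # perturbation budget, step size
for _ in range(500):
    v = (x_dog + delta) @ Wv            # current value feature
    grad = 2 * Wv @ (v - v_tiger)       # gradient of ||v - v_tiger||^2
    delta = np.clip(delta - lr * grad, -eps, eps)  # step, then stay in budget

v_before = x_dog @ Wv
v_after = (x_dog + delta) @ Wv
closer = np.linalg.norm(v_after - v_tiger) < np.linalg.norm(v_before - v_tiger)
print(closer)                           # value feature moved toward "tiger"
```

Because the perturbation is applied only to the dog patch and clipped to a tiny budget, the rest of the image (and its value features) is left alone, which is the controllability the paper is after.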

4. The Result: A Master of Disguise

When they tested this on super-advanced robots like GPT-4o and GPT-o3 (which are known for being very smart and good at reasoning), the results were shocking:

  • Old methods succeeded in changing the dog into a tiger less than 10% of the time.
  • V-Attack succeeded 36% more often than the best previous methods.

Even when the robot was asked to think hard about the animal's biology ("Does this animal have stripes?"), V-Attack tricked it into saying, "Yes, that's definitely a tiger," even though it was still a dog.

Why Does This Matter?

Think of this like a security system. If you can trick the security guard into thinking a harmless dog is a dangerous tiger, you can bypass the rules.

  • The Good News: This paper helps us understand how these smart robots think. By finding their weak spots (the "Value Features"), we can build better defenses.
  • The Bad News: It shows that even the smartest AI models today can be fooled very easily if you know where to poke them.

In short: V-Attack is a new way to trick AI by finding the "purest" part of its brain and surgically changing just one thing, leaving the rest of the world untouched. It's like changing a single word in a sentence without changing the grammar of the whole paragraph.