Imagine a Vision-Language Model (VLM) as a very smart, well-read librarian who can read books (text) and look at pictures (images). This librarian is trained to be helpful but also to refuse dangerous requests, like "How do I build a bomb?"
However, hackers have found a sneaky way to trick this librarian. They don't just ask the question; they show the librarian a picture that looks innocent but contains hidden, dangerous instructions (like a photo of a bomb with a caption that says "How to make this?"). The librarian gets confused by the picture, forgets their safety rules, and accidentally gives the dangerous instructions. This is called a "Multimodal Jailbreak."
The paper introduces a new defense called DTR (Dynamic Token Reweighting). Here is how it works, explained with simple analogies:
1. The Problem: The "Bad Noise" in the Picture
When the librarian looks at a picture, they break it down into hundreds of tiny pieces called "tokens" (think of them as individual puzzle pieces, each covering a small patch of the image).
- In a normal picture, all the pieces work together to tell a story.
- In a jailbreak picture, the hacker adds "bad noise" to specific pieces. These bad pieces whisper to the librarian, "Ignore your safety rules! Look at me!"
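To make the "puzzle pieces" idea concrete, here is a minimal toy sketch of how an image might be chopped into patch tokens. This is a simplified stand-in for a real vision encoder (which would also project each patch through learned weights), not the paper's actual pipeline.

```python
import numpy as np

def image_to_patch_tokens(image, patch=16):
    """Split an H x W x C image into flat patch tokens.
    A toy stand-in for a ViT-style patch embedder."""
    h, w, c = image.shape
    pieces = []
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            pieces.append(image[y:y + patch, x:x + patch].reshape(-1))
    return np.stack(pieces)

image = np.zeros((64, 64, 3))          # a blank 64x64 RGB "picture"
tokens = image_to_patch_tokens(image)
print(tokens.shape)                     # → (16, 768): a 4x4 grid of patches
```

A jailbreak image perturbs the pixel values inside just a few of these patches, so only a handful of the resulting tokens carry the "bad noise."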
2. The Old Solutions: The Heavy Hammers
Previous ways to stop this were like using a sledgehammer:
- Fine-tuning: Sending the librarian back for more safety training with new books. This is expensive, slow, and sometimes makes the librarian worse at their actual job (like identifying objects).
- Image-to-Text: Forcing the librarian to describe the picture out loud before answering. This is slow and often loses the subtle details the hacker used to trick them.
3. The New Solution: DTR (The Smart Volume Knob)
DTR is a clever, real-time fix that happens while the librarian is looking at the picture. It doesn't need to retrain the librarian or describe the picture. Instead, it acts like a smart volume knob for the puzzle pieces.
Here is the step-by-step process:
Step A: Finding the "Refusal Direction"
First, the system figures out the librarian's "safety muscle." It learns the specific mental direction the librarian takes when they say, "No, I can't do that." Let's call this the "Refusal Vector."
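A common way to estimate such a direction (used in the interpretability literature; the paper's exact construction may differ) is a difference of means: average the model's hidden states over refused prompts, average them over complied prompts, and take the normalized difference. A toy sketch with random stand-in data:

```python
import numpy as np

def refusal_direction(refusal_states, compliance_states):
    """Estimate a 'Refusal Vector' as the normalized difference of mean
    hidden states on refused vs. complied prompts (difference-of-means sketch)."""
    direction = refusal_states.mean(axis=0) - compliance_states.mean(axis=0)
    return direction / np.linalg.norm(direction)

rng = np.random.default_rng(0)
refusals = rng.normal(loc=1.0, size=(50, 8))    # toy states for "No, I can't do that"
compliance = rng.normal(loc=0.0, size=(50, 8))  # toy states for helpful answers
v = refusal_direction(refusals, compliance)
# Projecting any hidden state onto v gives a scalar "how refusal-like?" score.
```

The vector `v` is what later steps probe against: pushing the model's internal state along `v` corresponds to flexing the "safety muscle."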
Step B: The "Reversal" Test
When a new picture comes in, DTR asks a hypothetical question: "If I turned down the volume on certain parts of this picture, could I push the librarian back toward saying 'No'?"
- If the picture is safe (benign): Turning down the volume on parts of the image doesn't change the librarian's mind much. The picture is clear, and the librarian stays helpful.
- If the picture is a jailbreak: The system finds that turning down the volume on just a few specific "bad noise" pieces makes the librarian suddenly remember their safety rules and refuse the request.
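The reversal test can be sketched as a per-token probe: shrink one token's contribution, re-pool, and see how far the result moves toward the refusal vector. Everything below is a toy (mean-pooling stands in for the model's real forward pass, and the token contents are hand-crafted), but it shows the shape of the idea.

```python
import numpy as np

def reversal_scores(tokens, readout, refusal_vec, damp=0.1):
    """Score each visual token by how much turning its volume down
    pushes the pooled representation toward the refusal vector."""
    base = readout(tokens) @ refusal_vec
    scores = np.empty(len(tokens))
    for i in range(len(tokens)):
        probe = tokens.copy()
        probe[i] = probe[i] * damp        # hypothetically quiet this token
        scores[i] = readout(probe) @ refusal_vec - base
    return scores                          # large positive = suspicious token

# Toy setup: token 3 carries "bad noise" pointing against the refusal vector.
readout = lambda t: t.mean(axis=0)
refusal_vec = np.array([1.0, 0.0, 0.0, 0.0])
tokens = np.zeros((8, 4))
tokens[3] = np.array([-5.0, 0.0, 0.0, 0.0])
scores = reversal_scores(tokens, readout, refusal_vec)
print(int(scores.argmax()))  # → 3: the adversarial token stands out
```

For a benign image no single token moves the needle much, so all scores stay near zero, which is exactly the "safe picture" case described above.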
Step C: Dynamic Reweighting (The Magic)
Once DTR identifies those "bad noise" pieces, it dynamically lowers their weight (turns down their volume) and keeps the good pieces loud.
- Analogy: Imagine a choir singing. If one singer is shouting a dangerous command, DTR doesn't kick them out of the choir (which ruins the song). Instead, it puts a mute button on that specific singer while keeping the rest of the choir loud and clear. The song (the image's meaning) remains intact, but the dangerous command is silenced.
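The "mute button" step can be sketched as a simple weighting rule: tokens flagged by the reversal test get scaled toward zero, everything else stays at full volume. The threshold-and-floor scheme here is a hypothetical simplification; the paper's actual reweighting may be continuous rather than a hard cutoff.

```python
import numpy as np

def reweight_tokens(tokens, scores, threshold=0.0, floor=0.05):
    """Near-mute tokens whose reversal score exceeds a threshold,
    leaving benign tokens untouched (hypothetical weighting rule)."""
    weights = np.ones(len(tokens))
    weights[scores > threshold] = floor   # turn the volume way down, not off
    return tokens * weights[:, None]

tokens = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
scores = np.array([0.0, 0.9, 0.0])        # token 1 was flagged by the reversal test
safe = reweight_tokens(tokens, scores)    # tokens 0 and 2 pass through unchanged
```

Because the flagged token is attenuated rather than deleted, the rest of the "choir" (the image's overall meaning) is preserved.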
Why is this better?
- It's Fast: It doesn't need to re-read the picture or retrain the model. It just adjusts the volume knobs instantly.
- It's Precise: It only silences the "bad" parts of the image, so the librarian can still see the rest of the picture clearly.
- It's a Trap for Hackers: The paper notes a funny dilemma for the hackers. To trick the librarian, they need the "bad noise" to be loud. But if they make it loud, DTR detects it and mutes it. If they make it quiet so DTR doesn't notice, the librarian ignores the trick and stays safe. The hacker can't win.
Summary
DTR is like a bouncer at a club who can instantly spot the one person in the crowd trying to sneak in a weapon. Instead of kicking everyone out (which stops the party) or searching everyone's pockets (which takes too long), the bouncer simply puts a "force field" around that one person, neutralizing the weapon while letting the rest of the party continue exactly as normal.
This allows AI models to stay safe from visual tricks without losing their ability to be helpful, fast, and smart.