VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Imagine you have a super-smart robot assistant that watches video feeds from your car's cameras. Its job is to look at the road, understand what's happening, and give you quick, life-saving advice like, "Stop!" or "Turn left." This robot is a Video-LLM (a Video Large Language Model). It's designed to be fast and efficient because, in a car, a split-second delay can mean the difference between safety and a crash.

Now imagine a hacker who isn't trying to fool the robot into seeing something that isn't there. Instead, they want to cause a crash by making the robot talk too much.

This paper introduces a new kind of digital weapon called VidDoS (Video Denial-of-Service). Here is how it works, explained simply:

1. The Problem: The "Slow Talker" Attack

In the past, hackers tried to confuse AI by adding tiny, invisible scratches to a single photo. But video is different. Video AI looks at many frames at once and averages them out (like blending colors). If you put a scratch on just one frame, the AI ignores it because the other frames look normal. It's like whispering a secret once in a crowded room; the crowd drowns you out.

Also, these video robots are trained to be concise. If you ask, "Is the road clear?" they are programmed to say "Yes" or "No" instantly. They hate long, rambling answers.

2. The Solution: The "Universal Sticker"

The researchers created VidDoS, which is like a magic sticker you can put on any video, anywhere, at any time.

It's Universal: You don't need to customize the sticker for every single car or every single road. You design it once, and it works on any video stream.
It's a "Sponge": The sticker is designed to trick the robot into thinking it needs to write a novel instead of a text message. It forces the robot to generate hundreds of words when it should only say one.

3. How the Trick Works (The Three Magic Spells)

To make the robot talk forever, the sticker uses three clever tricks:

The "Sponge" Trap: The sticker forces the robot to start a long, repetitive sentence (like a broken record). Once it starts, it's hard to stop.
The "No-Stop" Sign: The robot usually has a "Stop" button (called an End-of-Sequence token) that it hits when it's done. The sticker puts a shield over that button, making the robot forget to stop.
The "No-Short-Answers" Rule: The sticker blocks the robot from saying simple words like "Yes" or "No." It forces the robot to keep explaining itself, even when a simple answer is all that's needed.

4. The Real-World Danger: The Traffic Jam in Your Head

The scary part isn't just that the robot talks too much; it's that it stops working while it talks.

Think of the robot's brain as a single-lane road.

Normal situation: A car (a question) drives in, gets an answer, and leaves. Fast.
VidDoS situation: The hacker puts a "Sponge" sticker on the road. The car drives in, but the robot starts building a 50-mile-long bridge to answer the question. While it's building that bridge, no other cars can get through.

In a self-driving car, if the robot is busy generating 500 words about a bird flying by, it might not be able to process the fact that a child is running into the street. The delay (latency) becomes so huge that the car misses its chance to brake.

5. The Results

The researchers tested this on three different smart video systems. The results were shocking:

Token Explosion: In some scenarios, the robots generated more than 200 times as many tokens (word pieces) as usual.
Speed Crash: Responses took more than 15 times longer.
Safety Failure: In simulated driving scenarios, this delay was longer than the time window a human driver needs to safely take back control.

The Big Takeaway

This paper warns us that while we are building super-smart video AI for our cars and homes, we haven't thought about how to stop attackers from making them "chatty" on purpose. A hacker doesn't need to break the camera or the engine; they just need to trick the AI into talking too much, and the system will freeze up, leaving us vulnerable.

VidDoS is the first tool to show us this vulnerability, proving that sometimes, the most dangerous attack isn't a punch, but a very, very long conversation.

1. Problem Statement

Video-based Large Language Models (Video-LLMs) are increasingly deployed in safety-critical domains like autonomous driving. However, they are vulnerable to Energy-Latency Attacks (ELAs), a form of Denial-of-Service (DoS) where adversaries manipulate inputs to force the model to generate excessively long responses, exhausting computational resources and causing critical delays.

Existing image-centric ELA methods (e.g., Verbose Images, RECALLED) fail when applied to Video-LLMs due to three specific architectural challenges:

Temporal Aggregation: Video encoders use aggressive temporal subsampling and pooling, which dilutes frame-specific perturbations, preventing the attack signal from reaching the decoder.
Real-Time Constraints: Instance-wise optimization (calculating gradients for every specific video frame) is computationally too expensive for continuous, real-time video streams.
Dynamic Context: Static image perturbations fail to generalize across shifting visual contexts in video, lacking a strategy to anchor attention independent of content changes.
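
The dilution effect behind the Temporal Aggregation challenge can be illustrated with a toy calculation. This is a minimal sketch that assumes plain mean pooling over per-frame feature vectors; real video encoders are more elaborate, and the names `temporal_mean_pool`, `NUM_FRAMES`, and `EPSILON` are illustrative, not taken from the paper.

```python
def temporal_mean_pool(frames):
    """Average a list of equal-length feature vectors over the time axis."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]


def max_shift(pooled_a, pooled_b):
    """Largest per-dimension difference between two pooled vectors."""
    return max(abs(a - b) for a, b in zip(pooled_a, pooled_b))


NUM_FRAMES = 16   # assumed clip length
DIM = 4           # assumed (tiny) feature dimension
EPSILON = 0.5     # assumed perturbation size in feature space

clean = [[1.0] * DIM for _ in range(NUM_FRAMES)]

# Case 1: perturb a single frame only (image-style attack).
single = [row[:] for row in clean]
single[0] = [v + EPSILON for v in single[0]]

# Case 2: the same offset is present in every frame (universal trigger).
universal = [[v + EPSILON for v in row] for row in clean]

pooled_clean = temporal_mean_pool(clean)
shift_single = max_shift(temporal_mean_pool(single), pooled_clean)
shift_universal = max_shift(temporal_mean_pool(universal), pooled_clean)

print(f"single-frame shift after pooling: {shift_single}")
print(f"all-frame shift after pooling:    {shift_universal}")
```

Under these assumptions, the single-frame perturbation shrinks by a factor equal to the number of frames, while a perturbation present in every frame passes through pooling at full strength.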

2. Methodology: VidDoS

The authors propose VidDoS, the first universal ELA framework tailored for Video-LLMs. It utilizes a "train-once, deploy-anywhere" paradigm, optimizing a single trigger on a surrogate dataset that can be applied to any unseen video stream without inference-time gradient calculations.
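
The deployment side of "train-once, deploy-anywhere" can be sketched as a pure copy operation with no gradient computation. The snippet below is an illustration only: frames are modeled as nested lists of pixel values, the bottom-right placement is an assumption, and `apply_patch` and `apply_to_stream` are hypothetical helper names.

```python
def apply_patch(frame, patch):
    """Return a copy of `frame` with `patch` pasted into its bottom-right corner.

    `frame` and `patch` are 2D lists (rows of pixel values). The patch
    replaces the underlying pixels rather than being added to them.
    """
    height, width = len(frame), len(frame[0])
    p_height, p_width = len(patch), len(patch[0])
    if p_height > height or p_width > width:
        raise ValueError("patch must fit inside the frame")

    out = [row[:] for row in frame]  # copy so the input frame is untouched
    for r in range(p_height):
        for c in range(p_width):
            out[height - p_height + r][width - p_width + c] = patch[r][c]
    return out


def apply_to_stream(frames, patch):
    """Paste the same pre-trained patch into every frame of a clip."""
    return [apply_patch(frame, patch) for frame in frames]


# Toy usage: a 3-frame clip of 4x4 black frames and a 2x2 white patch.
clip = [[[0] * 4 for _ in range(4)] for _ in range(3)]
trigger = [[1, 1], [1, 1]]
patched_clip = apply_to_stream(clip, trigger)
```

Because the same patch is reused for every frame and every video, the per-frame cost at attack time is a small array copy.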

Core Components:

Universal Adversarial Trigger (Spatial Patch):
Instead of pixel-wise noise, VidDoS injects a spatially concentrated, learnable replacement patch (e.g., in the bottom-right corner) into the video frames. This patch is optimized to act as a "semantic anomaly" that hijacks the cross-modal attention mechanism, bypassing the low-pass filtering effects of temporal pooling.
Masked Teacher Forcing:
The attack steers the model's predictive distribution toward a computationally expensive, repetitive "sponge" sequence ($y^\star$). A weighted cross-entropy loss is applied only to the target tokens, with higher weights on the initial tokens to stabilize the entry into a long-generation regime.
Refusal Penalty & Early-Termination Suppression:
To override the models' fine-tuned priors for conciseness (e.g., answering "Yes/No" or stopping early):
- Refusal Penalty ($\mathcal{L}_{ban}$): Penalizes the probability of generating short, task-relevant answers or the End-Of-Sequence (EOS) token at the first step.
- Early-Termination Suppression ($\mathcal{L}_{stop}$): Aggressively suppresses the probability of the EOS token over a specific generation horizon ($K$), forcing the model to continue generating.
Optimization:
The universal patch $\delta$ is optimized via Sign-PGD (Projected Gradient Descent) on a source dataset to minimize the joint loss function, subject to $\ell_\infty$ constraints to ensure the perturbation remains visually imperceptible.
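
One way the components above might fit together is sketched below. The exact functional forms are assumptions, not taken from the paper: geometrically decaying position weights stand in for "higher weights on the initial tokens", the two penalties are written as raw probability mass, and the weights `lam_ban` and `lam_stop` and all function names are hypothetical. Gradients would come from an autodiff framework in practice; here `sign_pgd_step` only shows the update and projection.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sponge_loss(step_logits, target_ids, decay=0.9):
    """Weighted cross-entropy toward the sponge target; early tokens weigh more."""
    weights = [decay ** t for t in range(len(target_ids))]
    total = sum(-w * log_softmax(logits)[y]
                for w, logits, y in zip(weights, step_logits, target_ids))
    return total / sum(weights)


def ban_loss(first_logits, banned_ids):
    """Probability mass on short answers and EOS at the first decoding step."""
    log_probs = log_softmax(first_logits)
    return sum(math.exp(log_probs[i]) for i in banned_ids)


def stop_loss(step_logits, eos_id, horizon):
    """Mean EOS probability over the first `horizon` decoding steps."""
    steps = step_logits[:horizon]
    return sum(math.exp(log_softmax(l)[eos_id]) for l in steps) / len(steps)


def joint_loss(step_logits, target_ids, banned_ids, eos_id, horizon,
               lam_ban=1.0, lam_stop=1.0):
    """Sum of the three terms; the lambda weights are illustrative."""
    return (sponge_loss(step_logits, target_ids)
            + lam_ban * ban_loss(step_logits[0], banned_ids)
            + lam_stop * stop_loss(step_logits, eos_id, horizon))


def sign_pgd_step(delta, grad, step_size, eps):
    """Move against the gradient sign, then clip each entry to [-eps, eps]."""
    def sign(g):
        return (g > 0) - (g < 0)
    return [max(-eps, min(eps, d - step_size * sign(g)))
            for d, g in zip(delta, grad)]


# Toy vocabulary of 4 tokens: 0 = "Yes", 1 and 2 = sponge tokens, 3 = EOS.
follows_sponge = [[0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
stops_early = [[0.0, 0.0, 0.0, 5.0], [0.0, 0.0, 0.0, 5.0]]
loss_good = joint_loss(follows_sponge, [1, 2], banned_ids=[0, 3], eos_id=3, horizon=2)
loss_bad = joint_loss(stops_early, [1, 2], banned_ids=[0, 3], eos_id=3, horizon=2)
new_delta = sign_pgd_step([0.0, 0.05], [2.0, -3.0], step_size=0.1, eps=0.08)
```

In this toy setup, logits that follow the sponge sequence score a lower joint loss than logits that favor EOS, which is the direction the optimizer would push the patch.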

3. Key Contributions

First Universal ELA for Video-LLMs: Introduces a framework that resists temporal subsampling and stochastic noise, unlike previous image-based attacks.
Novel Optimization Strategy: Combines Masked Teacher Forcing with Refusal/Early-Termination penalties to force unbounded generation and override model conciseness priors.
Zero-Overhead Deployment: The attack requires no instance-specific optimization during inference, making it viable for real-time streaming attacks.
Comprehensive Evaluation: Demonstrates state-of-the-art attack potency across three mainstream Video-LLMs (LLaVA-NeXT-Video, Qwen3-VL, Video-LLaVA) and diverse datasets (Autonomous Driving, General QA).

4. Experimental Results

The authors evaluated VidDoS on BDDX, D2-City (autonomous driving), and VideoSimpleQA.

Token Expansion: VidDoS induced a token expansion of >205× relative to clean baselines in some scenarios. The size of the effect varies by model and dataset: for Video-LLaVA, the reported 462.55 tokens against 30.65 clean tokens corresponds to roughly 15×.
Latency Inflation: Inference latency increased by >15× (e.g., from ~0.38s to ~6.74s).
Comparison: Existing methods (Verbose Images, NICGSlowDown) failed to generalize, often yielding token ratios near 1.0× (negligible impact).
Transferability: The universal patch trained on one driving dataset (BDDX) successfully transferred to another (D2-City), achieving high token counts, though performance dropped when transferring to semantically distinct domains (e.g., general QA).
Robustness: The attack remained effective even at high decoding temperatures ($T=1.5$), with expansion ratios remaining above 240×.

Safety Analysis (Autonomous Driving)

Simulations of real-time autonomous driving streams revealed that the induced latency leads to critical safety violations.

In a streaming pipeline, the cumulative latency caused by VidDoS exceeded the safety threshold ($\tau_{safe} = 2.72$ s) required for a human driver to regain control during a manual takeover scenario.
The attack effectively blocked the synchronous inference pipeline, causing system delays that could endanger passenger safety.
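
The blocking effect can be illustrated with a simple single-worker queue. This sketch reuses the latencies reported above (about 0.38 s clean, about 6.74 s under attack) and the 2.72 s threshold; the 1.0 s interval between incoming clips, the number of requests, and the function name `queueing_delays` are assumptions made for illustration, not the paper's simulation setup.

```python
def queueing_delays(arrival_interval, service_time, num_requests):
    """Arrival-to-answer delay for each request on one synchronous worker.

    Requests arrive at a fixed interval; each one must wait until the
    previous request has finished before it can start.
    """
    delays = []
    free_at = 0.0  # time at which the worker next becomes idle
    for i in range(num_requests):
        arrival = i * arrival_interval
        start = max(arrival, free_at)
        free_at = start + service_time
        delays.append(free_at - arrival)
    return delays


TAU_SAFE = 2.72         # safety threshold from the text, in seconds
ARRIVAL_INTERVAL = 1.0  # assumed gap between incoming clips, in seconds

clean_delays = queueing_delays(ARRIVAL_INTERVAL, 0.38, 5)
attacked_delays = queueing_delays(ARRIVAL_INTERVAL, 6.74, 5)

for name, delays in (("clean", clean_delays), ("attacked", attacked_delays)):
    violations = sum(d > TAU_SAFE for d in delays)
    print(f"{name}: max delay {max(delays):.2f} s, "
          f"{violations}/{len(delays)} requests over the threshold")
```

Under these assumptions the clean pipeline keeps up with the stream, while under attack the first response alone already exceeds the threshold and each later request waits longer than the one before it.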

5. Significance

Security Gap Identification: The paper highlights a critical, previously unaddressed vulnerability in Video-LLMs where temporal aggregation mechanisms, intended for efficiency, inadvertently create a vector for universal DoS attacks.
Real-World Impact: By demonstrating that a single, static patch can cause catastrophic latency in autonomous driving scenarios, the work underscores the immediate need for robustness in safety-critical multimodal systems.
Defense Implications: The findings suggest that simple mitigations, such as sampling at a higher decoding temperature, are insufficient against universal, content-agnostic triggers, necessitating new architectural defenses against "sponge" sequences in video processing.

In conclusion, VidDoS proves that Video-LLMs are highly susceptible to resource exhaustion attacks that can be executed universally and in real-time, posing a severe threat to the reliability of AI systems in safety-critical applications.