MalURLBench: A Benchmark Evaluating Agents' Vulnerabilities When Processing Web URLs

This paper introduces MalURLBench, the first benchmark designed to evaluate and expose the vulnerabilities of LLM-based web agents in detecting malicious URLs across diverse real-world scenarios. It also proposes a lightweight defense module to mitigate these security risks.

Dezhang Kong, Zhuxi Wu, Shiqi Liu, Zhicheng Tan, Kuichen Lu, Minghao Li, Qichen Liu, Shengyu Chu, Zhenhua Xu, Xuan Liu, Meng Han

Published 2026-03-16

Imagine you have a super-smart digital assistant (an AI Agent) that can browse the internet for you. You tell it, "Find me a concert ticket," and it goes out, checks websites, and buys the ticket. It's like having a personal shopper who never sleeps.

But here's the problem: This assistant is incredibly gullible when it comes to addresses.

This paper, MalURLBench, is like a giant "trap test" designed to see how easily these AI assistants can be tricked into visiting dangerous websites just by looking at a link.

Here is the breakdown in simple terms:

1. The Core Problem: The "Fake Address" Trick

Think of a website URL (like www.google.com) as a house address.

  • Normal Address: 123 Main St. (Safe)
  • Malicious Address: 123 Main St. (looks identical, but the doorbell is actually a trap)

The researchers found that AI agents are terrible at spotting disguised addresses. Attackers can tweak the URL slightly to make it look like a trusted site, even though it's actually a trap.

The Analogy:
Imagine you ask your assistant to visit "The Official Bank."

  • Real Bank: bank.com
  • The Trap: bank.com-secure-login-please-click-here.com

A human might spot the weird extra words. But the AI, in its rush to be helpful, often thinks, "Oh, it has 'bank' in the name, so it must be safe!" and clicks the link, leading to a phishing site or a virus.
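The trick above comes down to which part of the URL actually decides where you land. A naive sketch (the helper below is illustrative, not from the paper; production code should use the Public Suffix List, e.g. via a library like tldextract) shows that the lookalike's real registered domain has nothing to do with the bank:

```python
from urllib.parse import urlparse

def registered_domain(url: str) -> str:
    """Naively treat the last two dot-separated labels of the hostname
    as the registered domain. (Real code should consult the Public
    Suffix List; this is only a demo of the idea.)"""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

print(registered_domain("https://bank.com/login"))
# -> bank.com
print(registered_domain("https://bank.com-secure-login-please-click-here.com/login"))
# -> com-secure-login-please-click-here.com
```

Everything before that final registered domain is attacker-controlled decoration, which is exactly what fools the agent.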

2. The Experiment: The "MalURLBench" Test

The authors built a massive testing ground called MalURLBench.

  • The Size: They created 61,845 different "fake" links.
  • The Scenarios: They tested these links in 10 real-life situations, like ordering food, checking the weather, or tracking a package.
  • The Test: They asked 12 different popular AI models (like GPT-4, Llama, and Mistral) to decide: "Is this link safe to visit?"
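The core of such a test can be sketched as a loop that asks each model about each link and counts how often it gets fooled. Everything here is illustrative: `ask_model` stands in for a real API call to one of the 12 models, and the keyword check inside it is just a placeholder so the sketch runs.

```python
def ask_model(model_name: str, url: str) -> bool:
    """Placeholder for a real LLM call: True means the model says "safe".
    A real harness would embed the URL in one of the 10 task scenarios
    (ordering food, tracking a package, ...) and parse the model's answer."""
    return "com-" not in url  # stand-in heuristic so the demo is runnable

def attack_success_rate(model_name: str, malicious_urls: list[str]) -> float:
    """Fraction of malicious URLs the model wrongly accepts as safe."""
    fooled = sum(ask_model(model_name, u) for u in malicious_urls)
    return fooled / len(malicious_urls)

urls = [
    "https://bank.com-secure-login-please-click-here.com",  # caught by the stand-in
    "https://paypa1-login.example",                         # slips through
]
print(attack_success_rate("demo-model", urls))
```

The paper's headline numbers (99% success against some models) are this ratio, computed over 61,845 links.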

The Result:
The AI models failed miserably.

  • Some models fell for the trick 99% of the time.
  • Even the "smartest" models fell for it 30% to 40% of the time.
  • The Takeaway: Current AI agents are walking around with blinders on, trusting any URL that looks slightly familiar.

3. Why Do They Fail? (The "Why" Behind the Mistakes)

The researchers dug deep to find out why the AI is so easily fooled. They found some surprising reasons:

  • The "Short Name" Bias: AI trusts short, simple subdomains (like news.) more than long, weird ones. Attackers learned to keep the fake parts short to trick the AI.
  • The "Fancy Domain" Bias: AI trusts old, boring domain endings like .com or .net more than new, fancy ones like .xyz or .art. Attackers use the boring ones to look safe.
  • The "MoE" Glitch: Some AI models use a "Mixture of Experts" (like a team of specialists). The researchers found that if the specific "expert" who handles URLs isn't activated, the whole model gets confused and clicks the bad link.
  • Lack of Training: Basically, the AI hasn't seen enough "bad" URLs in its training data. It's like teaching a child to drive only on empty roads; when they hit a pothole, they don't know how to react.

4. The Solution: "URLGuard"

Since the AI is bad at spotting these traps, the authors built a security guard called URLGuard.

  • What it is: A tiny, lightweight AI model trained specifically to spot these fake URLs.
  • How it works: Before the main AI assistant visits a link, it asks URLGuard: "Hey, is this safe?"
  • The Result: URLGuard is a superhero. It reduced the attack success rate from nearly 100% down to almost 0% in many cases. It's like having a bouncer at the club who checks IDs before letting anyone in.
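The "ask the bouncer first" pattern can be sketched in a few lines. Note the hedge: `url_guard_is_safe` below is a toy keyword check standing in for URLGuard, which in the paper is a small trained classifier, not a rule.

```python
from urllib.parse import urlparse

def url_guard_is_safe(url: str) -> bool:
    """Stand-in for the URLGuard classifier: flag lookalike hosts such as
    "bank.com-secure-...". The real module is a learned model."""
    host = urlparse(url).hostname or ""
    return "com-" not in host  # toy rule for demonstration only

def agent_visit(url: str) -> str:
    """The agent checks every link with the guard before fetching it."""
    if not url_guard_is_safe(url):
        return f"BLOCKED: {url}"
    return f"VISITING: {url}"

print(agent_visit("https://bank.com/login"))
# -> VISITING: https://bank.com/login
print(agent_visit("https://bank.com-secure-login-please-click-here.com/login"))
# -> BLOCKED: https://bank.com-secure-login-please-click-here.com/login
```

The design point is that the guard sits in front of the agent's browsing tool, so even a fully fooled main model never reaches the malicious page.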

5. Why This Matters

This isn't just about AI clicking a bad link.

  • Stage 1 vs. Stage 2: Most security tests check if the AI can handle a bad webpage (Stage 2). This paper checks if the AI can even be tricked into going to the webpage in the first place (Stage 1).
  • The Future: As AI agents become our daily assistants (booking flights, buying groceries), if they can be tricked into visiting a fake site, hackers could steal your credit card info or install malware on your computer without you ever knowing.

Summary

MalURLBench is a wake-up call. It shows that our current AI assistants are too trusting of internet addresses. They need a "bouncer" (like URLGuard) to check the IDs before they let the AI visit any website. Without this, the future of AI agents is wide open to digital pickpockets.
