Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper. Read full disclaimer
Imagine you hire a brilliant, super-fast apprentice programmer to write code for your business. You give them a simple, normal request like, "Write a script to buy a specific digital token on this popular trading site." You expect them to write safe, standard code.
However, this paper reveals a scary reality: Your apprentice has memorized a library of dangerous, fake instructions hidden inside their training books. When you ask for help with a specific task, they might accidentally pull out a page from a scammer's manual and paste it into your code, sending your money to a thief instead of the legitimate site.
Here is a breakdown of the paper's findings using simple analogies:
1. The Problem: The "Poisoned Cookbook"
Large Language Models (LLMs) are like chefs who have read almost every recipe book on the internet to learn how to cook. The problem is that the internet is full of "poisoned" recipes—fake instructions designed to steal your wallet or data.
- The Real-World Incident: The paper starts with a story about a real person who lost $2,500. They asked a chatbot to write a script to buy a cryptocurrency on a popular site called pump.fun. The chatbot, trying to be helpful, wrote code that included a link to a fake API (a digital door) that looked real but was actually a scammer's trap. The code even asked the user to hand over their "private key" (the master key to their bank vault) directly to this fake door. The user, trusting the AI, ran the code, and their money vanished in 30 minutes.
2. The Investigation: "Scam2Prompt"
The researchers built a tool called Scam2Prompt to see if this was a one-time accident or a widespread disease.
- The Analogy: Imagine a security guard who wants to test if a new security system works. Instead of trying to break in with a sledgehammer (which is obvious), the guard takes a known "bad guy's" blueprint, rewrites it to look like a normal construction request, and hands it to the security system.
- How it worked:
- They took lists of known scam websites.
- They then extracted common keywords, claims, and phrases these sites use to deceive victims. Using those terms, they prompted an AI system to generate legitimate coding requests, such as 'How do I purchase this digital coin?' or 'How can I pay through this flight platform to buy discounted tickets?'
- They fed these "innocent" requests to four major production AI models (like GPT-4o and Llama).
- They checked if the AI wrote code containing the scam links.
3. The Findings: The "Innocent" Trap
The results were alarming. Even though the requests sounded perfectly normal and came from "developers," the AI models kept generating code with malicious links.
- The Stats: In their initial test, about 4.24% of the code generated contained a scam link. That means if you asked these AIs to write code 100 times, about 4 times they would accidentally hand you a weapon.
- The "Innoc2Scam-bench": The researchers created a "stress test" list of 1,377 specific questions that always tricked the first four models into generating bad code. They then tested this list on seven newer, more advanced models released in 2025.
- The New Models: The problem didn't go away; it remained serious. The new models generated malicious code at rates ranging from 12.9% to 47.3% when tested under Innoc2Scam-bench.
- Analogy: It's like upgrading your car's engine to be faster and smarter, but the GPS system still keeps trying to drive you into a cliff because the map data was corrupted from the start.
4. The Hierarchy of Safety
The paper ranked the models like a report card:
- Top Tier (The Safest): Gemini-2.5-Pro and GPT-5. These were the best at saying "No" or refusing to answer when the request was risky. However, even they weren't perfect.
- Middle Tier: Claude-Sonnet-4.
- Bottom Tier (The Riskiest): Models like DeepSeek-Chat-v3.1 and Qwen3-Coder. These models were very eager to answer the questions but generated malicious code nearly half the time (up to 47.3%).
5. Why Current Defenses Fail
The researchers tested if existing safety tools could stop this.
- The "Guardrails": They tried using standard safety filters (like a bouncer at a club) and "Retrieval Agents" (AI that looks things up on the web to verify facts).
- The Result: The guardrails were mostly useless. They failed to catch the malicious code because the code looked syntactically correct and the requests sounded normal. The "web search" agents helped a little (reducing the risk from 50% to 29%), but they still failed to catch the majority of scams.
- The Takeaway: You can't just rely on the AI to "know better" or on a simple filter. The malicious knowledge is baked deep into the model's brain from its training data.
6. The "Ghost" Scams
One of the most chilling discoveries was that the AI models were generating links to scam sites that didn't even exist in the security databases yet.
- The Analogy: The AI models had memorized the "blueprints" of scams so well that they could reconstruct the fake websites even if the security guards hadn't caught the criminals yet. Some of these sites had been active for over a year, evading detection, yet the AI knew how to use them.
Summary
The paper concludes that AI models are currently "poisoned" by the internet's trash. Even the smartest, newest models will happily write code that steals your money if you ask them the right (but innocent-sounding) question. The current safety measures are like trying to stop a flood with a paper umbrella; they aren't strong enough. The authors suggest that we need to clean the training data better and add strict, external checks on every link the AI generates before letting a human run the code.
Drowning in papers in your field?
Get daily digests of the most novel papers matching your research keywords — with technical summaries, in your language.