Re-Evaluating EVMbench: Are AI Agents Ready for Smart Contract Security?

This paper challenges the optimistic narrative of fully automated AI auditing by demonstrating that EVMbench's results are compromised by data contamination and narrow evaluation, revealing that current AI agents lack the stability and end-to-end exploitation capabilities to replace human auditors in smart contract security.

Chaoyuan Peng, Lei Wu, Yajin Zhou

Published Thu, 12 Ma

Imagine the world of cryptocurrency smart contracts as a massive, high-stakes bank vault. Before anyone can put their money inside, they need to check the locks, the alarms, and the steel doors for any weaknesses. This is called "auditing."

Recently, a group of tech giants released a report (called EVMbench) claiming that AI agents (super-smart computer programs) were ready to take over this job entirely. They said, "AI can find 45% of the bugs and actually break into 72% of the test vaults! We don't need humans anymore!"

This paper, written by researchers from Zhejiang University and BlockSec, is like a skeptical detective saying, "Hold on a minute. Let's check the math."

Here is the breakdown of what they found, using simple analogies:

1. The "Practice Exam" Problem (Data Contamination)

The Old Report: The original AI test used questions from a "practice exam" (Code4rena contests) that the AI models might have already seen in their training data.
The New Reality: It's like giving a student a math test where the answers were printed in the back of their textbook. Of course, they got a high score!
The Fix: The researchers built a brand-new test from 22 real-world "bank robberies" (actual smart-contract exploits) that happened after the AI models' training data was collected. The AI could not have seen these crimes before.
The Result: When the AI faced these "fresh" crimes, its performance dropped significantly. It could spot the idea of a crime sometimes, but it couldn't actually pull off the heist.
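The "fresh exam" idea boils down to a simple date filter: an incident is only fair game for a model if it happened after that model's training-data cutoff. Here is a minimal sketch of that filter; the incident names, dates, and the cutoff are all invented for illustration and are not from the paper.

```python
from datetime import date

# Hypothetical training-data cutoff for some model (invented for this sketch).
MODEL_CUTOFF = date(2024, 10, 1)

# Invented incidents; only those AFTER the cutoff are contamination-free.
incidents = [
    {"name": "incident_a", "occurred": date(2024, 6, 3)},   # before cutoff -> excluded
    {"name": "incident_b", "occurred": date(2025, 1, 15)},  # after cutoff  -> included
    {"name": "incident_c", "occurred": date(2025, 3, 2)},   # after cutoff  -> included
]

def contamination_free(incidents, cutoff):
    """Keep only incidents the model cannot have seen during training."""
    return [i for i in incidents if i["occurred"] > cutoff]

fresh = contamination_free(incidents, MODEL_CUTOFF)
print([i["name"] for i in fresh])  # -> ['incident_b', 'incident_c']
```

The filter is trivial, which is the point: keeping the "practice exam" out of the test set is a methodology choice, not a hard engineering problem.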

2. The "Toolbox" Matters More Than the "Brain"

The Old Report: The original test mostly paired each AI with its own brand's tools (e.g., Google's AI with Google's tools, Anthropic's AI with Anthropic's tools).
The New Reality: The researchers found that the tools (the "scaffold") matter just as much as the brain (the AI model).
The Analogy: Imagine a master chef (the AI). If you give them a rusty, broken knife (a bad tool), they can't cook a great meal. But if you give them a sharp, professional knife (a good open-source tool), they suddenly become a genius.
The Result: An older, open-source toolset actually helped the AI perform better than the fancy, brand-new tools the original report used. The original report blamed the "brain" for low scores, but it was actually the "knife" that was dull.

3. Finding the Bug vs. Stealing the Money

The Old Report: The original conclusion was: "Finding the bug is the hard part. Once found, stealing the money is easy."
The New Reality: The researchers found that finding the bug and stealing the money are two completely different skills.
The Analogy: Imagine a detective who can look at a house and say, "Hey, the back door is unlocked!" (Finding the bug). That's great. But can that same detective then sneak in, avoid the laser grid, pick the safe, and walk out with the diamonds without getting caught? (Exploiting the bug).
The Result: The AI was okay at pointing out the unlocked door (finding bugs), but it completely failed at actually stealing the diamonds in real-world scenarios. In the new test, the AI made 110 exploit attempts against real-world scenarios and succeeded 0 times.

4. The "Rankings" Are Unstable

The Old Report: They said, "Model A is the best, Model B is second."
The New Reality: The rankings changed depending on which tools you used or which specific test you gave them.
The Analogy: It's like saying "Usain Bolt is the fastest runner in the world." But then you test him on a track made of mud, and suddenly a marathon runner wins. If you test him on a treadmill, a different runner wins. You can't declare a single "champion" when the rules and the track keep changing.
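Why does changing the "track" reshuffle the leaderboard? Because each model's score depends on the scaffold it runs in, so the ordering is a property of the (model, scaffold) pair, not of the model alone. A toy sketch with entirely invented numbers makes the fragility visible:

```python
# Invented scores for the same three models under two different scaffolds
# (toolsets). None of these numbers come from the paper; they only show
# how the ordering can flip when the scaffold changes.
scores = {
    "scaffold_x": {"model_a": 0.45, "model_b": 0.38, "model_c": 0.30},
    "scaffold_y": {"model_a": 0.28, "model_b": 0.41, "model_c": 0.35},
}

# Rank models from best to worst within each scaffold.
rankings = {s: sorted(r, key=r.get, reverse=True) for s, r in scores.items()}

print(rankings["scaffold_x"])  # -> ['model_a', 'model_b', 'model_c']
print(rankings["scaffold_y"])  # -> ['model_b', 'model_c', 'model_a']
```

Same models, different "champion" under each scaffold; a single-number leaderboard hides that dependence.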

The Bottom Line: What Should We Do?

The paper concludes that AI is not ready to replace human auditors entirely. It's not a "magic wand" that solves everything.

  • For Developers: Think of AI as a spell-checker. It's great at catching typos (common, well-known bug patterns, like a function that forgets to check who is calling it). But it can't write the whole story for you, and it might miss the deep, complex plot holes.
  • For Security Firms: The best approach is a "Human-in-the-Loop" workflow.
    • The AI acts as a tireless intern who scans thousands of pages of code quickly to find the obvious mistakes.
    • The Human acts as the senior detective who takes the AI's notes, adds context, understands the complex logic, and makes the final judgment.

In short: AI is a powerful assistant, but it's not the boss yet. If you let it work alone, you might think your vault is secure when it's actually wide open. The future isn't "AI vs. Humans"; it's "AI + Humans" working together.