EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation

This paper introduces EVM-QuestBench, an execution-grounded benchmark designed to rigorously evaluate the safety and accuracy of natural-language transaction code generation on EVM-compatible chains through dynamic, fork-based testing of 107 atomic and composite tasks.

Pei Yang, Wanyi Chen, Ke Wang, Lynn Ai, Eric Yang, Tianyu Shi

Published Wed, 11 Ma

Imagine you have a very smart, very talkative robot assistant. You can ask it to do almost anything: "Write a poem," "Plan a vacation," or "Fix a bug in my code." But what happens if you ask it to handle your money?

In the world of blockchain (specifically Ethereum and similar chains), a tiny mistake isn't just a typo; it's like accidentally sending your life savings to the wrong person. Once that money is gone, it's gone forever. There is no "Undo" button.

This paper introduces EVM-QuestBench, a new "driving test" for these AI assistants, specifically designed to see if they can safely handle financial transactions on the blockchain.

Here is the breakdown of how it works, using some simple analogies:

1. The Problem: The "Fake" vs. The "Real"

Previously, people tested AI code by checking if the words looked similar to the correct answer (like a teacher grading a spelling test).

  • The Old Way: If the AI wrote code that looked like it should swap tokens, but the math was wrong, it might still get a high grade because the words were right.
  • The Reality: In blockchain, if the math is wrong, the money vanishes.
  • The New Way (EVM-QuestBench): Instead of just reading the answer, this benchmark actually runs the code in a safe, simulated environment. It's like giving the AI a test drive in a real car (or a very realistic simulator) instead of just asking it to describe how to drive.

2. The Test Structure: Two Types of Challenges

The test is divided into two levels, like a video game with different stages:

  • Level 1: Atomic Tasks (The "Single Shot")

    • Analogy: Asking the AI to "Send $10 to Bob."
    • The Test: The AI must get the address right, the amount right, and the units right (e.g., knowing that 1 ETH = 1,000,000,000,000,000,000 "wei").
    • Goal: Can it do one simple thing perfectly?
  • Level 2: Composite Tasks (The "Obstacle Course")

    • Analogy: Asking the AI to "Swap $100 of ETH for USDT, then add that USDT to a liquidity pool, and finally stake it."
    • The Test: This requires a plan. The AI can't just swap; it must first approve the spending, then swap, then add liquidity, then stake. If it forgets the first step, the whole thing fails.
    • Goal: Can it plan a multi-step journey without getting lost?
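Both levels above can be sketched in a few lines of Python: the Level 1 unit conversion (ETH to wei) and the Level 2 ordering constraint (approve must come before the swap that spends the allowance). The step names and the validity check are illustrative only, not the benchmark's actual grader:

```python
from decimal import Decimal

WEI_PER_ETH = 10**18  # 1 ETH = 10^18 wei, the EVM's smallest unit

def eth_to_wei(amount_eth: str) -> int:
    """Convert a human-readable ETH amount to integer wei.

    Decimal avoids float rounding: 0.1 ETH must become exactly
    100000000000000000 wei, not 99999999999999999.
    """
    return int(Decimal(amount_eth) * WEI_PER_ETH)

def plan_is_valid(steps: list[str]) -> bool:
    """Reject composite plans that try to swap tokens before granting
    the token allowance (hypothetical step names for illustration)."""
    return ("approve" in steps and "swap" in steps
            and steps.index("approve") < steps.index("swap"))

# A Level 2 composite task is an ordered plan, not a bag of actions:
plan = ["approve", "swap", "add_liquidity", "stake"]
```

Skipping the `approve` step, or placing it after the `swap`, is exactly the kind of ordering mistake that makes the whole composite task fail on-chain.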

3. The "Magic" of Dynamic Numbers

Most tests use fixed numbers (e.g., "Always send 5 coins"). If an AI memorized the answer "5," it would pass.

  • EVM-QuestBench's Trick: Every time the test runs, the numbers change randomly. One time it might be "Send 0.37 coins," the next time "Send 99.12 coins."
  • Why? This forces the AI to actually understand the math and the logic, rather than just memorizing a pattern. It's like a math teacher who changes the numbers on every test so students can't just memorize the answers.
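A minimal sketch of this kind of randomization, assuming a simple prompt template and amount range (both invented for illustration; the benchmark's actual sampling scheme may differ):

```python
import random

def instantiate_task(template: str, rng: random.Random) -> tuple[str, float]:
    """Fill a task template with a freshly sampled amount, so a model
    cannot pass by reproducing a memorized fixed answer."""
    amount = round(rng.uniform(0.01, 100.0), 2)  # range is illustrative
    return template.format(amount=amount), amount

rng = random.Random(42)  # seeded here only so the example is reproducible
prompt, expected = instantiate_task("Send {amount} ETH to Bob", rng)
# The grader later checks the executed transaction against `expected`,
# not against any hard-coded value.
```

Because every run draws a fresh amount, the only way to pass consistently is to actually parse the number out of the prompt and use it correctly.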

4. The "Safe Sandbox"

How do they test this without losing real money?

  • They use a Forked Chain. Imagine a perfect, digital twin of the real blockchain. It looks and acts exactly like the real thing, but it's a simulation.
  • They take a "snapshot" of the state before the AI tries a task. If the AI fails or crashes, they just rewind the snapshot and try again. It's like a "Save State" in a video game.
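The snapshot-and-rewind loop can be modeled with a toy in-memory "chain". Real harnesses do this against a forked node (for example via the `evm_snapshot` / `evm_revert` RPC methods that tools like Anvil and Hardhat expose); this dict-based sandbox is only an illustration of the mechanic:

```python
import copy

class ForkSandbox:
    """Toy model of fork-based testing: snapshot state before each
    attempt, rewind if the attempt fails. The 'chain' is just a dict."""

    def __init__(self, state: dict):
        self.state = state
        self._snapshots: list[dict] = []

    def snapshot(self) -> int:
        """Store a deep copy of the current state; return its id."""
        self._snapshots.append(copy.deepcopy(self.state))
        return len(self._snapshots) - 1

    def revert(self, snapshot_id: int) -> None:
        """Restore the saved state and discard later snapshots."""
        self.state = self._snapshots[snapshot_id]
        del self._snapshots[snapshot_id:]

# Each AI attempt runs between snapshot() and, on failure, revert().
sandbox = ForkSandbox({"alice": 100, "bob": 0})
sid = sandbox.snapshot()
sandbox.state["alice"] -= 999   # buggy transaction drains the balance
sandbox.revert(sid)             # rewind: the mistake never happened
```

After the revert, Alice's balance is back to 100, so a failed attempt costs nothing and the next attempt starts from a clean state.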

5. The Results: Who Passed?

They tested 20 different AI models. Here is what they found:

  • The "Precision" Models: Some AIs were great at single tasks (Level 1) but terrible at planning long sequences (Level 2). They could send money but couldn't plan a complex investment strategy.
  • The "Planner" Models: Some AIs were great at the long, complex plans but made silly mistakes on simple math.
  • The "Code Specialists": Interestingly, some models specifically trained on coding failed the planning parts. They wrote code that looked perfect but couldn't figure out the order of operations (like trying to hang a door before the house is built).

The Big Takeaway

This paper isn't just about ranking AI models; it's about safety.

It shows that just because an AI can write code that looks good doesn't mean it can handle real money. We need a new kind of test, one that actually runs the code in a realistic, changing environment, to make sure these robots won't accidentally drain our bank accounts.

In short: EVM-QuestBench is the "Driver's License Exam" for AI in the world of crypto, ensuring they can actually drive the car before letting them on the highway with your money.