EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation

This paper introduces EVM-QuestBench, an execution-grounded benchmark designed to rigorously evaluate the safety and accuracy of natural-language transaction code generation on EVM-compatible chains through dynamic, fork-based testing of 107 atomic and composite tasks.

Pei Yang, Wanyi Chen, Ke Wang, Lynn Ai, Eric Yang, Tianyu Shi

Published Wed, 11 Ma

Imagine you have a very smart, very talkative robot assistant. You can ask it to do almost anything: "Write a poem," "Plan a vacation," or "Fix a bug in my code." But what happens if you ask it to handle your money?

In the world of blockchain (specifically Ethereum and similar chains), a tiny mistake isn't just a typo; it's like accidentally sending your life savings to the wrong person. Once that money is gone, it's gone forever. There is no "Undo" button.

This paper introduces EVM-QuestBench, a new "driving test" for these AI assistants, specifically designed to see if they can safely handle financial transactions on the blockchain.

Here is the breakdown of how it works, using some simple analogies:

1. The Problem: The "Fake" vs. The "Real"

Previously, people tested AI code by checking if the words looked similar to the correct answer (like a teacher grading a spelling test).

  • The Old Way: If the AI wrote code that looked like it should swap tokens, but the math was wrong, it might still get a high grade because the words were right.
  • The Reality: In blockchain, if the math is wrong, the money vanishes.
  • The New Way (EVM-QuestBench): Instead of just reading the answer, this benchmark actually runs the code in a safe, simulated environment. It's like giving the AI a test drive in a real car (or a very realistic simulator) instead of just asking it to describe how to drive.

2. The Test Structure: Two Types of Challenges

The test is divided into two levels, like a video game with different stages:

  • Level 1: Atomic Tasks (The "Single Shot")

    • Analogy: Asking the AI to "Send $10 to Bob."
    • The Test: The AI must get the address right, the amount right, and the units right (e.g., knowing that 1 ETH = 1,000,000,000,000,000,000 "wei").
    • Goal: Can it do one simple thing perfectly?
  • Level 2: Composite Tasks (The "Obstacle Course")

    • Analogy: Asking the AI to "Swap $100 of ETH for USDT, then add that USDT to a liquidity pool, and finally stake it."
    • The Test: This requires a plan. The AI can't just swap; it must first approve the spending, then swap, then add liquidity, then stake. If it forgets the first step, the whole thing fails.
    • Goal: Can it plan a multi-step journey without getting lost?
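Both levels above can be sketched in a few lines of Python: the Level 1 unit conversion (ETH to wei) and the Level 2 ordering constraint (approve must come before the swap that spends the allowance). The step names and the validity check are illustrative only, not the benchmark's actual grader:

```python
from decimal import Decimal

WEI_PER_ETH = 10**18  # 1 ETH = 10^18 wei, the EVM's smallest unit

def eth_to_wei(amount_eth: str) -> int:
    """Convert a human-readable ETH amount to integer wei.

    Decimal avoids float rounding: 0.1 ETH must become exactly
    100000000000000000 wei, not 99999999999999999.
    """
    return int(Decimal(amount_eth) * WEI_PER_ETH)

def plan_is_valid(steps: list[str]) -> bool:
    """Reject composite plans that try to swap tokens before granting
    the token allowance (hypothetical step names for illustration)."""
    return ("approve" in steps and "swap" in steps
            and steps.index("approve") < steps.index("swap"))

# A Level 2 composite task is an ordered plan, not a bag of actions:
plan = ["approve", "swap", "add_liquidity", "stake"]
```

Skipping the `approve` step, or placing it after the `swap`, is exactly the kind of ordering mistake that makes the whole composite task fail on-chain.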

3. The "Magic" of Dynamic Numbers

Most tests use fixed numbers (e.g., "Always send 5 coins"). If an AI memorized the answer "5," it would pass.

  • EVM-QuestBench's Trick: Every time the test runs, the numbers change randomly. One time it might be "Send 0.37 coins," the next time "Send 99.12 coins."
  • Why? This forces the AI to actually understand the math and the logic, rather than just memorizing a pattern. It's like a math teacher who changes the numbers on every test so students can't just memorize the answers.
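A minimal sketch of this kind of randomization, assuming a simple prompt template and amount range (both invented for illustration; the benchmark's actual sampling scheme may differ):

```python
import random

def instantiate_task(template: str, rng: random.Random) -> tuple[str, float]:
    """Fill a task template with a freshly sampled amount, so a model
    cannot pass by reproducing a memorized fixed answer."""
    amount = round(rng.uniform(0.01, 100.0), 2)  # range is illustrative
    return template.format(amount=amount), amount

rng = random.Random(42)  # seeded here only so the example is reproducible
prompt, expected = instantiate_task("Send {amount} ETH to Bob", rng)
# The grader later checks the executed transaction against `expected`,
# not against any hard-coded value.
```

Because every run draws a fresh amount, the only way to pass consistently is to actually parse the number out of the prompt and use it correctly.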

4. The "Safe Sandbox"

How do they test this without losing real money?

  • They use a Forked Chain. Imagine a perfect, digital twin of the real blockchain. It looks and acts exactly like the real thing, but it's a simulation.
  • They take a "snapshot" of the state before the AI tries a task. If the AI fails or crashes, they just rewind the snapshot and try again. It's like a "Save State" in a video game.
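The snapshot-and-rewind loop can be modeled with a toy in-memory "chain". Real harnesses do this against a forked node (for example via the `evm_snapshot` / `evm_revert` RPC methods that tools like Anvil and Hardhat expose); this dict-based sandbox is only an illustration of the mechanic:

```python
import copy

class ForkSandbox:
    """Toy model of fork-based testing: snapshot state before each
    attempt, rewind if the attempt fails. The 'chain' is just a dict."""

    def __init__(self, state: dict):
        self.state = state
        self._snapshots: list[dict] = []

    def snapshot(self) -> int:
        """Store a deep copy of the current state; return its id."""
        self._snapshots.append(copy.deepcopy(self.state))
        return len(self._snapshots) - 1

    def revert(self, snapshot_id: int) -> None:
        """Restore the saved state and discard later snapshots."""
        self.state = self._snapshots[snapshot_id]
        del self._snapshots[snapshot_id:]

# Each AI attempt runs between snapshot() and, on failure, revert().
sandbox = ForkSandbox({"alice": 100, "bob": 0})
sid = sandbox.snapshot()
sandbox.state["alice"] -= 999   # buggy transaction drains the balance
sandbox.revert(sid)             # rewind: the mistake never happened
```

After the revert, Alice's balance is back to 100, so a failed attempt costs nothing and the next attempt starts from a clean state.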

5. The Results: Who Passed?

They tested 20 different AI models. Here is what they found:

  • The "Precision" Models: Some AIs were great at single tasks (Level 1) but terrible at planning long sequences (Level 2). They could send money but couldn't plan a complex investment strategy.
  • The "Planner" Models: Some AIs were great at the long, complex plans but made silly mistakes on simple math.
  • The "Code Specialists": Interestingly, some models specifically trained on coding failed the planning parts. They wrote code that looked perfect but couldn't figure out the order of operations (like trying to hang a door before the house is built).

The Big Takeaway

This paper isn't just about ranking AI models; it's about safety.

It shows that just because an AI can write code that looks good doesn't mean it can handle real money. We need a new kind of test, one that actually runs the code in a realistic, changing environment, to make sure these robots won't accidentally drain our bank accounts.

In short: EVM-QuestBench is the "Driver's License Exam" for AI in the world of crypto, ensuring they can actually drive the car before letting them on the highway with your money.