ResearchEnvBench: Benchmarking Agents on Environment Synthesis for Research Code Execution

The paper introduces ResearchEnvBench, a new benchmark designed to evaluate autonomous agents' ability to synthesize complex execution environments for research code, revealing significant current limitations in dependency resolution and version management.

Yubang Wang, Chenxi Zhang, Bowen Chen, Zezheng Huai, Zihao Dai, Xinchi Chen, Yuxin Wang, Yining Zheng, Jingjing Gong, Xipeng Qiu

Published Tue, 10 Ma

Imagine you hire a super-smart robot chef to cook a complex, gourmet dish from a recipe you found on the internet.

In the past, when we tested these robot chefs, we gave them a fully stocked, pre-prepped kitchen. The knives were sharpened, the spices were measured, and the stove was already hot. The test was simply: "Can you chop the onions and stir the pot?"

But in the real world, getting a recipe to work is rarely that easy. You might find the recipe, but when you try to cook it, you realize:

  • You don't have the specific type of pan the recipe calls for.
  • The spice blend you bought is from 2019, but the recipe needs the 2024 version.
  • The stove requires a specific voltage adapter you don't have.
  • The recipe assumes you have a sous-chef to help with the heavy lifting, but you're cooking alone.

This paper introduces "ResearchEnvBench," a new test that forces the robot chef to build the kitchen from scratch before it can even think about cooking.

The Problem: The "It Works on My Machine" Trap

Scientists and researchers write code (recipes) for Artificial Intelligence. This code is often incredibly complex, requiring specific graphics cards (GPUs), specific software versions, and custom tools.

Current AI agents (the robot chefs) are great at fixing code if the environment is already set up. But if you ask them to set up the environment themselves, they often fail. They might say, "Done! Everything is ready!" when, in reality, the code would crash the second they tried to run it.

The Solution: The "Pyramid of Truth"

The authors created a new benchmark called ResearchEnvBench. Instead of just checking if the robot installed the ingredients, they check if the robot can actually cook the meal.

They use a "Pyramid of Verification" to test the agents, moving from easy to impossible:

  1. Level 1 (The Checklist): Did the robot read the recipe and list the ingredients? (Static check: are any required packages or imports missing?)
  2. Level 2 (The Dry Run): Can the robot mix the ingredients on the counter without turning on the stove? (Does the code run on a basic computer?)
  3. Level 3 (The Hardware Match): Does the robot know which stove to use? (Does the software match the specific graphics card drivers?)
  4. Level 4 (The Real Cooking): Can the robot actually cook the dish on a single burner? (Does the code actually run on one GPU?)
  5. Level 5 (The Banquet): Can the robot cook the dish using a whole team of chefs working together? (Does the code run on multiple GPUs simultaneously?)
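The pyramid above can be sketched as a ladder of increasingly expensive checks. This is a minimal illustration, not the paper's actual harness: the function names, the example package list, and the use of `nvidia-smi` as a driver probe are all my own assumptions.

```python
import importlib.util
import shutil
import subprocess

# Hypothetical sketch of the first four verification levels, cheapest first.
# None of these names come from ResearchEnvBench itself.

def level1_static(packages):
    """Level 1 (Checklist): which declared packages are not even importable?"""
    return [p for p in packages if importlib.util.find_spec(p) is None]

def level2_dry_run(script_path):
    """Level 2 (Dry Run): does the code at least parse, without executing it?"""
    try:
        with open(script_path) as f:
            compile(f.read(), script_path, "exec")
        return True
    except SyntaxError:
        return False

def level3_hardware():
    """Level 3 (Hardware Match): is a GPU driver visible on this machine?"""
    return shutil.which("nvidia-smi") is not None

def level4_single_gpu(cmd):
    """Level 4 (Real Cooking): does the code actually run end-to-end?"""
    return subprocess.run(cmd, capture_output=True).returncode == 0

missing = level1_static(["json", "nonexistent_research_lib"])
print(missing)  # → ['nonexistent_research_lib']
```

Each level only makes sense once the one below it passes: there is no point launching a multi-GPU job (Level 5) if the imports already fail at Level 1.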

The Big Surprise: The "Hallucination" Gap

The most interesting finding is that the robots are terrible at admitting when they are confused.

  • The Scenario: The robot installs 50 packages. It looks at the screen, sees no red error messages, and confidently says, "I'm ready to cook!"
  • The Reality: The robot didn't actually try to cook. It just assumed that because the ingredients were on the counter, the meal would work.
  • The Result: When the researchers forced the robot to actually run the code, it crashed. The robot had "hallucinated" that it was successful.
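The gap above boils down to two very different definitions of "done." A minimal sketch, with function names of my own invention rather than anything from the paper:

```python
import importlib
import subprocess
import sys

def trust_install_log(package):
    """The 'hallucinating' strategy: pip exited 0, so declare victory.
    A clean install log does not prove the code will run."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", package],
        capture_output=True,
    )
    return result.returncode == 0

def verify_by_running(module_name):
    """The benchmark's standard: only believe it if the code actually executes."""
    try:
        importlib.import_module(module_name)
        return True
    except Exception:  # version conflicts, missing native libs, etc.
        return False
```

An install can succeed while the import still crashes on a version conflict or a missing native library, which is exactly the gap the benchmark exposes.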

In the paper, they found that even the best AI agents only succeeded in getting the code to actually run on multiple GPUs about 37% of the time. The rest of the time, they were just guessing.

Why This Matters

This isn't just about fixing code; it's about reproducibility.

  • If a scientist publishes a breakthrough discovery, other scientists need to be able to run that code to verify it.
  • If AI agents can't set up the environment correctly, we can't trust their experiments.
  • This benchmark forces AI to stop guessing and start verifying. It's the difference between a robot saying "I think I can build a bridge" and a robot actually driving a truck across the bridge to prove it holds.

The Takeaway

The paper argues that we need to stop testing AI on "easy mode" (pre-configured kitchens) and start testing them on "hard mode" (building the kitchen from scratch). Until AI agents can reliably set up their own complex, hardware-heavy environments, they aren't ready to take over scientific research.

In short: The robots are great at following instructions, but they are currently terrible at building the stage where the instructions are supposed to happen. ResearchEnvBench is the new test to see if they can finally learn to build the stage.