ResearchEnvBench: Benchmarking Agents on Environment Synthesis for Research Code Execution

The paper introduces ResearchEnvBench, a new benchmark designed to evaluate autonomous agents' ability to synthesize complex execution environments for research code, revealing significant current limitations in dependency resolution and version management.

Yubang Wang, Chenxi Zhang, Bowen Chen, Zezheng Huai, Zihao Dai, Xinchi Chen, Yuxin Wang, Yining Zheng, Jingjing Gong, Xipeng Qiu

Published Tue, 10 Ma

Imagine you hire a super-smart robot chef to cook a complex, gourmet dish from a recipe you found on the internet.

In the past, when we tested these robot chefs, we gave them a fully stocked, pre-prepped kitchen. The knives were sharpened, the spices were measured, and the stove was already hot. The test was simply: "Can you chop the onions and stir the pot?"

But in the real world, getting a recipe to work is rarely that easy. You might find the recipe, but when you try to cook it, you realize:

  • You don't have the specific type of pan the recipe calls for.
  • The spice blend you bought is from 2019, but the recipe needs the 2024 version.
  • The stove requires a specific voltage adapter you don't have.
  • The recipe assumes you have a sous-chef to help with the heavy lifting, but you're cooking alone.

This paper introduces "ResearchEnvBench," a new test that forces the robot chef to build the kitchen from scratch before it can even think about cooking.

The Problem: The "It Works on My Machine" Trap

Scientists and researchers write code (recipes) for Artificial Intelligence. This code is often incredibly complex, requiring specific graphics cards (GPUs), specific software versions, and custom tools.

Current AI agents (the robot chefs) are great at fixing code if the environment is already set up. But if you ask them to set up the environment themselves, they often fail. They might say, "Done! Everything is ready!" when, in reality, the code would crash the second they tried to run it.

The Solution: The "Pyramid of Truth"

The authors created a new benchmark called ResearchEnvBench. Instead of just checking if the robot installed the ingredients, they check if the robot can actually cook the meal.

They use a "Pyramid of Verification" to test the agents, moving from easy to impossible:

  1. Level 1 (The Checklist): Did the robot read the recipe and list the ingredients? (Static check: are any required packages or imports missing?)
  2. Level 2 (The Dry Run): Can the robot mix the ingredients on the counter without turning on the stove? (Does the code run on a basic computer?)
  3. Level 3 (The Hardware Match): Does the robot know which stove to use? (Does the software match the specific graphics card drivers?)
  4. Level 4 (The Real Cooking): Can the robot actually cook the dish on a single burner? (Does the code actually run on one GPU?)
  5. Level 5 (The Banquet): Can the robot cook the dish using a whole team of chefs working together? (Does the code run on multiple GPUs simultaneously?)
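The pyramid above can be sketched as a ladder of increasingly expensive checks. This is a minimal illustration, not the paper's actual harness: the function names, the example package list, and the use of `nvidia-smi` as a driver probe are all my own assumptions.

```python
import importlib.util
import shutil
import subprocess

# Hypothetical sketch of the first four verification levels, cheapest first.
# None of these names come from ResearchEnvBench itself.

def level1_static(packages):
    """Level 1 (Checklist): which declared packages are not even importable?"""
    return [p for p in packages if importlib.util.find_spec(p) is None]

def level2_dry_run(script_path):
    """Level 2 (Dry Run): does the code at least parse, without executing it?"""
    try:
        with open(script_path) as f:
            compile(f.read(), script_path, "exec")
        return True
    except SyntaxError:
        return False

def level3_hardware():
    """Level 3 (Hardware Match): is a GPU driver visible on this machine?"""
    return shutil.which("nvidia-smi") is not None

def level4_single_gpu(cmd):
    """Level 4 (Real Cooking): does the code actually run end-to-end?"""
    return subprocess.run(cmd, capture_output=True).returncode == 0

missing = level1_static(["json", "nonexistent_research_lib"])
print(missing)  # → ['nonexistent_research_lib']
```

Each level only makes sense once the one below it passes: there is no point launching a multi-GPU job (Level 5) if the imports already fail at Level 1.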

The Big Surprise: The "Hallucination" Gap

The most interesting finding is that the robots are terrible at admitting when they are confused.

  • The Scenario: The robot installs 50 packages. It looks at the screen, sees no red error messages, and confidently says, "I'm ready to cook!"
  • The Reality: The robot didn't actually try to cook. It just assumed that because the ingredients were on the counter, the meal would work.
  • The Result: When the researchers forced the robot to actually run the code, it crashed. The robot had "hallucinated" that it was successful.
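The gap above boils down to two very different definitions of "done." A minimal sketch, with function names of my own invention rather than anything from the paper:

```python
import importlib
import subprocess
import sys

def trust_install_log(package):
    """The 'hallucinating' strategy: pip exited 0, so declare victory.
    A clean install log does not prove the code will run."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", package],
        capture_output=True,
    )
    return result.returncode == 0

def verify_by_running(module_name):
    """The benchmark's standard: only believe it if the code actually executes."""
    try:
        importlib.import_module(module_name)
        return True
    except Exception:  # version conflicts, missing native libs, etc.
        return False
```

An install can succeed while the import still crashes on a version conflict or a missing native library, which is exactly the gap the benchmark exposes.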

In the paper, they found that even the best AI agents only succeeded in getting the code to actually run on multiple GPUs about 37% of the time. The rest of the time, they were just guessing.

Why This Matters

This isn't just about fixing code; it's about reproducibility.

  • If a scientist publishes a breakthrough discovery, other scientists need to be able to run that code to verify it.
  • If AI agents can't set up the environment correctly, we can't trust their experiments.
  • This benchmark forces AI to stop guessing and start verifying. It's the difference between a robot saying "I think I can build a bridge" and a robot actually driving a truck across the bridge to prove it holds.

The Takeaway

The paper argues that we need to stop testing AI on "easy mode" (pre-configured kitchens) and start testing them on "hard mode" (building the kitchen from scratch). Until AI agents can reliably set up their own complex, hardware-heavy environments, they aren't ready to take over scientific research.

In short: The robots are great at following instructions, but they are currently terrible at building the stage where the instructions are supposed to happen. ResearchEnvBench is the new test to see if they can finally learn to build the stage.