daVinci-Env: Open SWE Environment Synthesis at Scale

The paper introduces OpenSWE (daVinci-Env), a large-scale, fully open-source framework that uses a multi-agent pipeline to synthesize over 45,000 executable Docker environments and 13,000 curated training trajectories. Software engineering agents trained in these environments achieve record performance on SWE-bench while also improving their general reasoning capabilities.

Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Liu

Published 2026-03-16

Imagine you want to teach a robot how to be a master software engineer. You can't just give it a textbook and say, "Go fix this bug." The robot needs a playground where it can try, fail, see what went wrong, and try again.

This paper introduces OpenSWE, which is essentially the world's biggest, most transparent, and highest-quality "software engineering playground" ever built.

Here is the breakdown of what they did, using some everyday analogies:

1. The Problem: The "Black Box" vs. The "Empty Gym"

Before OpenSWE, there were two problems for researchers trying to train these AI agents:

  • The Industrial "Black Box": Big tech companies had massive, perfect training environments, but they kept them locked away. It was like trying to learn to swim by watching a professional swimmer in a pool you can't enter.
  • The Open-Source "Empty Gym": The free tools available were like a gym with broken equipment. They were too small, the machines didn't work properly, or the exercises were too easy to be useful.

2. The Solution: Building a "Super Gym" (OpenSWE)

The team (from SII, GAIR, and SJTU) decided to build the ultimate gym from scratch.

  • The Scale: They built 45,320 distinct "rooms" (Docker environments). Each room contains a different software project with a specific bug to fix.
  • The Cost: This wasn't cheap. They spent about $1.5 million (roughly $900k to build the rooms and $576k to train the robots inside them).
  • The Transparency: Unlike the big companies, they threw open the doors. They released the blueprints (Dockerfiles), the rules (evaluation scripts), and the construction manual (the AI pipeline) so anyone can use it.

3. How They Built It: The "Construction Crew"

They didn't build these rooms by hand. They built a team of AI workers (a multi-agent system) that ran on a cluster of 64 machines. Think of this crew as a specialized construction team:

  • The Scout: Finds the bug in the code and reads the instructions.
  • The Architect: Builds the room (the Docker container) so the software runs correctly.
  • The Inspector: Checks if the room is safe and if the tests actually work.
  • The Referee: Runs the test. If the AI fixes the bug, the referee says "Pass." If not, they send it back to the Architect to fix the room.
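The crew's workflow can be sketched as a simple build-and-retry loop. This is a minimal illustration only: the role functions, the `Task` fields, and the retry logic below are hypothetical stand-ins for the paper's actual pipeline, which the authors released separately.

```python
from dataclasses import dataclass

@dataclass
class Task:
    repo: str
    issue: str
    dockerfile: str = ""
    tests_ok: bool = False

def scout(task: Task) -> Task:
    # Locate the bug report and gather build instructions from the repo.
    task.issue = task.issue.strip()
    return task

def architect(task: Task, attempt: int) -> Task:
    # Write (or rewrite) the Dockerfile so the project builds and runs.
    task.dockerfile = f"FROM python:3.11  # build attempt {attempt} for {task.repo}"
    return task

def inspector(task: Task) -> bool:
    # Verify the environment is sound and the test suite actually executes.
    return bool(task.dockerfile)

def referee(task: Task, attempt: int) -> bool:
    # Run the tests and give a pass/fail verdict. Here we pretend the
    # first build is flaky and only the second attempt succeeds.
    return attempt >= 2

def build_environment(task: Task, max_attempts: int = 3) -> Task:
    task = scout(task)
    for attempt in range(1, max_attempts + 1):
        task = architect(task, attempt)
        if inspector(task) and referee(task, attempt):
            task.tests_ok = True   # environment accepted
            break
        # otherwise: send it back to the Architect and retry
    return task

env = build_environment(Task(repo="example/widget", issue=" Fix #42 "))
print(env.tests_ok)  # → True (accepted on the second attempt)
```

The key design point is the feedback loop: a failed verdict from the Referee doesn't discard the room, it routes the task back to the Architect for another build attempt.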

4. The Secret Sauce: Quality Control (The "Filter")

Just having 45,000 rooms isn't enough. Some rooms are broken, and some are too easy (like asking a robot to tie its own shoelaces).

  • The "Unsolvable" Trap: Sometimes the bug description doesn't match the actual code. It's like a teacher giving a math problem with no solution. The AI would just spin its wheels.
  • The "Trivial" Trap: Sometimes the problem is so obvious (e.g., "Change 'ra' to 'raise'") that it teaches the AI nothing.
  • The Filter: They used a smart filtering process to throw out the broken rooms and the boring ones. They kept only the "Goldilocks" problems: hard enough to be challenging, but solvable enough to be a good lesson. This left them with about 13,000 high-quality training sessions.
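The two traps above can be expressed as a simple keep/discard rule. The thresholds and field names here are illustrative assumptions, not the paper's actual filter: imagine each candidate task records how often a probe agent solved it across several trial runs, plus the size of the ground-truth fix.

```python
def keep_task(pass_rate: float, patch_lines: int) -> bool:
    """Hypothetical 'Goldilocks' filter for candidate training tasks."""
    if pass_rate == 0.0:    # the "unsolvable" trap: nobody ever passes
        return False
    if pass_rate == 1.0:    # the "trivial" trap: everyone always passes
        return False
    if patch_lines < 2:     # one-character fixes teach the model nothing
        return False
    return True             # hard but solvable: keep it

candidates = [
    {"id": "broken-spec", "pass_rate": 0.0, "patch_lines": 14},
    {"id": "typo-fix",    "pass_rate": 1.0, "patch_lines": 1},
    {"id": "real-bug",    "pass_rate": 0.4, "patch_lines": 22},
]
kept = [t["id"] for t in candidates
        if keep_task(t["pass_rate"], t["patch_lines"])]
print(kept)  # → ['real-bug']
```

Applied at scale, a rule like this is how ~45,000 raw environments shrink to roughly 13,000 high-quality training sessions: only tasks in the middle difficulty band survive.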

5. The Results: The AI Gets Smarter

They trained their AI models (OpenSWE-32B and OpenSWE-72B) in this new gym and tested them on a famous exam called SWE-bench.

  • The Score: They got the highest scores ever recorded, 62.4% (OpenSWE-32B) and 66.0% (OpenSWE-72B), beating all previous methods.
  • The "No Saturation" Discovery: Usually, after a while, adding more data doesn't help (you hit a wall). But here, the more they trained, the better the AI got. It was like a student who never stops improving, no matter how many hours they study.
  • Bonus Skills: Interestingly, training on these coding problems also made the AI better at math and science. Learning to debug code seems to teach general logical reasoning, which transfers to other domains. Crucially, this came without forgetting: the models still retained factual knowledge (like history dates).

Summary

OpenSWE is like building a massive, state-of-the-art driving school for AI cars. Instead of letting them crash into walls or drive on empty tracks, they built thousands of realistic, challenging driving scenarios. The result? The AI drivers learned faster, drove better, and even became better at navigating complex maps (math/science) than ever before. And the best part? They gave the blueprints to the whole world for free.
