daVinci-Env: Open SWE Environment Synthesis at Scale

The paper introduces OpenSWE (daVinci-Env), a large-scale, fully open-source framework that uses a multi-agent pipeline to synthesize over 45,000 executable Docker environments and 13,000 curated training trajectories. Software engineering agents trained in these environments achieve record performance on SWE-bench while also improving their general reasoning capabilities.

Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Liu

Published 2026-03-16

Imagine you want to teach a robot how to be a master software engineer. You can't just give it a textbook and say, "Go fix this bug." The robot needs a playground where it can try, fail, see what went wrong, and try again.

This paper introduces OpenSWE, which is essentially the world's biggest, most transparent, and highest-quality "software engineering playground" ever built.

Here is the breakdown of what they did, using some everyday analogies:

1. The Problem: The "Black Box" vs. The "Empty Gym"

Before OpenSWE, there were two problems for researchers trying to train these AI agents:

  • The Industrial "Black Box": Big tech companies had massive, perfect training environments, but they kept them locked away. It was like trying to learn to swim by watching a professional swimmer in a pool you can't enter.
  • The Open-Source "Empty Gym": The free tools available were like a gym with broken equipment. They were too small, the machines didn't work properly, or the exercises were too easy to be useful.

2. The Solution: Building a "Super Gym" (OpenSWE)

The team (from SII, GAIR, and SJTU) decided to build the ultimate gym from scratch.

  • The Scale: They built 45,320 distinct "rooms" (Docker environments). Each room contains a different software project with a specific bug to fix.
  • The Cost: This wasn't cheap. They spent about $1.5 million (roughly $900k to build the rooms and $576k to train the robots inside them).
  • The Transparency: Unlike the big companies, they threw open the doors. They released the blueprints (Dockerfiles), the rules (evaluation scripts), and the construction manual (the AI pipeline) so anyone can use it.

3. How They Built It: The "Construction Crew"

They didn't build these rooms by hand. They built a team of AI workers (a multi-agent system) that ran on a cluster of 64 machines. Think of this crew as a specialized construction team:

  • The Scout: Finds the bug in the code and reads the instructions.
  • The Architect: Builds the room (the Docker container) so the software runs correctly.
  • The Inspector: Checks if the room is safe and if the tests actually work.
  • The Referee: Runs the test. If the AI fixes the bug, the referee says "Pass." If not, they send it back to the Architect to fix the room.
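The crew's workflow can be sketched as a simple build-and-retry loop. This is a minimal illustration only: the role functions, the `Task` fields, and the retry logic below are hypothetical stand-ins for the paper's actual pipeline, which the authors released separately.

```python
from dataclasses import dataclass

@dataclass
class Task:
    repo: str
    issue: str
    dockerfile: str = ""
    tests_ok: bool = False

def scout(task: Task) -> Task:
    # Locate the bug report and gather build instructions from the repo.
    task.issue = task.issue.strip()
    return task

def architect(task: Task, attempt: int) -> Task:
    # Write (or rewrite) the Dockerfile so the project builds and runs.
    task.dockerfile = f"FROM python:3.11  # build attempt {attempt} for {task.repo}"
    return task

def inspector(task: Task) -> bool:
    # Verify the environment is sound and the test suite actually executes.
    return bool(task.dockerfile)

def referee(task: Task, attempt: int) -> bool:
    # Run the tests and give a pass/fail verdict. Here we pretend the
    # first build is flaky and only the second attempt succeeds.
    return attempt >= 2

def build_environment(task: Task, max_attempts: int = 3) -> Task:
    task = scout(task)
    for attempt in range(1, max_attempts + 1):
        task = architect(task, attempt)
        if inspector(task) and referee(task, attempt):
            task.tests_ok = True   # environment accepted
            break
        # otherwise: send it back to the Architect and retry
    return task

env = build_environment(Task(repo="example/widget", issue=" Fix #42 "))
print(env.tests_ok)  # → True (accepted on the second attempt)
```

The key design point is the feedback loop: a failed verdict from the Referee doesn't discard the room, it routes the task back to the Architect for another build attempt.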

4. The Secret Sauce: Quality Control (The "Filter")

Just having 45,000 rooms isn't enough. Some rooms are broken, and some are too easy (like asking a robot to tie its own shoelaces).

  • The "Unsolvable" Trap: Sometimes the bug description doesn't match the actual code. It's like a teacher giving a math problem with no solution. The AI would just spin its wheels.
  • The "Trivial" Trap: Sometimes the problem is so obvious (e.g., "Change 'ra' to 'raise'") that it teaches the AI nothing.
  • The Filter: They used a smart filtering process to throw out the broken rooms and the boring ones. They kept only the "Goldilocks" problems: hard enough to be challenging, but solvable enough to be a good lesson. This left them with about 13,000 high-quality training sessions.
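The two traps above can be expressed as a simple keep/discard rule. The thresholds and field names here are illustrative assumptions, not the paper's actual filter: imagine each candidate task records how often a probe agent solved it across several trial runs, plus the size of the ground-truth fix.

```python
def keep_task(pass_rate: float, patch_lines: int) -> bool:
    """Hypothetical 'Goldilocks' filter for candidate training tasks."""
    if pass_rate == 0.0:    # the "unsolvable" trap: nobody ever passes
        return False
    if pass_rate == 1.0:    # the "trivial" trap: everyone always passes
        return False
    if patch_lines < 2:     # one-character fixes teach the model nothing
        return False
    return True             # hard but solvable: keep it

candidates = [
    {"id": "broken-spec", "pass_rate": 0.0, "patch_lines": 14},
    {"id": "typo-fix",    "pass_rate": 1.0, "patch_lines": 1},
    {"id": "real-bug",    "pass_rate": 0.4, "patch_lines": 22},
]
kept = [t["id"] for t in candidates
        if keep_task(t["pass_rate"], t["patch_lines"])]
print(kept)  # → ['real-bug']
```

Applied at scale, a rule like this is how ~45,000 raw environments shrink to roughly 13,000 high-quality training sessions: only tasks in the middle difficulty band survive.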

5. The Results: The AI Gets Smarter

They trained their AI models (OpenSWE-32B and OpenSWE-72B) in this new gym and tested them on a famous exam called SWE-bench.

  • The Score: They got the highest scores ever recorded, 62.4% (OpenSWE-32B) and 66.0% (OpenSWE-72B), beating all previous methods.
  • The "No Saturation" Discovery: Usually, after a while, adding more data doesn't help (you hit a wall). But here, the more they trained, the better the AI got. It was like a student who never stops improving, no matter how many hours they study.
  • Bonus Skills: Interestingly, training on these coding problems also made the AI better at math and science. Learning to debug code seems to teach general logical reasoning, which transfers to other domains. Crucially, this came without forgetting: the models still retained factual knowledge (like history dates).

Summary

OpenSWE is like building a massive, state-of-the-art driving school for AI cars. Instead of letting them crash into walls or drive on empty tracks, they built thousands of realistic, challenging driving scenarios. The result? The AI drivers learned faster, drove better, and even became better at navigating complex maps (math/science) than ever before. And the best part? They gave the blueprints to the whole world for free.
