DeNovoSWE: Scaling Long-Horizon Environments for… — Plain-Language Explanation

Original authors: Jiale Zhao, Guoxin Chen, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Ji-Rong Wen, Kai Jia

Published 2026-06-17

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ✨ This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have a master architect who is incredibly talented at fixing a single broken brick in a wall. They can look at a specific crack, figure out why it happened, and patch it perfectly. This is what current AI coding assistants are good at: fixing small bugs in existing software.

But what if you asked that same architect to build an entire skyscraper from scratch, starting only with a vague sketch on a napkin? That is a much harder job. It requires planning the foundation, designing the elevators, wiring the electricity, and ensuring every room connects logically. Until now, AI has struggled with this "whole-building" task because it hasn't had enough practice.

The Problem: A Lack of Blueprints
The main reason AI struggles to build entire software "skyscrapers" (repositories) is a lack of training data. Most existing training data is like a list of "fix this leaky faucet" or "repair this cracked window." There is very little data that teaches an AI how to take a high-level description (like a user manual) and turn it into a fully functional, complex building from the ground up.

The Solution: DeNovoSWE
The paper introduces DeNovoSWE, a massive new training dataset designed specifically to teach AI how to build entire software projects from scratch. Think of it as a giant library of 4,818 "construction challenges." In each challenge, the AI is given a detailed set of instructions (documentation) and must build the entire software system to match those instructions perfectly.

How They Built the Training Data (The "Divide and Conquer" Strategy)
Building a dataset for this is hard because you can't just ask an AI to "write a manual for a complex app" and hope it's perfect. The authors used a clever, automated team of AI agents to do the heavy lifting:

The "Divide" Phase: Imagine trying to write a manual for a massive city. It's too big to do all at once. So, the AI first breaks the city down into neighborhoods (capabilities). It identifies what each neighborhood does and which buildings (code files) belong to it.
The "Conquer" Phase: Now, specialized AI agents write the manual for just one neighborhood at a time.
- The Draft Agent: Writes the first version of the manual.
- The Critic Agent: Acts like a strict editor. It checks the manual: "Did you forget to explain how the traffic lights work? Is the wiring diagram clear?"
- The Repair Agent: Fixes the mistakes the Critic found.
- This cycle repeats until the Critic has no remaining complaints about the manual.
The "Golden" Test: Once the manual is written, the system creates a "clean room" (a sandbox). It takes the original code, throws it away, and leaves only the manual. Then, it asks a different AI agent to rebuild the software using only that manual. If the new software passes all the tests, the manual is considered high-quality and added to the dataset.
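The draft, critique, and repair cycle described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the authors' pipeline: the function names, the `max_rounds` cap, and the toy agents (which stand in for real AI agents) are all assumptions made for this example.

```python
from typing import Callable, List


def refine_document(
    draft: Callable[[str], str],
    critique: Callable[[str], List[str]],
    repair: Callable[[str, List[str]], str],
    capability: str,
    max_rounds: int = 5,  # hypothetical safety cap, not from the paper
) -> str:
    """Draft once, then alternate critique and repair until no issues remain."""
    doc = draft(capability)
    for _ in range(max_rounds):
        issues = critique(doc)
        if not issues:
            break  # the critic is satisfied
        doc = repair(doc, issues)
    return doc


# Toy stand-ins for the three agents (hypothetical, for illustration only).
REQUIRED_TOPICS = ["inputs", "outputs", "errors"]


def toy_draft(capability: str) -> str:
    # The first draft covers only one of the required topics.
    return f"{capability}: describes inputs"


def toy_critique(doc: str) -> List[str]:
    # Report every required topic that the document does not mention yet.
    return [topic for topic in REQUIRED_TOPICS if topic not in doc]


def toy_repair(doc: str, issues: List[str]) -> str:
    # Fix one reported issue per round so the loop runs more than once.
    return doc + ", " + issues[0]


final_doc = refine_document(toy_draft, toy_critique, toy_repair, "parser")
print(final_doc)
```

In this toy run the loop needs two repair rounds before the critic stops reporting missing topics. A real system would replace the three toy functions with calls to language-model agents.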

Keeping the AI Honest (Anti-Cheating)
A major concern is that the AI might "cheat" by peeking at the original code it was supposed to rebuild. The authors built a rigorous security system to prevent this:

They strip away all history, caches, and hidden files.
They block the AI from going online to download the original code.
They ensure the AI is truly "closed-book," forcing it to rely entirely on the documentation it was given.
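The first of these steps, removing history, caches, and hidden files, might look roughly like the sketch below. It is an assumption-laden illustration rather than the authors' tooling: the list of cache directory names and the helper names are hypothetical, and blocking network access (the second step) would be handled by the sandbox itself and is not shown.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import List

# Hypothetical examples of cache directories worth removing.
CACHE_DIR_NAMES = {"__pycache__", "node_modules"}


def is_leak_risk(name: str) -> bool:
    """Treat hidden entries (such as .git) and known caches as leak risks."""
    return name.startswith(".") or name in CACHE_DIR_NAMES


def sanitize_workspace(root: Path) -> List[str]:
    """Delete risky files and directories under root; return what was removed."""
    removed: List[str] = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        current = Path(dirpath)
        for name in list(dirnames):
            if is_leak_risk(name):
                shutil.rmtree(current / name)
                dirnames.remove(name)  # do not descend into the deleted directory
                removed.append((current / name).relative_to(root).as_posix())
        for name in filenames:
            if is_leak_risk(name):
                (current / name).unlink()
                removed.append((current / name).relative_to(root).as_posix())
    return sorted(removed)


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".git").mkdir()
    (root / ".git" / "HEAD").write_text("ref: refs/heads/main")
    (root / "__pycache__").mkdir()
    (root / "docs").mkdir()
    (root / "docs" / "spec.md").write_text("# Spec")
    (root / "docs" / ".hidden").write_text("secret")

    removed = sanitize_workspace(root)
    remaining = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))

print(removed)
print(remaining)
```

After the cleanup only the documentation folder and its visible file remain, which mirrors the "closed-book" setup described above.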

The "Difficulty-Aware" Filter
Not every construction project is the same. Some are small sheds; others are skyscrapers. If you only keep the training examples where the AI built a perfect skyscraper on the first try, you lose all the valuable lessons from the hard projects where the AI almost succeeded but made a few mistakes.

The authors created a smart filter that looks at how hard a project is.

For easy projects (a small shed), they demand a perfect score.
For hard projects (a skyscraper), they accept a slightly lower score (e.g., 80% success) because getting a perfect skyscraper is incredibly difficult.
This ensures the AI learns from both easy wins and the "almost there" moments on difficult tasks.
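A minimal sketch of such a filter is shown below. The two difficulty labels, the `Trajectory` fields, and the exact thresholds are assumptions for illustration: the 0.8 value simply mirrors the "80%" example above, and the paper may define difficulty and cutoffs differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    task_id: str
    difficulty: str   # "easy" or "hard" (hypothetical labels)
    pass_rate: float  # fraction of tests passed, from 0.0 to 1.0


# Illustrative thresholds: easy tasks must be perfect, hard tasks may fall short.
THRESHOLDS: Dict[str, float] = {"easy": 1.0, "hard": 0.8}


def keep(trajectory: Trajectory) -> bool:
    """Keep a trajectory if it meets the bar for its difficulty level."""
    return trajectory.pass_rate >= THRESHOLDS[trajectory.difficulty]


def filter_trajectories(trajectories: List[Trajectory]) -> List[Trajectory]:
    return [t for t in trajectories if keep(t)]


candidates = [
    Trajectory("shed-a", "easy", 1.0),    # perfect easy task: kept
    Trajectory("shed-b", "easy", 0.9),    # imperfect easy task: dropped
    Trajectory("tower-a", "hard", 0.85),  # near-miss on a hard task: kept
    Trajectory("tower-b", "hard", 0.6),   # too far from passing: dropped
]

kept_ids = [t.task_id for t in filter_trajectories(candidates)]
print(kept_ids)
```

A single fixed cutoff of 1.0 would have discarded `tower-a`, which is exactly the kind of "almost there" example the filter is meant to preserve.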

The Results
When they taught an AI model using this new dataset, the results were dramatic.

Before training, the AI could only build about 5.8% of the required software correctly.
After training with DeNovoSWE, its success rate jumped to 47.2%.
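To put the two figures side by side, the short calculation below works out the absolute and relative change. It uses only the two percentages quoted above; the variable names are made up for this example.

```python
before = 5.8   # success rate before training, in percent
after = 47.2   # success rate after training, in percent

absolute_gain = round(after - before, 1)  # gain in percentage points
relative_gain = round(after / before, 2)  # how many times higher

print(f"Absolute gain: {absolute_gain} percentage points")
print(f"Relative gain: about {relative_gain}x")
```

In other words, the gain is about 41.4 percentage points, or a little over eight times the starting success rate.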

This suggests that giving AI the right kind of "construction practice" (breaking big problems down, using strict editors, and learning from partial successes) can teach it to move from being a simple "brick fixer" to a true "software architect."

DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch
