Imagine you're hiring a new architect to build a house.
The Old Way (Previous Benchmarks):
You give the architect a single blueprint: "Build a kitchen with a sink and a stove." They hand you a kitchen. You check the sink and stove, and they work! You give them a gold star.
The problem? You don't know if they used cheap materials that will rot in a month, or if they built the kitchen in a way that makes it impossible to add a dining room later. In the real world, houses aren't built in one day; they are lived in, changed, and expanded over decades. The old tests only checked if the house was "finished" on day one, not if it could survive a family growing up in it.
The New Way (SWE-CI):
This paper introduces a new test called SWE-CI. Instead of asking an AI to build a kitchen once, they say:
"Here is a house as it was in 2020. Over the next 233 days, across 71 separate renovations, the family will need to add a nursery, then a home office, then a solar panel system, and finally a second floor. Your job isn't just to build the first room; it's to keep the whole house standing, safe, and easy to expand through every one of those renovations."
The Core Idea: "The House That Keeps Changing"
The researchers built a benchmark using 100 real-world software projects (like popular Python libraries). They didn't just look at the start and end points; they looked at the entire journey between them.
- The Timeline: On average, each task covers 233 days of real history with 71 updates (commits).
- The Challenge: The AI has to act like a software team that doesn't just fix a bug and leave. It has to keep the code "healthy" while adding new features, fixing old ones, and making sure the new stuff doesn't break the old stuff.
How They Test the AI: The "Architect and Builder" Team
To make this realistic, they didn't just ask the AI to "fix it." They split the AI into two roles, mimicking a real software company:
- The Architect (The Brain): This agent looks at the broken parts of the house (failing tests) and says, "Okay, the roof is leaking, and we need a new window. Let's write a plan to fix the leak first, but don't worry about the window yet."
- The Builder (The Hands): This agent takes the plan and actually writes the code to fix the leak.
They do this in a loop: Plan → Build → Test → Plan → Build.
If the Builder fixes the leak but accidentally knocks down a wall while doing it, the Architect has to notice that in the next round and fix the wall. This cycle repeats dozens of times.
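The loop above can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual harness: the "codebase" is just a dict of feature flags, a "fix" can cause collateral damage, and the Architect catches that damage on the next round. All names here (`run_tests`, `architect_plan`, `builder_fix`, `evolve`) are invented for the sketch.

```python
# Toy Architect/Builder loop: each round, run the tests, pick one failing
# feature, and apply its fix. A fix may break something else (a regression),
# which shows up as a new failure the following round.

def run_tests(code):
    """Return the set of features whose tests currently fail."""
    return {feature for feature, ok in code.items() if not ok}

def architect_plan(failing):
    """The 'brain': pick one failing feature to target this round."""
    return min(failing)  # deterministic choice for the sketch

def builder_fix(code, target):
    """The 'hands': repair the target, sometimes with collateral damage."""
    code[target] = True
    if target == "leak":       # fixing the leak knocks down a wall...
        code["wall"] = False   # ...a regression the next round must catch

def evolve(code, max_rounds=10):
    for rounds in range(1, max_rounds + 1):
        failing = run_tests(code)
        if not failing:
            return rounds - 1  # how many fix rounds were needed
        builder_fix(code, architect_plan(failing))
    return max_rounds

code = {"leak": False, "wall": True, "window": True}
print(evolve(code))  # round 1 fixes the leak but breaks the wall; round 2 fixes the wall -> 2
```

The point of the sketch is the feedback shape: the Builder's side effects are only visible to the Architect through the next round's test run, which is exactly why the loop has to repeat.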
The Score: "The Future-Proof Score" (EvoScore)
In old tests, if the code works at the end, you get 100%. In SWE-CI, they use a special score called EvoScore.
Think of it like a credit score for code quality:
- If the AI fixes a problem today but makes the code so messy that fixing a different problem next week becomes a nightmare, their score goes down.
- If the AI fixes the problem today in a clean, organized way that makes next week's work easy, their score goes up.
They even have a "regression" check. If the AI fixes a bug but accidentally breaks a feature that was working perfectly before, that's a "regression" (like fixing a leak but causing the pipes to burst). The paper found that most AI models are terrible at this; they often break more things than they fix when working over a long period.
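The regression check boils down to comparing which tests passed before and after a change: anything that flips from pass to fail is a regression. The snippet below illustrates that pass-set diff with Python sets; it is not the paper's actual EvoScore formula, just the bookkeeping idea it rests on, with made-up test names.

```python
# Diff the set of passing tests before and after a change.
# pass -> fail = regression; fail -> pass = a genuine new fix.

def regressions(passed_before, passed_after):
    """Tests that worked before the change but fail after it."""
    return passed_before - passed_after

def new_fixes(passed_before, passed_after):
    """Tests that failed before the change but pass after it."""
    return passed_after - passed_before

before = {"test_sink", "test_stove", "test_door"}
after = {"test_sink", "test_stove", "test_roof"}  # fixed the roof, broke the door

print(sorted(regressions(before, after)))  # ['test_door']
print(sorted(new_fixes(before, after)))    # ['test_roof']
```

A score in this spirit rewards `new_fixes` while penalizing `regressions`, so an agent that "fixes the leak but bursts the pipes" comes out behind one that changes less but breaks nothing.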
What Did They Find?
- AI is getting better, but not there yet: Newer AI models are much better at this than older ones, but they still struggle with the "long game." They are great at quick fixes but often create "technical debt" (messy code) that hurts them later.
- Different companies have different styles: Some AI models (like Anthropic's Claude models) seem to care more about long-term stability, while others rush to fix the immediate problem and ignore the future mess.
- The "Zero-Regression" Problem: Most AI models fail to keep the code stable. In the test, most models broke existing features more than 75% of the time when trying to evolve the code over a long period.
The Bottom Line
This paper is a wake-up call. It tells us that while AI is amazing at writing code for a single task, it's still learning how to be a good software engineer who cares about the long-term health of a project.
SWE-CI is the new gym where we train AI not just to lift heavy weights (write code), but to run a marathon (maintain code) without tripping over its own shoelaces.