Imagine you're hiring a new architect to build a house.
The Old Way (Previous Benchmarks):
You give the architect a single blueprint: "Build a kitchen with a sink and a stove." They hand you a kitchen. You check the sink and stove, and they work! You give them a gold star.
The problem? You don't know if they used cheap materials that will rot in a month, or if they built the kitchen in a way that makes it impossible to add a dining room later. In the real world, houses aren't built in one day; they are lived in, changed, and expanded over decades. The old tests only checked if the house was "finished" on day one, not if it could survive a family growing up in it.
The New Way (SWE-CI):
This paper introduces a new test called SWE-CI. Instead of asking an AI to build a kitchen once, they say:
"Here is a house as it was in 2020. Over the next 233 days, across 71 separate renovations, the family will need to add a nursery, then a home office, then a solar panel system, and finally a second floor. Your job isn't just to build the first room; it's to keep the whole house standing, safe, and easy to expand through every one of those renovations."
The Core Idea: "The House That Keeps Changing"
The researchers built a benchmark using 100 real-world software projects (like popular Python libraries). They didn't just look at the start and end points; they looked at the entire journey between them.
- The Timeline: On average, each task covers 233 days of real history with 71 updates (commits).
- The Challenge: The AI has to act like a software team that doesn't just fix a bug and leave. It has to keep the code "healthy" while adding new features, fixing old ones, and making sure the new stuff doesn't break the old stuff.
How They Test the AI: The "Architect and Builder" Team
To make this realistic, they didn't just ask the AI to "fix it." They split the AI into two roles, mimicking a real software company:
- The Architect (The Brain): This agent looks at the broken parts of the house (failing tests) and says, "Okay, the roof is leaking, and we need a new window. Let's write a plan to fix the leak first, but don't worry about the window yet."
- The Builder (The Hands): This agent takes the plan and actually writes the code to fix the leak.
They do this in a loop: Plan → Build → Test → Plan → Build.
If the Builder fixes the leak but accidentally knocks down a wall while doing it, the Architect has to notice that in the next round and fix the wall. This cycle repeats dozens of times.
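The loop above can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual harness: the "codebase" is just a dict of feature flags, a "fix" can cause collateral damage, and the Architect catches that damage on the next round. All names here (`run_tests`, `architect_plan`, `builder_fix`, `evolve`) are invented for the sketch.

```python
# Toy Architect/Builder loop: each round, run the tests, pick one failing
# feature, and apply its fix. A fix may break something else (a regression),
# which shows up as a new failure the following round.

def run_tests(code):
    """Return the set of features whose tests currently fail."""
    return {feature for feature, ok in code.items() if not ok}

def architect_plan(failing):
    """The 'brain': pick one failing feature to target this round."""
    return min(failing)  # deterministic choice for the sketch

def builder_fix(code, target):
    """The 'hands': repair the target, sometimes with collateral damage."""
    code[target] = True
    if target == "leak":       # fixing the leak knocks down a wall...
        code["wall"] = False   # ...a regression the next round must catch

def evolve(code, max_rounds=10):
    for rounds in range(1, max_rounds + 1):
        failing = run_tests(code)
        if not failing:
            return rounds - 1  # how many fix rounds were needed
        builder_fix(code, architect_plan(failing))
    return max_rounds

code = {"leak": False, "wall": True, "window": True}
print(evolve(code))  # round 1 fixes the leak but breaks the wall; round 2 fixes the wall -> 2
```

The point of the sketch is the feedback shape: the Builder's side effects are only visible to the Architect through the next round's test run, which is exactly why the loop has to repeat.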
The Score: "The Future-Proof Score" (EvoScore)
In old tests, if the code works at the end, you get 100%. In SWE-CI, they use a special score called EvoScore.
Think of it like a credit score for code quality:
- If the AI fixes a problem today but makes the code so messy that fixing a different problem next week becomes a nightmare, their score goes down.
- If the AI fixes the problem today in a clean, organized way that makes next week's work easy, their score goes up.
They even have a "regression" check. If the AI fixes a bug but accidentally breaks a feature that was working perfectly before, that's a "regression" (like fixing a leak but causing the pipes to burst). The paper found that most AI models are terrible at this; they often break more things than they fix when working over a long period.
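The regression check boils down to comparing which tests passed before and after a change: anything that flips from pass to fail is a regression. The snippet below illustrates that pass-set diff with Python sets; it is not the paper's actual EvoScore formula, just the bookkeeping idea it rests on, with made-up test names.

```python
# Diff the set of passing tests before and after a change.
# pass -> fail = regression; fail -> pass = a genuine new fix.

def regressions(passed_before, passed_after):
    """Tests that worked before the change but fail after it."""
    return passed_before - passed_after

def new_fixes(passed_before, passed_after):
    """Tests that failed before the change but pass after it."""
    return passed_after - passed_before

before = {"test_sink", "test_stove", "test_door"}
after = {"test_sink", "test_stove", "test_roof"}  # fixed the roof, broke the door

print(sorted(regressions(before, after)))  # ['test_door']
print(sorted(new_fixes(before, after)))    # ['test_roof']
```

A score in this spirit rewards `new_fixes` while penalizing `regressions`, so an agent that "fixes the leak but bursts the pipes" comes out behind one that changes less but breaks nothing.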
What Did They Find?
- AI is getting better, but not there yet: Newer AI models are much better at this than older ones, but they still struggle with the "long game." They are great at quick fixes but often create "technical debt" (messy code) that hurts them later.
- Different companies have different styles: Some AI models (like Anthropic's Claude models) seem to care more about long-term stability, while others rush to fix the immediate problem and ignore the future mess.
- The "Zero-Regression" Problem: Most AI models fail to keep the code stable. In the test, most models broke existing features more than 75% of the time when trying to evolve the code over a long period.
The Bottom Line
This paper is a wake-up call. It tells us that while AI is amazing at writing code for a single task, it's still learning how to be a good software engineer who cares about the long-term health of a project.
SWE-CI is the new gym where we train AI not just to lift heavy weights (write code), but to run a marathon (maintain code) without tripping over its own shoelaces.