The Big Problem: The "Textbook" vs. The "Real World"
Imagine you are training a student to be a master detective.
The Old Way (Existing Benchmarks):
Most previous tests were like giving the student a textbook and asking, "Who killed Mr. Boddy in the library?" The student might get the answer right because they memorized the textbook (the "popular" code repositories everyone knows). But in the real world, the detective doesn't have a textbook; they have to walk into a messy, unfamiliar mansion, open drawers, check under rugs, and talk to witnesses to solve the mystery.
The problem is that current AI models are great at reciting facts from memory but terrible at actually exploring a new, complex codebase to find the answer. They are know-it-alls, not do-it-alls.
The Solution: SWE-QA-Pro
The authors built a new training ground called SWE-QA-Pro. Think of this as a "Survival Reality Show" for AI agents.
1. The Arena: Long-Tail Repositories
Instead of testing the AI on famous, well-known projects (like the "Library" in our analogy), they picked 26 obscure, weird, and complex software projects (the "Messy Mansions").
- Why? Because if an AI can solve a mystery in a weird, unknown mansion, it proves it actually knows how to investigate, not just how to recite facts.
- The Twist: They made sure every "mansion" was fully built and functional. The AI can actually run the code, not just read it.
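The paper doesn't spell out the selection pipeline, but a minimal sketch of the "fully built and functional" check might look like this (assuming a Python project with a pytest suite; `repo_is_runnable` is my own name, not the authors'):

```python
import subprocess
import tempfile
from pathlib import Path

def repo_is_runnable(repo_url: str) -> bool:
    """Clone a candidate repo and check that its test suite actually runs.

    A repository only qualifies if an agent can execute its code,
    not just read it. (Hypothetical helper, not the paper's pipeline.)
    """
    try:
        with tempfile.TemporaryDirectory() as workdir:
            repo_dir = Path(workdir) / "repo"
            clone = subprocess.run(
                ["git", "clone", "--depth", "1", repo_url, str(repo_dir)],
                capture_output=True,
            )
            if clone.returncode != 0:
                return False
            # A zero exit code from the tests means the project is
            # fully built and functional in this environment.
            tests = subprocess.run(
                ["python", "-m", "pytest", "-x", "-q"],
                cwd=repo_dir,
                capture_output=True,
            )
            return tests.returncode == 0
    except FileNotFoundError:
        # git (or python) is not installed here: treat as not runnable.
        return False
```

Only "mansions" that pass a check like this make it into the arena.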
2. The Filter: Weeding Out the Cheaters
The authors realized that some questions are too easy. If you ask, "What does the print function do?", a smart AI can answer that from memory without looking at the code. That's cheating in a detective test.
So, they created a Difficulty Filter:
- They asked the AI to answer a question without looking at the code.
- If the AI got it right just by guessing or remembering, they threw the question away.
- They only kept the questions where the AI had to open files, search through folders, and trace the logic to find the answer. This ensures the test measures exploration skills, not just memory.
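The filtering loop above can be sketched in a few lines. This is a hedged illustration, not the paper's actual code: `ask_model` stands in for an LLM call and `grade` for an answer scorer, both hypothetical names.

```python
# Minimal sketch of the difficulty filter. ask_model() and grade()
# are hypothetical stand-ins for the LLM query and the answer scorer.

def is_hard_enough(question: str, reference_answer: str,
                   ask_model, grade, threshold: float = 0.5) -> bool:
    """Keep a question only if the model fails it WITHOUT repo access."""
    # Ask with no code context at all: memory and guessing only.
    blind_answer = ask_model(question, context=None)
    score = grade(blind_answer, reference_answer)
    # If the model already answers correctly from memory, the question
    # doesn't test exploration -- discard it.
    return score < threshold

def filter_benchmark(questions, ask_model, grade):
    """Weed out questions answerable without opening a single file."""
    return [(q, a) for q, a in questions
            if is_hard_enough(q, a, ask_model, grade)]
```

Anything the "blind" model gets right is thrown away, so what survives requires genuine investigation.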
3. The Training Recipe: From "Student" to "Detective"
The paper also introduces a new way to train small AI models to become these expert detectives. They used a two-step "Gym Routine":
Step 1: Supervised Fine-Tuning (SFT) - "The Classroom"
They taught the model the rules of the game. They showed it examples of how to use tools (like "Search," "View File," and "Run Command") to solve problems. It's like teaching a student how to use a magnifying glass and a notepad.
Step 2: Reinforcement Learning from AI Feedback (RLAIF) - "The Drill Sergeant"
This is the secret sauce. After the classroom, they let the model try to solve problems on its own.
- If the model guessed the answer without looking, it got a low score.
- If the model opened the right files, found the specific line of code, and cited its evidence, it got a high score.
- An "AI Judge" (a very smart AI) graded the model's work, rewarding it for being thorough and factual.
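The scoring rules above can be sketched as a reward function. This is my own hedged illustration of the idea, not the paper's actual rubric: `judge_score` is a hypothetical stand-in for the "AI Judge" model call, and the specific reward values are made up.

```python
# Sketch of an RLAIF-style reward: exploring and citing evidence
# beats guessing. judge_score() is a hypothetical AI-judge call,
# and the numeric weights are illustrative, not from the paper.

def trajectory_reward(trajectory: list, answer: str, judge_score) -> float:
    """Score one solving attempt (a list of tool-call steps)."""
    opened_files = any(step["tool"] in ("view_file", "search")
                      for step in trajectory)
    cited_evidence = any(step.get("cited") for step in trajectory)

    if not opened_files:
        # Answered from memory without looking at the code: low reward.
        return 0.1
    # Base reward for exploring; extra for citing specific lines.
    reward = 0.5 if cited_evidence else 0.3
    # The AI judge grades the final answer for thoroughness and factuality.
    reward += 0.5 * judge_score(answer, trajectory)
    return min(reward, 1.0)
```

A lazy trajectory that skips straight to an answer caps out near the floor, while one that opens the right files and cites its evidence can earn the full reward.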
The Results: The Underdog Wins
The results were surprising. They took a relatively small, open-source model (Qwen3-8B) and trained it with this new "Reality Show" method.
- The Outcome: This small, trained model beat GPT-4o (a massive, expensive, proprietary model) on their specific test.
- The Lesson: You don't need a giant brain if you have the right training. Teaching an AI how to explore is more important than just making the AI bigger.
Summary Analogy
- Old Benchmarks: Asking a student to recite the plot of a famous movie they've seen a thousand times.
- SWE-QA-Pro: Putting the student in a dark room with a new, complex puzzle box and seeing if they can figure out how to open it by feeling the buttons and turning the knobs.
- The Training: Instead of just giving the student the answer key, they are rewarded every time they successfully turn a knob or find a hidden compartment.
In short: The paper says, "Stop testing AI on what it already knows. Start testing it on how well it can learn and explore new things. And if you train it to do that, even small AIs can beat the giants."