SWE-Fuse: Empowering Software Agents via Issue-free Trajectory Learning and Entropy-aware RLVR Training

SWE-Fuse is a novel training framework that enhances software engineering agents by fusing issue-free trajectory learning with entropy-aware RLVR to overcome the limitations of noisy real-world issue descriptions, achieving state-of-the-art performance on the SWE-bench Verified benchmark.

Xin-Cheng Wen, Binbin Chen, Haoxuan Lan, Hang Yu, Peng Di, Cuiyun Gao

Published Tue, 10 Ma

Imagine you are trying to teach a very smart but inexperienced apprentice how to fix broken machines in a giant, chaotic factory. This is what the paper SWE-Fuse is all about: teaching Artificial Intelligence (AI) to fix real-world software bugs.

Here is the story of how they did it, using simple analogies.

The Problem: The "Confusing Instruction Manual"

Usually, when we teach an AI to fix code, we give it a "ticket" (a bug report) that says, "The machine is making a weird noise." The AI then tries to fix it.

But in the real world, these tickets are often garbage.

  • The Mismatch: Sometimes the ticket says, "The engine is overheating," but the actual broken part is a loose screw in the radio.
  • The Result: The AI gets confused. It spends hours trying to fix the engine, only to realize the radio was the problem all along. It wastes time and learns the wrong lessons because the "instruction manual" (the bug description) was lying or unclear.

The Solution: SWE-Fuse (The Smart Training Camp)

The researchers built a new training system called SWE-Fuse. Instead of just reading the confusing tickets, they taught the AI a better way to learn. They used two main tricks:

Trick 1: The "Blindfolded Detective" (Issue-Free Learning)

Imagine you are training a detective. Instead of telling them, "The butler did it," you just show them the crime scene and say, "Find out what happened."

  • How it works: SWE-Fuse takes some training examples and hides the bug description. It forces the AI to look at the code, run tests, see what breaks, and figure out the problem on its own.
  • The Analogy: It's like teaching a chef to cook by giving them a list of ingredients and a broken dish, rather than a recipe that says "Add salt." The chef learns to taste and adjust (debug) rather than just blindly following a potentially wrong recipe.
  • Why it helps: This stops the AI from getting distracted by bad instructions. It learns the process of fixing things, not just memorizing the answers to specific questions.
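To make the "blindfolded detective" idea concrete, here is a minimal sketch of how issue-free training data could be built. This is an illustration of the concept, not the paper's actual pipeline: the field names (`issue`, `repo_snapshot`, `trajectory`) and the 50/50 masking ratio are assumptions.

```python
import random

def make_training_examples(samples, issue_free_ratio=0.5, seed=0):
    """Build training examples, hiding the issue text for a fraction of them.

    `samples` is a list of dicts with 'issue', 'repo_snapshot', and
    'trajectory' keys -- illustrative names, not from the paper.
    """
    rng = random.Random(seed)
    examples = []
    for s in samples:
        hide_issue = rng.random() < issue_free_ratio
        # Issue-free prompt: the agent must diagnose the bug from the
        # code and failing tests alone, instead of trusting the report.
        prompt = (
            "Explore the repository, run the tests, and fix the failure."
            if hide_issue
            else f"Issue report:\n{s['issue']}\n\nFix the bug."
        )
        examples.append({
            "prompt": prompt,
            "context": s["repo_snapshot"],
            "target": s["trajectory"],
        })
    return examples
```

With the issue text hidden, the supervision signal rewards the diagnostic process itself, so a misleading report can no longer steer the model toward the wrong fix.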

Trick 2: The "Adaptive Coach" (Entropy-Aware RL)

Once the AI has learned the basics, they put it in a "Reinforcement Learning" gym. This is the RLVR from the title: Reinforcement Learning with Verifiable Rewards, meaning the AI tries to fix bugs, gets an objective score (do the tests pass?), and tries again.

Usually, coaches are too strict or too loose.

  • Too strict: If the AI is confused (high "entropy" or uncertainty), a strict coach punishes it for trying new things. The AI gets scared to explore.
  • Too loose: If the AI is confident (low entropy), a loose coach lets it make wild guesses that might break everything.

SWE-Fuse's Coach is "Entropy-Aware":

  • When the AI is confused: The coach says, "Hey, you're not sure? Go ahead and try wild ideas! Explore!" (Relaxed rules).
  • When the AI is confident: The coach says, "You know what you're doing? Stick to the plan and be precise." (Strict rules).
  • The Analogy: Think of it like a parent teaching a child to ride a bike. When the child is wobbling and unsure, the parent holds the seat tight but encourages them to pedal. When the child is riding smoothly, the parent lets go and says, "Keep your balance, don't swerve!"
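The adaptive-coach idea can be sketched as an entropy-dependent clip range in a PPO-style policy update: when the policy's token distribution has high entropy (the AI is unsure), the clip range widens to permit bigger, more exploratory updates; when entropy is low, it tightens. The specific interpolation schedule and constants below are assumptions for illustration, not the paper's exact formula.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_aware_clip(ratio, advantage, probs,
                       base_eps=0.2, max_eps=0.4, entropy_ref=1.0):
    """PPO-style clipped objective with an entropy-dependent clip range.

    High entropy (uncertain policy) -> eps grows toward max_eps,
    allowing larger policy updates (exploration). Low entropy
    (confident policy) -> eps shrinks toward base_eps (precision).
    """
    h = token_entropy(probs)
    # Interpolate the clip range with entropy, capped at max_eps.
    eps = base_eps + (max_eps - base_eps) * min(h / entropy_ref, 1.0)
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    # Standard pessimistic (min) clipped surrogate.
    return min(ratio * advantage, clipped_ratio * advantage)
```

The same update step thus behaves like the relaxed coach on uncertain tokens and the strict coach on confident ones, without any manual switching.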

The Results: The Underdog Wins

The researchers tested this new training method on a famous challenge called SWE-bench, which is like the "Olympics" for AI code-fixing.

  • The Old Way: Other AI models (even very big ones) struggled. They got confused by the messy real-world tickets and often failed.
  • The SWE-Fuse Way: Their model, even though it was smaller (32 billion "brain cells" compared to some giants with hundreds of billions), won.
    • It solved 60.2% of the problems.
    • The next best open-source model only solved about 54%.
    • It even beat some massive, expensive, closed-source models from tech giants.

The Big Takeaway

The paper proves that you don't need the biggest, most expensive AI to fix software. You just need to teach it how to think rather than what to read.

By teaching the AI to ignore bad instructions and learn by doing (debugging step-by-step), and by coaching it with the right amount of freedom at the right time, they created a software repair expert that is smarter, faster, and more reliable than the competition.

In short: SWE-Fuse taught the AI to be a detective who solves crimes by looking at the evidence, rather than a student who just memorizes the answer key (even when the answer key is wrong).