Original authors: Heleno de Souza Campos Junior, Leonardo Gresta Paulino Murta

Published 2026-05-19. Author reviewed.

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you and a friend are both editing the same document at the same time. You both make changes to the same paragraph, and when you try to combine your work, the computer throws up its hands and says, "I don't know which version to keep!" This is called a merge conflict.

For decades, developers have had to manually fix these conflicts, which is tedious and prone to mistakes. Recently, two new "smart helpers" have emerged to solve this problem automatically. This paper is a head-to-head race between these two helpers to see which one is better.

The Two Contenders

Think of the two helpers as having very different personalities and skill sets:

1. The "Super-Reader" (LLM-based approach, represented by MergeGen)

How it works: This helper is like a brilliant student who has read millions of books and code documents. It doesn't really "calculate" the answer; instead, it uses its memory of how things usually look to guess the best solution. It predicts the next word or line based on patterns it has learned.
The Analogy: It's like a chef who has tasted thousands of soups. If you give it a recipe with a missing ingredient, it doesn't measure the spices; it just "knows" what the soup should taste like based on experience and adds the right amount.

2. The "Puzzle Solver" (Search-based approach, represented by SBCR)

How it works: This helper is a methodical engineer. It doesn't know what code means; it just sees lines of text. It treats the conflict like a giant jigsaw puzzle. It tries millions of different combinations of the existing lines, checking each one to see which mix looks the most like the original versions. It uses a simple rule: "The best solution is usually a mix that looks somewhat like both parents."
The Analogy: It's like a detective who has no idea who the suspect is, so they try every possible combination of alibis and clues until they find the one that fits the facts perfectly. It doesn't guess; it tests.

The Race: What Happened?

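The researchers pitted these two against thousands of real-world conflicts from open-source projects written in Java, C#, JavaScript, and TypeScript. Overall, the Super-Reader came out ahead on both accuracy and speed, but a closer look showed that each helper has its own home turf. Here is what they found: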

1. The "Super-Reader" wins when things are messy.
When the two versions of the code were very different in size (e.g., one version added a huge paragraph while the other deleted a single line), the Super-Reader did especially well. Because it learned from so much data, it could use the context to pick the right lines, even when the balance was lopsided. It was also much faster, with a median of about 0.3 seconds per conflict versus 1.3 seconds for the Puzzle Solver.

2. The "Puzzle Solver" wins when things are balanced.
When the two versions were similar in size and structure, the Puzzle Solver was the champion. It found the perfect mix of lines more often than the Super-Reader. It was also more reliable when the code contained weird symbols, non-English text, or was extremely long.

3. The "Super-Reader" has a few bad habits.

Memorization: Sometimes the Super-Reader got "stuck" on a specific example it had seen during training and repeated that answer, even when it was wrong for the current situation. This is called overfitting: it memorized the test instead of learning the lesson.
Short Attention Span: If the code chunk was too huge, the Super-Reader would get overwhelmed and stop writing halfway through, leaving the conflict half-solved.
Language Barrier: If the code had comments in a language the model wasn't trained on, it got confused.

4. The "Puzzle Solver" is a bit slow but steady.
It takes longer to solve the puzzle because it has to test many combinations. However, long text and strange languages do not throw it off, because it treats everything as simple text. It doesn't "memorize" anything, so it doesn't overfit.

The Big Conclusion: No "Silver Bullet"

The paper concludes that neither helper is perfect on its own.

If you give the Super-Reader a small, messy conflict, it's a genius.
If you give the Puzzle Solver a huge, balanced, or weirdly formatted conflict, it's the reliable workhorse.

The Solution?
The authors suggest building a hybrid system—a "Traffic Cop" that looks at the conflict first.

If the conflict is small and messy, the Traffic Cop sends it to the Super-Reader.
If the conflict is huge, balanced, or contains weird characters, the Traffic Cop sends it to the Puzzle Solver.

By letting the right tool do the right job, we can create a system that is both fast and accurate, saving developers from the headache of manual merging.

Summary in One Sentence

This paper finds that while AI "guessers" are fast and great at messy problems, "searchers" are more reliable for large, balanced, or weird ones, and it suggests that the best future tool will be a smart combination of both.

Technical Summary: LLM-based vs. Search-based Merge Conflict Resolution

Problem Statement

In modern collaborative software development, merge conflicts arise when concurrent modifications overlap in the same code regions. Although the majority of these conflicts (approximately 87%) are resolved by combining existing lines from the conflicting versions without writing new code, the process remains time-consuming and error-prone. Two competing paradigms have recently emerged to automate this resolution: Generative AI (GenAI) based on Large Language Models (LLMs), and Search-Based Software Engineering (SBSE) based on heuristic optimization. Tools from both paradigms show promise, but their relative strengths, weaknesses, and fundamental trade-offs in real-world scenarios were previously unexplored.
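To make the idea of a "combination-based" resolution concrete, here is a minimal sketch of how one might test for it. The function name `is_combination_based` and the exact rule (every resolution line must appear verbatim in one of the two versions, ignoring order) are assumptions for illustration; the paper's own filter may be stricter, for example about line ordering or whitespace.

```python
def is_combination_based(left: list[str], right: list[str], resolution: list[str]) -> bool:
    """Return True if every line of the resolution already exists in one of
    the two conflicting versions (assumed, simplified definition)."""
    available = set(left) | set(right)
    return all(line in available for line in resolution)


# Hypothetical conflict: both sides touch the same import block.
left = ["import a", "import b"]
right = ["import a", "import c"]

# The developer kept lines from both sides and wrote nothing new.
combined = ["import a", "import b", "import c"]

# The developer wrote a brand-new line, so this is not a pure combination.
with_new_code = ["import a", "import d"]

print(is_combination_based(left, right, combined))       # True
print(is_combination_based(left, right, with_new_code))  # False
```

Under this simplified rule, an empty resolution (the developer deleted everything) also counts as a combination.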

Methodology

This study presents the first in-depth empirical comparison between these two paradigms, evaluating MergeGen (a state-of-the-art LLM-based tool) against SBCR (a novel SBSE approach using a Random Restart Hill Climbing algorithm).
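The following is a rough sketch of what Random Restart Hill Climbing over line combinations can look like, not SBCR's actual implementation. Several things here are assumptions: a candidate is a keep/drop mask over the left lines followed by the right lines (so lines cannot be reordered), a neighbor flips one mask entry, the fitness is the mean similarity to both parents, and a fixed number of restarts stands in for SBCR's execution-time budget.

```python
import random
from difflib import SequenceMatcher


def similarity(a, b):
    """Line-level similarity ratio in [0, 1] between two lists of lines."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def fitness(candidate, left, right):
    """Assumed fitness: mean similarity of the candidate to both parents."""
    return (similarity(candidate, left) + similarity(candidate, right)) / 2


def apply_mask(pool, mask):
    """Keep only the pool lines whose mask entry is True."""
    return [line for line, keep in zip(pool, mask) if keep]


def random_restart_hill_climbing(left, right, restarts=5,
                                 neighbors_per_iteration=10,
                                 stagnation_limit=20, seed=0):
    rng = random.Random(seed)
    pool = left + right
    if not pool:
        return [], 0.0
    best, best_score = [], -1.0
    for _ in range(restarts):
        # Each restart begins from a random subset of the available lines.
        mask = [rng.random() < 0.5 for _ in pool]
        score = fitness(apply_mask(pool, mask), left, right)
        stagnant = 0
        while stagnant < stagnation_limit:
            improved = False
            for _ in range(neighbors_per_iteration):
                # A neighbor differs from the current mask in one position.
                neighbor = mask.copy()
                i = rng.randrange(len(pool))
                neighbor[i] = not neighbor[i]
                neighbor_score = fitness(apply_mask(pool, neighbor), left, right)
                if neighbor_score > score:
                    mask, score, improved = neighbor, neighbor_score, True
            # Give up on this restart after too many rounds without progress.
            stagnant = 0 if improved else stagnant + 1
        if score > best_score:
            best, best_score = apply_mask(pool, mask), score
    return best, best_score


left = ["a = 1", "b = 2"]
right = ["a = 1", "c = 3"]
candidate, score = random_restart_hill_climbing(left, right, seed=1)
print(candidate, round(score, 3))
```

The restarts are there because plain hill climbing stops at the first local optimum it reaches; starting again from a different random point gives the search more chances to find a better one.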

Scope: The evaluation focused specifically on "combination-based" conflicts, where the resolution involves interleaving existing lines from two versions without generating new code. This scope was chosen to ensure a fair comparison, as SBCR cannot generate new code, whereas MergeGen can.
Datasets: The study utilized thousands of real-world conflicts from open-source projects in four languages: Java, C#, JavaScript, and TypeScript. Two primary datasets were used:
- Dataset1: 6,269 Java conflicts.
- Dataset2: 47,363 conflicts across the four languages (filtered for combination-based resolutions).
Experimental Design:
- MergeGen: Configured with a CodeT5 encoder-decoder model, trained on language-specific data. Input and output token limits were set to 300 and 100, respectively, due to computational constraints.
- SBCR: Configured via systematic parameter tuning (neighbors per iteration, execution time, stagnation limit) to optimize the balance between solution quality and execution time.
- Metrics: Primary metrics included Similarity (measured via Gestalt pattern matching/LCS against the developer's actual resolution) and Execution Time. Statistical significance was assessed using the Wilcoxon Signed-Rank test and Common Language Effect Size (CLES).
- Generalization: The study evaluated performance when models were trained/tuned on one dataset and tested on another to assess adaptability.
- Qualitative Analysis: A manual inspection of 100 extreme cases (50 where SBCR won, 50 where MergeGen won) was conducted to identify patterns explaining performance differences.
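The similarity metric and the two headline numbers reported later (median similarity and perfect match rate) can be sketched with Python's standard library. `difflib.SequenceMatcher` uses an algorithm closely related to Ratcliff/Obershelp gestalt pattern matching; whether the study used this exact implementation is an assumption, and the helper names below are hypothetical.

```python
from difflib import SequenceMatcher
from statistics import median


def resolution_similarity(candidate: str, developer_resolution: str) -> float:
    """Similarity in [0, 1] between a tool's output and the developer's resolution."""
    return SequenceMatcher(None, candidate, developer_resolution, autojunk=False).ratio()


def summarize(pairs):
    """Return (median similarity, perfect match rate) over (candidate, truth) pairs."""
    sims = [resolution_similarity(candidate, truth) for candidate, truth in pairs]
    perfect = sum(1 for candidate, truth in pairs if candidate == truth)
    return median(sims), perfect / len(pairs)


# Made-up strings for illustration only, not data from the paper.
print(resolution_similarity("abcd", "bcde"))  # 0.75
print(summarize([("a", "a"), ("abcd", "bcde"), ("x", "y")]))
```

The ratio is twice the number of matched characters divided by the combined length of both strings, so "abcd" and "bcde" share "bcd" and score 2 * 3 / 8 = 0.75.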

Key Contributions and Findings

1. Performance Comparison (RQ1 & RQ2)

Accuracy: The GenAI paradigm (MergeGen) consistently outperformed the SBSE paradigm (SBCR) in terms of resolution similarity across all languages (Java, C#, JavaScript, TypeScript). MergeGen achieved a median similarity of 100% and a perfect match rate of 55% in Java, compared to SBCR's 86.1% median and 19.6% perfect match rate.
Speed: MergeGen was significantly faster, with a median generation time of 0.3 seconds versus SBCR's 1.3 seconds.
Statistical Significance: The differences were statistically significant ($p < 0.001$) across all languages, with MergeGen showing a 70.6% probability of generating a more similar resolution in a random Java conflict.
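A figure like the 70.6% above is a Common Language Effect Size. One common way to compute it for paired scores is sketched below; counting ties as half a win is an assumption, and the paper may handle ties differently. The numbers are made up for illustration. The significance test itself is available in SciPy as `scipy.stats.wilcoxon` and is not reproduced here.

```python
def paired_cles(scores_a, scores_b):
    """Probability that tool A scores higher than tool B on a randomly chosen
    conflict, counting ties as half a win (assumed convention)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    wins = sum(1.0 for a, b in zip(scores_a, scores_b) if a > b)
    ties = sum(0.5 for a, b in zip(scores_a, scores_b) if a == b)
    return (wins + ties) / len(scores_a)


# Hypothetical per-conflict similarities for two tools (not from the paper).
tool_a = [1.0, 0.9, 0.8, 0.5]
tool_b = [0.8, 0.9, 0.6, 0.7]

print(paired_cles(tool_a, tool_b))  # 0.625: two wins, one tie, one loss
```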

2. Generalization Capabilities (RQ3)

SBSE Robustness: SBCR demonstrated superior generalization. Its performance remained stable regardless of whether it was tuned on the same dataset or a completely different dataset (cross-dataset evaluation). It is data-independent and does not suffer from training distribution shifts.
GenAI Sensitivity: MergeGen showed slight sensitivity to its training data. While it still outperformed SBCR in cross-dataset scenarios, its performance dropped slightly when trained on a different dataset, suggesting a degree of overfitting to specific project styles or patterns.

3. Contextual Strengths and Weaknesses (RQ4)

Qualitative analysis revealed distinct failure and success modes for each paradigm:

MergeGen Strengths: Excels in imbalanced conflicts (e.g., one version is significantly larger than the other) and scenarios involving whitespace or removed content. It leverages learned contextual patterns to infer the correct unbalanced resolution.
MergeGen Weaknesses: Struggles with non-English content, large inputs (leading to truncation due to token limits), and empty candidates. The study identified potential overfitting, where the model appeared to memorize specific repetitive conflicts rather than learning generalizable strategies.
SBCR Strengths: Performs optimally on balanced conflicts where the two versions are of similar size. It is language-agnostic and robust against non-English content or malformed chunks.
SBCR Weaknesses: Its evaluation function (which maximizes similarity to both parents) struggles with highly imbalanced conflicts, often producing incorrect resolutions that attempt to balance the content rather than reflecting the developer's intent.
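The SBCR weakness on imbalanced conflicts can be illustrated with a small worked example. It assumes the evaluation function is the mean line-level similarity to both parents, which is a simplification of whatever SBCR actually computes, and the conflict itself is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Line-level similarity ratio in [0, 1] between two lists of lines."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def both_parents_fitness(candidate, left, right):
    """Assumed evaluation function: mean similarity to the two parents."""
    return (similarity(candidate, left) + similarity(candidate, right)) / 2


# Hypothetical imbalanced conflict: one side has 1 line, the other has 9.
left = ["timeout = 30"]
right = [f"option_{i} = {i}" for i in range(9)]

# Suppose the developer kept only the large side.
developer_resolution = right

one_sided = right            # matches the developer exactly
balanced_mix = left + right  # keeps content from both parents

one_sided_fitness = both_parents_fitness(one_sided, left, right)
mix_fitness = both_parents_fitness(balanced_mix, left, right)

print(one_sided_fitness, mix_fitness)
print(similarity(one_sided, developer_resolution),
      similarity(balanced_mix, developer_resolution))
```

Under this fitness, the mix that keeps lines from both sides scores higher than the one-sided candidate, even though the one-sided candidate is the one that matches the developer. A search guided by such a function is pulled toward the mix.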

Significance and Claims

The paper concludes that neither paradigm is a "silver bullet." Instead, they exhibit fundamental, context-dependent trade-offs:

GenAI (MergeGen) offers high accuracy and speed for common, imbalanced, or pattern-matching conflicts but risks overfitting and fails catastrophically (e.g., truncation) on inputs outside its training distribution or token limits.
SBSE (SBCR) offers robust, data-independent generalization and handles large or balanced inputs well but lacks the contextual understanding to resolve highly imbalanced conflicts effectively.

The authors advocate for the development of hybrid systems that intelligently route conflicts based on their characteristics. They propose a workflow where a "meta-resolver" directs imbalanced or pattern-based conflicts to MergeGen, while routing large, balanced, or non-English conflicts to SBCR. This approach aims to leverage the complementary strengths of both paradigms to create more robust and reliable automated merge conflict resolution tools.
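A routing rule of this kind might look like the sketch below. Everything in it is an assumption made for illustration: the `Conflict` type, the `route` function, the imbalance threshold of 3, whitespace splitting as a stand-in for real tokenization, and non-ASCII characters as a crude proxy for non-English content. Only the 300-token input limit comes from the experimental setup described earlier.

```python
from dataclasses import dataclass


@dataclass
class Conflict:
    left: list[str]
    right: list[str]


def route(conflict: Conflict, max_input_tokens: int = 300,
          imbalance_threshold: float = 3.0) -> str:
    """Pick a resolver name for a conflict using simple, assumed heuristics."""
    text = "\n".join(conflict.left + conflict.right)

    # Large inputs risk truncation in the LLM-based tool.
    if len(text.split()) > max_input_tokens:
        return "SBCR"

    # Crude proxy for non-English content: any non-ASCII character.
    if not text.isascii():
        return "SBCR"

    # Imbalanced conflicts go to the LLM-based tool, balanced ones to search.
    small, large = sorted((len(conflict.left), len(conflict.right)))
    imbalance = large / max(small, 1)
    return "MergeGen" if imbalance >= imbalance_threshold else "SBCR"


print(route(Conflict(["a = 1"], ["b = 2", "c = 3", "d = 4", "e = 5"])))  # MergeGen
print(route(Conflict(["a = 1", "b = 2"], ["c = 3", "d = 4"])))           # SBCR
```

In practice the thresholds would need to be tuned on held-out conflicts, and the checks are ordered so that the cases where the LLM-based tool fails outright (truncation, unfamiliar text) are ruled out first.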

The study emphasizes that relying on a single paradigm may be insufficient for real-world software development, where conflict scenarios vary widely in size, content balance, and language.

Original paper: LLM-based vs. Search-based Merge Conflict Resolution: An Empirical Study of Competing Paradigms