Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models

This paper introduces a large-scale, mutation-based evaluation framework for assessing the robustness of Large Language Models at fault localization. It reveals that their reasoning is often brittle and reliant on syntactic cues rather than deep semantic understanding, as evidenced by a 78% failure rate when the code is subjected to semantic-preserving changes.

Sabaat Haroon, Ahmad Faraz Khan, Ahmad Humayun, Waris Gill, Abdul Haddi Amjad, Ali R. Butt, Mohammad Taha Khan, Muhammad Ali Gulzar

Published 2026-03-06

Imagine you have hired a brilliant, super-fast detective (the Large Language Model, or LLM) to solve a mystery. Your goal is for this detective to find a single, tiny mistake in a massive library of books (the code) and tell you exactly which page and line the mistake is on. This is called Fault Localization.

For a long time, we've been testing these detectives by giving them puzzles they've already seen in their training books. But that's like giving a detective a "Where's Waldo" puzzle they've solved a thousand times before. Of course, they'll get it right! But does that mean they are actually good at solving new mysteries, or are they just memorizing answers?

This paper is like a stress test for these AI detectives. The researchers wanted to see: If we change the way the story is told, but keep the plot exactly the same, will the detective still find the mistake?

Here is the breakdown of their experiment using simple analogies:

1. The Setup: Creating a "Fake" Crime Scene

The researchers didn't use old, known puzzles. Instead, they took thousands of clean, working programs (like a perfectly written recipe for a cake) and injected a specific error (like accidentally adding salt instead of sugar).

They then asked the AI: "Here is the recipe and what the cake is supposed to taste like. Can you find the line where I put the salt?"
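In code terms, the setup looks something like this. The snippet below is a hypothetical illustration, not the paper's actual benchmark code: a clean function, then the same function with a single injected fault.

```python
def average_clean(values):
    """The 'perfect recipe': the mean of a non-empty list."""
    total = 0
    for v in values:
        total += v
    return total / len(values)

def average_buggy(values):
    """Same recipe, but with 'salt instead of sugar' on one line."""
    total = 0
    for v in values:
        total += v
    return total / (len(values) - 1)  # injected fault: wrong divisor

# The model is shown only the buggy version plus the expected behavior
# (e.g., "the average of [2, 4, 6] should be 4.0") and is asked to name
# the exact line containing the fault.
```

Here the fault is a single-character change, so localizing it means pointing at the final `return` line.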

2. The Twist: The "Semantic-Preserving" Magic Trick

This is the most important part. Once the AI successfully found the salt in the original recipe, the researchers played a trick on it. They applied Semantic-Preserving Mutations (SPMs).

Think of this like rearranging a room without changing the furniture.

  • Original Room: A chair is in the corner.
  • The Trick: They paint the chair blue, move a rug, add a fake plant, and rename the chair "The Blue Throne."
  • The Reality: The chair is still in the exact same spot. The room functions exactly the same.

In code terms, they:

  • Renamed variables (e.g., changed count to index).
  • Added misleading comments (e.g., writing "This code is for dragons" when it's actually for a calculator).
  • Inserted "dead code" (lines of code that never run, like a door that is painted shut).
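To make the three mutation types concrete, here is a hypothetical sketch of a buggy averaging function after all three have been applied. The names and comments are invented for illustration; the key point is that the program's behavior, and the fault itself, are completely unchanged.

```python
def dragon_power(dragon_scales):
    """Computes dragon fire intensity."""  # misleading comment: it's just an average
    if False:                   # dead code: this branch can never run
        return -1
    fire = 0                    # renamed variable (was: total)
    for scale in dragon_scales: # renamed variable (was: v in values)
        fire += scale
    return fire / (len(dragon_scales) - 1)  # the SAME injected fault as before
```

A model that truly understands the logic should still point at the last line; a model leaning on surface cues may instead be distracted by the dragon-themed names, the comment, or the dead branch.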

The Question: If the AI is truly smart and understands the logic of the recipe, it should ignore the paint and the fake plant and still point to the salt. If it's just skimming the surface, it will get confused by the new paint job.

3. The Results: The Detectives Got Confused

The results were shocking. Even though the "plot" of the code didn't change at all:

  • 78% of the time, the AI failed.
  • When the researchers added "dead code" (fake plants), the AI's accuracy dropped to about 20%.
  • When they added "misleading comments" (fake signs), the AI got tricked easily.

It turns out, the AI detectives were not reading the story deeply. They were skimming the cover and the font. If you changed the font color or added a weird sticker to the cover, they forgot what the story was about.

4. Other Interesting Findings

  • The "First Page" Bias: The AI was much better at finding mistakes in the first 25% of the code (the beginning of the book) and terrible at finding them in the last 25%. It's like a reader who gets tired and stops paying attention after the first few chapters.
  • Python vs. Java: The AI struggled more with Java (a very strict, verbose language) than Python (a more flexible language), likely because Java has more "words" to get lost in.
  • Newer isn't Always Better: Even the newest, most expensive AI models (like the latest Claude or Gemini) only got slightly better at this. They are still easily tricked by simple tricks.

The Big Takeaway

This paper is a wake-up call. It tells us that while AI is amazing at writing code (generating a story), it is currently not very good at reasoning about code (solving a mystery).

The AI is like a student who memorized the answers to a math test but doesn't actually understand algebra. If you change the numbers slightly or write the question in a different handwriting, they fail.

What needs to happen?
We need to teach these AI models to stop looking at the "font" and the "stickers" and start understanding the logic underneath. Until they can do that, we can't fully trust them to fix bugs in our critical software, because a simple change in how the code looks could make them miss a disaster.