Imagine you hire a very smart, well-read apprentice to fix a leaky roof. This apprentice has read every book on construction ever written and can speak perfect English. However, they have never actually held a hammer or seen a real storm.
This paper is a report card on that apprentice (an AI called a Large Language Model, or LLM) trying to fix security holes in computer code. The researchers wanted to see if the AI could patch up dangerous software bugs without breaking the house in the process.
Here is the breakdown of what they found, using some everyday analogies:
1. The Setup: The "Three-Point Check"
To see if the AI did a good job, the researchers didn't just ask, "Did you fix it?" They used a three-point checklist:
- Did it compile? (Did the roof even stay on the house, or did the AI use the wrong nails?)
- Is it secure? (Did the patch actually stop the rain, or did the AI just paint over the hole?)
- Does it work? (Did the patch break the windows or the door while trying to fix the roof?)
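The checklist above can be sketched as a tiny verification harness. This is a hypothetical illustration, not the paper's actual tooling: the three checks are supplied as callables, and the category names mix labels from this summary ("Insecure & Breaking", "Functional but Insecure") with invented ones ("Broken build", "Secure but Breaking").

```python
def evaluate_patch(compile_fn, exploit_fn, test_fn):
    """Run the three gates in order. Each argument is a hypothetical callable:
    compile_fn() -> True if the patched code builds,
    exploit_fn() -> True if the proof-of-concept exploit still succeeds,
    test_fn()    -> True if the functional test suite passes."""
    compiles = compile_fn()
    return {
        "compiles": compiles,
        # A patch is secure only if it builds AND the exploit now fails.
        "secure": compiles and not exploit_fn(),
        # It is functional only if it builds AND the old tests still pass.
        "functional": compiles and test_fn(),
    }

def classify(result):
    """Map the three booleans onto human-readable outcome categories."""
    if not result["compiles"]:
        return "Broken build"
    if result["secure"] and result["functional"]:
        return "Correct patch"
    if result["functional"]:
        return "Functional but Insecure"
    if result["secure"]:
        return "Secure but Breaking"
    return "Insecure & Breaking"
```

For example, a patch whose exploit still succeeds but whose tests still pass (`evaluate_patch(lambda: True, lambda: True, lambda: True)`) classifies as "Functional but Insecure": exactly the silent killer discussed below.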
2. The Big Surprise: The "Perfectly Broken" Patch
The most shocking finding is that only about 1 in 4 patches (24.8%) actually worked correctly.
But here is the scary part: more than half of the patches (51.4%) were "Insecure & Breaking."
- The Analogy: Imagine the AI fixes a hole in the wall by gluing a heavy, jagged rock over it. The rock stays on (it compiles), but it blocks the door (breaks functionality) and the rain still gets in through the cracks (insecure).
- The Cause: The AI didn't make a typo. It knew the grammar of the code perfectly. The problem was semantic misunderstanding. It didn't understand what the bug was; it just guessed a fix that sounded right but was fundamentally wrong.
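To make the "jagged rock" concrete, here is a hypothetical command-injection bug and an equally hypothetical bad patch. All function names and the blocklist rule are invented for this sketch; it illustrates the failure mode, not a case from the paper.

```python
# Original vulnerability: the host string is spliced into a shell command,
# so "8.8.8.8; cat /etc/passwd" would run an attacker-chosen command.
def build_ping_vulnerable(host):
    return "ping -c 1 " + host        # intended for subprocess.run(..., shell=True)

# "Insecure & Breaking" patch: the model guessed the danger was the
# substring "rm" and blocklisted it. The code still runs (it "compiles"), but:
#   - legitimate hosts like "ferm.example.com" are now rejected -> breaks functionality
#   - ";"-separated injection sails straight through            -> still insecure
def build_ping_bad_patch(host):
    if "rm" in host:
        raise ValueError("dangerous input")
    return "ping -c 1 " + host

# A semantically correct fix avoids the shell entirely: pass argv as a list,
# so the host string can never be interpreted as a second command.
def build_ping_fixed(host):
    return ["ping", "-c", "1", host]
```

The bad patch "sounds right" (it rejects something scary-looking) while missing the actual mechanism of the bug, which is the point of the analogy above.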
3. The "Silent Killer": The 10% Risk
The researchers found a specific type of failure that is the most dangerous: Functional but Insecure.
- The Analogy: The AI fixes the roof perfectly. The house looks great, the door opens, and the lights work. But the AI forgot to put a lock on the front door.
- Why it's scary: If you run standard tests (checking if the lights work), the patch looks perfect. It passes the "CI/CD pipeline" (the automated quality check). But a hacker could walk right in.
- The Stat: About 10% of the patches fell into this category. For certain types of bugs (like "Permissions"), this number jumped to 35%.
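Here is a sketch of what a "Functional but Insecure" permissions patch can look like in practice. The document store, the `User` class, and both test functions are invented for this illustration.

```python
documents = {}

class User:
    def __init__(self, name, is_admin=False):
        self.name = name
        self.is_admin = is_admin

# Suppose the original bug was a crash (KeyError) when deleting a missing
# document. This hypothetical patch fixes the crash -- but silently drops
# the admin check that should gate deletion.
def delete_document_patched(user, doc_id):
    documents.pop(doc_id, None)   # crash fixed: the functional tests pass
    return True                   # ...but ANY user can now delete ANY document

# A behavior-only test suite is green, so the patch clears the CI/CD gate:
def functional_test_passes():
    documents["tmp"] = "x"
    ok = delete_document_patched(User("alice", is_admin=True), "tmp")
    return ok and "tmp" not in documents

# Only a test that attacks the patch as a non-admin reveals the hole:
def non_admin_can_delete():
    documents["tmp"] = "x"
    delete_document_patched(User("mallory", is_admin=False), "tmp")
    return "tmp" not in documents   # True: the lock on the front door is gone
```

Both functions return `True`: the lights work, and the door has no lock.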
4. The "All-or-Nothing" Pattern
The researchers measured success on a scale from 0 to 1. They expected to see a bell curve—some patches being "almost right," some "okay," and some "great."
- The Reality: The results were bimodal (two distinct peaks).
- Peak A: The AI got it 100% right.
- Peak B: The AI got it 0% right (or just barely functional but insecure).
- The Gap: There were almost no "near-misses." You rarely saw a patch that was "80% secure."
- The Lesson: You can't just "tweak" a failed AI patch. If the AI doesn't understand the security concept, it's not a matter of "almost there." It's a matter of "completely missing the point." It's like asking a chef who doesn't know what salt is to "add a little salt." They might add sugar, or nothing at all.
5. What Makes a Bug Hard?
The researchers asked: "Are complex bugs harder to fix?"
- The Answer: No. Neither the size of the code nor the complexity of its logic mattered much.
- The Real Factor: The Type of Bug.
- Easy: Bugs like "Infinite Loops" (a car stuck driving in circles) were tractable. The AI could spot the pattern and break the loop (45% success rate).
- Hard: Bugs like "Input Validation" (checking if a user is lying about their age) were impossible for the AI (0% success rate).
- Why? Fixing an infinite loop is mechanical (like following a recipe). Fixing input validation requires context and judgment (like knowing why someone might lie and what a "valid" age looks like in that specific situation). The AI lacks that real-world context.
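The contrast can be shown in a few lines. Both bugs and both fixes below are invented examples, not cases from the paper.

```python
# (a) Infinite loop: the defect and its cure are both visible in the code.
def count_down_buggy(n):
    out = []
    while n > 0:          # bug: n never changes, so the loop never exits
        out.append(n)
    return out

def count_down_fixed(n):
    out = []
    while n > 0:
        out.append(n)
        n -= 1            # mechanical fix: make progress toward the exit
    return out

# (b) Input validation: the code alone cannot tell you what "valid" means.
def set_age_naive(profile, age):
    profile["age"] = int(age)       # happily accepts -5 or 90000

def set_age_validated(profile, age):
    age = int(age)
    # The bounds are a judgment call that depends on the application
    # (a pediatrics system and a pension system disagree) -- exactly the
    # real-world context the model lacks. 0..130 is an assumption here.
    if not 0 <= age <= 130:
        raise ValueError("implausible age")
    profile["age"] = age
```

The loop fix can be derived from the code in front of you; the validation bounds cannot, which is why the first class of bug was tractable and the second was not.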
6. The Takeaway: Don't Trust the AI Blindly
The paper concludes with a warning for anyone using AI to fix security holes:
- Don't assume "Working" means "Safe." Just because the code runs doesn't mean the door is locked.
- Human Review is Mandatory. You cannot deploy an AI patch without a human security expert double-checking it, especially for input validation and permissions.
- The "Security Repair Score" (SRS): The authors created a new scoring system to measure partial success. It helps us see that while the AI is great at keeping the lights on (functionality), it is currently terrible at locking the doors (security).
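One plausible shape for such a composite score, purely as a sketch: the weights and the formula below are assumptions, and the paper's exact SRS definition is not reproduced here.

```python
def security_repair_score(compiles, secure, functional,
                          w_compile=0.2, w_secure=0.5, w_functional=0.3):
    """Hypothetical weighted composite: security counts most, but a patch
    earns partial credit for building and for preserving behavior."""
    if not compiles:
        return 0.0        # a patch that will not even build earns nothing
    return w_compile + w_secure * secure + w_functional * functional
```

Under these made-up weights, the ideal patch scores 1.0, while a patch that "keeps the lights on" but leaves the door unlocked scores 0.5 rather than a flat fail. A partial-credit scale like this is also what makes the all-or-nothing pattern in section 4 visible: scores pile up near the two poles instead of spreading across the middle.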
In short: The AI is a brilliant mimic that speaks the language of code fluently, but it is currently a terrible security guard. It can build a beautiful wall, but it often forgets to put a gate in it, or worse, builds the gate in the wrong place. Until it learns to understand why a hole is dangerous, humans must remain the final gatekeepers.