Extracting Recurring Vulnerabilities from Black-Box LLM-Generated Software

This paper introduces FSTab, a framework showing that LLM-generated software exhibits predictable, recurring vulnerabilities. By exploiting these patterns, FSTab enables black-box attacks based on frontend features alone and quantifies how consistently the same flaws reappear across different domains and model variations.

Tomer Kordonsky, Maayan Yamin, Noam Benzimra, Amit LeVi, Avi Mendelson

Published 2026-03-10

Here is an explanation of the paper "Extracting Recurring Vulnerabilities from Black-Box LLM-Generated Software," told in simple language with creative analogies.

The Big Idea: The "Bad Habit" of AI Coders

Imagine you hire a very talented, super-fast chef (the AI) to cook thousands of different meals for a restaurant. The chef is amazing at speed and creativity, but they have a strange quirk: they always use the same slightly dangerous knife technique whenever they chop onions, no matter what dish they are making.

If you only look at the finished plate (the front of the website), you see a beautiful salad. You don't see the knife. But if you know which chef made it, you can predict with high certainty that the salad was cut with that dangerous technique, even if you never saw the kitchen.

This paper introduces a tool called FSTab (Feature–Security Table) that does exactly this for software. It proves that when AI models write code, they don't just make random mistakes. They develop predictable bad habits that repeat over and over again.


The Problem: The "Black Box" Kitchen

Usually, when security experts check software, they need to look inside the code (the kitchen) to find bugs. This is like needing a master key to walk into the kitchen and check the knives.

But in the real world, many companies use AI to build software, and they often don't give you the source code (the recipe book). You only see the website or app (the finished meal). This is called a "Black Box."

The researchers asked: If we can't see the code, can we still guess where the security holes are just by looking at what the app does?

The Solution: The "Cheat Sheet" (FSTab)

The researchers built a "Cheat Sheet" called FSTab. Here is how it works, step-by-step:

1. The Training Phase (Learning the Chef's Habits)

First, the researchers asked an AI to write 1,000 different websites (like a bakery, a bank, a social media site). They then looked at the code to see where the AI messed up.

  • The Discovery: They found that whenever the AI was asked to build a "Login Page," it almost always forgot to put a lock on the door. Whenever it built a "File Upload," it almost always left a window open.
  • The Pattern: The AI wasn't making random mistakes. It was following a specific, flawed template for every specific feature.
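The training phase boils down to counting: for each frontend feature, how often did each security hole appear alongside it? Here is a minimal sketch of building such a feature-to-vulnerability table. The sample data and labels (`missing_rate_limit`, `weak_session`, etc.) are hypothetical illustrations, not the paper's actual dataset or FSTab's real implementation.

```python
from collections import defaultdict

def build_fstab(audited_samples):
    """Aggregate audited (features, vulnerabilities) observations into a
    feature -> vulnerability probability table (the 'cheat sheet')."""
    counts = defaultdict(lambda: defaultdict(int))   # feature -> vuln -> hits
    totals = defaultdict(int)                        # feature -> sample count
    for features, vulns in audited_samples:
        for feature in features:
            totals[feature] += 1
            for vuln in vulns:
                counts[feature][vuln] += 1
    # Convert raw counts into empirical probabilities.
    return {
        feature: {v: n / totals[feature] for v, n in vuln_counts.items()}
        for feature, vuln_counts in counts.items()
    }

# Toy audit of three generated sites (hypothetical labels, not the paper's data).
samples = [
    ({"login", "search"}, {"missing_rate_limit"}),
    ({"login"},           {"missing_rate_limit", "weak_session"}),
    ({"file_upload"},     {"unrestricted_upload"}),
]
fstab = build_fstab(samples)
print(fstab["login"]["missing_rate_limit"])  # 1.0: both login sites had it
```

With 1,000 audited sites instead of three, these probabilities become the per-feature "habit profile" the cheat sheet records.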

2. The Attack Phase (Using the Cheat Sheet)

Now, imagine a hacker wants to break into a new website. They don't have the source code.

  • Step 1: They look at the website's front page. They see a "Login" button and a "Search" bar.
  • Step 2: They check the Cheat Sheet (FSTab) for the specific AI model that built the site (e.g., "GPT-5.2").
  • Step 3: The Cheat Sheet says: "If you see a Login button on a GPT-5.2 site, there is a 90% chance the backend has a specific type of security hole."
  • Result: The hacker knows exactly where to strike without ever seeing the code.
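The attack phase is then just a table lookup: spot the features on the front page, fetch each feature's known flaws from the cheat sheet, and rank the guesses. A minimal sketch, with an illustrative table whose probabilities are made up for the example (not measured values from the paper):

```python
def predict_vulnerabilities(fstab, observed_features, threshold=0.5):
    """Given frontend features spotted on a black-box site, return the
    vulnerabilities the table predicts, ranked by probability."""
    predictions = {}
    for feature in observed_features:
        for vuln, prob in fstab.get(feature, {}).items():
            # Keep the strongest signal if several features imply the same flaw.
            predictions[vuln] = max(predictions.get(vuln, 0.0), prob)
    return sorted(
        ((v, p) for v, p in predictions.items() if p >= threshold),
        key=lambda item: item[1], reverse=True,
    )

# Hypothetical cheat sheet for one model; values are illustrative only.
fstab = {
    "login":  {"missing_rate_limit": 0.9, "weak_session": 0.4},
    "search": {"sql_injection": 0.7},
}
print(predict_vulnerabilities(fstab, {"login", "search"}))
# Ranked guesses: missing_rate_limit (0.9), then sql_injection (0.7)
```

The attacker never touched the backend code; the ranking comes entirely from habits learned on other sites built by the same model.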

The Four "Fingerprints" of AI Mistakes

The paper measures how stubborn these bad habits are using four creative concepts:

  1. Feature Recurrence (The "Same Song, Different Lyrics"):
    Does the AI make the same mistake every time it builds a "Login" feature, even if the rest of the code looks different? Yes. It's like a singer who always hits the same wrong note on the word "love," no matter what song they are singing.

  2. Rephrasing Persistence (The "Stubborn Chef"):
    If you ask the AI to "Build a login" vs. "Create a sign-in page" vs. "Make a user entry system," does it still make the same mistake? Yes. The AI is so stuck in its ways that changing the words you use doesn't change the bad code it writes.

  3. Domain Recurrence (The "Specialty Shop"):
    Does the AI make the same mistakes in a "Banking App" as it does in a "Blog"? Sometimes. It has specific bad habits for specific types of tasks (like handling money), but it might be safer when writing a blog.

  4. Cross-Domain Transfer (The "Universal Bad Habit"):
    This is the scariest part. The researchers found that if they learned the AI's bad habits from a "Blog," they could use that knowledge to hack a "Banking App" it built later. The bad habits are so deep in the AI's brain that they travel across completely different types of software.

The Results: The "Universality Gap"

The study tested top AI models (like GPT-5.2, Claude, and Gemini). They found something shocking:

  • The AI is more predictable than a human. A human programmer might make a mistake once and learn from it. The AI, however, seems to have a "hardwired" flaw.
  • High Success Rate: Using their Cheat Sheet, the researchers could predict hidden security holes with up to 94% accuracy, even when they had never seen that specific type of software before.

Why This Matters (The Takeaway)

Think of AI-generated software like a mass-produced toy. If a toy factory has a defect in its mold, every single toy coming off the line will have that same defect. You don't need to inspect every toy individually; you just need to know which mold was used.

The paper warns us:

  1. AI isn't just "randomly" bad. It has specific, repeatable security flaws.
  2. We can predict these flaws. Just by looking at the outside of an app, we can guess what's broken inside if we know which AI built it.
  3. We need new defenses. We can't just rely on checking the code after it's written. We need to fix the "molds" (the AI models) themselves so they stop stamping these dangerous patterns into every piece of software they create.

In short: The paper shows that AI coders have "muscle memory" for making mistakes, and we can now use that knowledge to find the weak spots in software without ever needing to see the code.