Imagine you are a chef trying to cook a new recipe. You ask a super-smart, AI-powered sous-chef for help. The sous-chef confidently tells you, "You need to buy SuperSpice-9000 from the grocery store!" You go to the store, but SuperSpice-9000 doesn't exist.

In the world of computer coding, this "grocery store" is a digital warehouse called PyPI (for Python) or npm (for JavaScript). These warehouses hold millions of pre-made code "ingredients" (packages) that programmers can download with a single command.

This paper is a follow-up to a scary story from a study published in 2025 that tested 2024-era AI models. Back then, researchers found that AI chefs were often bad at naming ingredients. They would invent fake names like "SuperSpice-9000" about 5% to 22% of the time, depending on the model. A sneaky thief could register a malicious package under one of those fake names, wait for a programmer to follow the AI's suggestion, and trick the programmer into installing malware. This is called "slopsquatting."

The author of this paper, an independent researcher, asked: "Has the AI gotten better at this two years later?"

Here is what they found, explained simply:

1. The "Fake Ingredient" Problem Got Smaller, But Didn't Go Away

The researchers tested five frontier AI coding models released between October 2025 and March 2026 (from Anthropic, OpenAI, Google, and DeepSeek).

The Good News: The gap between the "best" AI and the "worst" AI has shrunk dramatically. In 2024, some AIs were terrible (22% fake names) while others were okay (5%). In 2026, they are all roughly the same: they all make up fake names about 4.6% to 6.1% of the time. The "spread" of badness has collapsed.
The Bad News: The threat is still very real. Even though the rate dropped, 4–6% is still high enough for a thief to make a profit. If an AI makes a fake name 1 in 20 times, a thief can still register that fake name and wait for thousands of programmers to accidentally download it.

2. The "Universal Fake" Discovery

This is the paper's biggest surprise. The researchers found 127 specific fake names that all five of the top AI models invented.

The Analogy: Imagine asking five different expert chefs, "What is the secret ingredient in this soup?" and they all independently say, "It's BlueFlavor-7," even though that ingredient doesn't exist.
The Danger: If a thief registers "BlueFlavor-7" once, they can attack users of all five AI models (which come from four different companies) simultaneously. It's a "universal trap" that doesn't depend on which AI you use.

3. A Few Weird Twists

The paper found some patterns that were the opposite of what we expected:

Python vs. JavaScript: In 2024, the AI was worse at naming JavaScript ingredients. In 2026, it's actually worse at naming Python ingredients. The AI seems to be getting confused by the messy naming rules of Python.
The "Small" vs. "Big" Brother: Usually, smaller, cheaper AI models make more mistakes than big, expensive ones. But here, the "small" model (Claude Haiku) actually made fewer fake names than its "big brother" (Claude Sonnet). It seems the small model was trained to be extra careful with instructions.

4. Why Did the Problem Shrink?

The author suggests three reasons why the AI is slightly better now:

Leveling the Playing Field: The "open-source" models (free to use) have gotten so good that they are now just as smart as the "commercial" models (paid), so the gap between them closed.
Better Training: The companies feeding the AI data seem to have cleaned up their "cookbooks" (training data) to remove more fake ingredient names.
Standardized Training: All the big AI companies are using similar teaching methods now, so they all make similar (slightly better) mistakes.

The Bottom Line

The AI chefs have cleaned up their act a little bit, but they are still inventing fake ingredients often enough to be dangerous. The most worrying part is that they are all inventing the same fake ingredients.

What the paper does NOT say:

It does not say this is a solved problem.
It does not say you should stop using AI.
It does not claim that all AI models are bad (they only tested the top 5 "frontier" models; smaller, older models might still be much worse).

The author's main message is: The range of errors has shrunk, but the threat remains. Programmers and security teams need to be aware that even the smartest AIs today can still lead you to a fake, dangerous download.

Technical Summary: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort

Problem Statement

The paper addresses the security vulnerability known as slopsquatting, a supply-chain attack vector where adversaries register malicious packages on PyPI or npm under names that Large Language Models (LLMs) hallucinate. When developers trust LLM-generated code containing pip install or npm install directives for non-existent packages, they inadvertently install these malicious artifacts.

While Spracklen et al. (USENIX Security '25) established the existence of this threat in 2024, reporting hallucination rates ranging from 5.2% (commercial models) to 21.7% (open-source models), it remained an open empirical question whether this phenomenon had evolved with the rapid advancement of frontier models released between late 2025 and early 2026. Specifically, the authors sought to determine if the hallucination rates had decreased, if the inter-model variance had narrowed, and if new, model-agnostic attack surfaces had emerged.

Methodology

The study is a faithful replication of Spracklen et al.'s methodology applied to a new cohort of five frontier code-capable LLMs released between October 2025 and March 2026:

Claude Sonnet 4.6 (Anthropic)
Claude Haiku 4.5 (Anthropic)
GPT-5.4-mini (OpenAI)
Gemini 2.5 Pro (Google)
DeepSeek V3.2 (DeepSeek)

Experimental Design:

Prompt Corpus: The study reuses the exact prompt datasets from the Spracklen artifact: 20,163 Stack Overflow questions and 19,806 LLM-synthesized questions (39,969 prompts in total), split roughly evenly between Python and JavaScript. For comparison, the original study ran 576,000 prompts in total across 16 models.
Generation: A total of 199,845 code samples were generated (39,969 per model across the five models).
Extraction & Validation: Package references were extracted using regex-based heuristics matching pip install, npm install, and import statements. Extracted names were validated against master lists of existing packages for PyPI (500,565 names) and npm (~3 million names) as of April 28, 2026.
Statistical Analysis: Hallucination rates were calculated as the ratio of non-resolving references to total references. Statistical significance was tested using Pearson $\chi^2$ statistics with Holm–Bonferroni correction for pairwise comparisons, alongside Jaccard similarity metrics to measure overlap in hallucinated names.
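
The extraction, validation, and rate steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the regexes, the helper names (`extract_references`, `hallucination_rate`), and the choice to deduplicate references within a sample are all hypothetical. Import statements are left out because mapping module names to distribution names needs extra assumptions. PyPI names are normalized following PEP 503.

```python
import re

# Hypothetical patterns: the paper's actual regexes are not reproduced here.
PIP_RE = re.compile(r"pip3?\s+install\s+([^\n;&|#]+)")
NPM_RE = re.compile(r"npm\s+install\s+([^\n;&|#]+)")


def normalize_pypi(name: str) -> str:
    """Normalize a PyPI project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def strip_npm_version(token: str) -> str:
    """Drop a trailing @version; a leading '@' marks a scope, not a version."""
    at = token.find("@", 1)
    return token[:at] if at != -1 else token


def extract_references(sample: str) -> set[tuple[str, str]]:
    """Return (ecosystem, name) pairs found in install commands of one sample."""
    refs = set()
    for match in PIP_RE.finditer(sample):
        for token in match.group(1).split():
            if token.startswith("-"):  # skip flags such as -U
                continue
            # Cut off version specifiers and extras, e.g. ==2.31 or [security].
            name = re.split(r"[=<>!~\[;]", token, maxsplit=1)[0]
            if name:
                refs.add(("pypi", normalize_pypi(name)))
    for match in NPM_RE.finditer(sample):
        for token in match.group(1).split():
            if token.startswith("-"):  # skip flags such as --save-dev
                continue
            name = strip_npm_version(token)
            if name:
                refs.add(("npm", name.lower()))
    return refs


def hallucination_rate(samples: list[str], known: dict[str, set[str]]) -> float:
    """Ratio of non-resolving references to total references."""
    total = missing = 0
    for sample in samples:
        for ecosystem, name in extract_references(sample):
            total += 1
            if name not in known[ecosystem]:
                missing += 1
    return missing / total if total else 0.0
```

Here `known` stands in for the master lists of registered names, with PyPI entries normalized the same way. A real pipeline would also need to handle cases this sketch ignores, such as `pip install -r requirements.txt`.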

Key Contributions

Replication on Frontier Models: A comprehensive measurement of package hallucination rates across five state-of-the-art models, generating a new baseline for 2026.
Identification of Range Compression: Documentation of a significant narrowing in the inter-model hallucination spread compared to 2024 data.
Discovery of Universal Hallucinations: The identification of a set of 127 package names (109 on PyPI, 18 on npm) that are hallucinated identically by all five evaluated models, constituting a model-agnostic attack surface.
Observation of Anomalies:
- A reversal of the Python/JavaScript hallucination asymmetry (Python rates are now higher).
- An inversion within the Anthropic family where the smaller model (Haiku 4.5) hallucinates less than the larger model (Sonnet 4.6).
- A high Jaccard similarity (0.343) between DeepSeek V3.2 and GPT-5.4-mini, suggesting shared training data origins or convergent error patterns.
Open Science Artifact: Release of replication code, validation logs, and analysis scripts, with a verified-researcher access policy for the full hallucination corpus.

Results

Hallucination Rates and Range Compression

The study found that hallucination rates across the 2026 cohort range from 4.62% (Claude Haiku 4.5) to 6.10% (GPT-5.4-mini).

Compression: The inter-model spread narrowed roughly 11-fold, from 16.5 percentage points in Spracklen's 2024 findings (5.2%–21.7%) to 1.48 percentage points in the 2026 cohort (4.62%–6.10%).
Cause: The compression is attributed to the closing gap between open-weight and commercial models (e.g., DeepSeek V3.2 is now competitive with commercial leaders) and the saturation of training data curation regarding package references.
Persistence: Despite the compression, the threat remains economically viable for adversaries, as even a 4.62% rate yields hundreds of unique hallucinated names per model.
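
The 11-fold figure can be checked from the endpoints quoted in this summary. The sketch below assumes "spread" means the maximum rate minus the minimum rate, in percentage points. The function name `spread` is illustrative.

```python
def spread(rates_percent: list[float]) -> float:
    """Inter-model spread: highest rate minus lowest rate, in percentage points."""
    return max(rates_percent) - min(rates_percent)


# Endpoints quoted in this summary (percent of references that do not resolve).
spread_2024 = spread([5.2, 21.7])   # Spracklen et al. range
spread_2026 = spread([4.62, 6.10])  # Claude Haiku 4.5 to GPT-5.4-mini

compression = spread_2024 / spread_2026
print(f"2024 spread: {spread_2024:.2f} pp")
print(f"2026 spread: {spread_2026:.2f} pp")
print(f"compression: {compression:.1f}x")
```

Only the two extremes matter for this calculation, so the rates of the remaining models are not needed.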

Universal Hallucination Set

A critical finding is the existence of 127 package names hallucinated by all five models.

Significance: This creates a "model-agnostic" attack surface. An attacker registering a single malicious package (e.g., opentelemetry or @ember/service) can target users of any of the five evaluated models simultaneously.
Mechanism: The authors suggest these universal errors stem from shared training data substrings (e.g., documentation misusing names) or systematic overgeneralization of namespace conventions (e.g., treating internal subpackages as installable targets).
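
A universal set of this kind is the intersection of the per-model sets of hallucinated names. The following is a small sketch of that computation, not the paper's analysis script. The model labels and the toy data are hypothetical, apart from `opentelemetry`, which the summary cites as an example.

```python
from collections import Counter


def universal_set(per_model: dict[str, set[str]]) -> set[str]:
    """Names hallucinated by every model in the cohort."""
    sets = list(per_model.values())
    return set.intersection(*sets) if sets else set()


def model_counts(per_model: dict[str, set[str]]) -> Counter:
    """For each name, how many models hallucinated it."""
    return Counter(name for names in per_model.values() for name in names)


# Toy data for illustration only; these are not the paper's results.
toy = {
    "model_a": {"opentelemetry", "superspice-9000"},
    "model_b": {"opentelemetry", "blueflavor-7"},
    "model_c": {"opentelemetry", "superspice-9000"},
}

print(universal_set(toy))
print(model_counts(toy).most_common())
```

The per-name counts also show names shared by most but not all models, which an attacker could find nearly as attractive as the fully universal ones.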

Specific Anomalies

Language Asymmetry: Contrary to 2024 findings where JavaScript was "noisier," all five 2026 models exhibited higher hallucination rates for Python (+2.73 to +4.13 percentage points higher than JavaScript). The authors hypothesize this is due to Python's more heterogeneous naming conventions (snake_case, dashes, dots) compared to JavaScript's flatter structure.
Anthropic Inversion: Within the Anthropic family, Claude Haiku 4.5 (4.62%) hallucinated significantly less than Claude Sonnet 4.6 (5.41%). This contradicts the typical pattern where smaller models hallucinate more. The authors attribute this to Haiku 4.5's default "extended-thinking" capability and specific post-training emphasis on instruction fidelity.
DeepSeek/OpenAI Convergence: DeepSeek V3.2 and GPT-5.4-mini showed the highest pairwise Jaccard similarity (0.343), suggesting shared biases or training data origins.
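
Jaccard similarity between two models is the size of the intersection of their hallucinated-name sets divided by the size of the union. The sketch below computes it for every pair of models. The function names and example sets are hypothetical, not taken from the paper's corpus.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(per_model: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Jaccard similarity for every unordered pair of models."""
    return {
        (m1, m2): jaccard(per_model[m1], per_model[m2])
        for m1, m2 in combinations(sorted(per_model), 2)
    }


# Toy sets for illustration only.
example = {
    "model_a": {"x", "y", "z"},
    "model_b": {"y", "z", "w"},
    "model_c": {"q"},
}
for pair, score in pairwise_jaccard(example).items():
    print(pair, round(score, 3))
```

On this scale, a value of 0.343 means that about a third of the names hallucinated by either model were hallucinated by both.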

Significance and Claims

The paper concludes that while the range of hallucination rates has shrunk, the threat has not been retired.

Economic Viability: At the measured rates of roughly 4.6% to 6.1%, the slopsquatting attack remains highly profitable for adversaries due to the zero-cost nature of package registration.
Methodological Shift: The authors argue that single-model studies are insufficient. The existence of a universal hallucination set means that the total attack surface is underestimated if only one model is evaluated. Cross-cohort intersection analysis should become a standard metric in future security research.
Defense Implications: The findings highlight that safety post-training and model scaling have reduced variance but have not eliminated the fundamental issue of models converging on specific, incorrect package names. The authors emphasize that the "frontier" has compressed, but lower-tier open-source models may still exhibit the high rates observed in 2024.

The study maintains a modest tone regarding its claims, noting limitations such as the potential for training data leakage (since the prompt corpus was released in 2025) and the exclusion of agentic configurations where retrieval mechanisms might mitigate hallucinations. The primary contribution is the empirical evidence that the slopsquatting threat persists and has evolved into a multi-provider vulnerability.

The Range Shrinks, the Threat Remains: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort