The Range Shrinks, the Threat Remains: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort

This paper replicates and extends Spracklen et al.'s 2025 study on LLM package hallucinations using five 2026 frontier models, revealing that while hallucination rates have significantly decreased and inter-model variance has narrowed, a persistent threat remains characterized by a newly identified set of 127 model-agnostic hallucinated package names and distinct cross-ecosystem and cross-model behavioral patterns.

Original authors: Aleksandr Churilov (Independent Researcher)

Published 2026-05-19 (author reviewed)

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you are a chef trying to cook a new recipe. You ask a super-smart, AI-powered sous-chef for help. The sous-chef confidently tells you, "You need to buy SuperSpice-9000 from the grocery store!" You go to the store, but SuperSpice-9000 doesn't exist.

In the world of computer coding, this "grocery store" is a digital warehouse called PyPI (for Python) or npm (for JavaScript). These warehouses hold millions of pre-made code "ingredients" (packages) that programmers can download with a single command.

This paper is a follow-up to a scary story told last year. Back then, researchers found that AI chefs were very bad at naming ingredients. They would invent fake names like "SuperSpice-9000" about 5% to 22% of the time. A sneaky thief could register a malicious package with that fake name, wait for a programmer to ask the AI for it, and then trick the programmer into installing a virus. This is called "slopsquatting."
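One partial safeguard is to check whether a suggested name is registered at all before installing it. The sketch below is an illustration only and does not come from the paper. It asks PyPI's public JSON endpoint (https://pypi.org/pypi/<name>/json), which answers 404 for names nobody has registered. The helper names `http_status` and `is_registered_on_pypi` are made up for this example.

```python
import urllib.error
import urllib.parse
import urllib.request

# PyPI's public JSON endpoint; it answers 404 for names nobody has registered.
PYPI_JSON_URL = "https://pypi.org/pypi/{name}/json"


def http_status(url: str) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still what we want.
        return err.code


def is_registered_on_pypi(name: str, fetch_status=http_status) -> bool:
    """True if `name` is a registered PyPI project, False if PyPI answers 404.

    `fetch_status` is injectable (an assumption of this sketch) so the logic
    can be exercised without touching the network.
    """
    url = PYPI_JSON_URL.format(name=urllib.parse.quote(name, safe=""))
    status = fetch_status(url)
    if status == 200:
        return True
    if status == 404:
        return False
    raise RuntimeError(f"unexpected HTTP status {status} for {url}")
```

A name that fails this check is a strong hint that the AI invented it. A name that passes is not automatically safe: once a thief has registered the fake name, it exists, which is exactly what slopsquatting relies on.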

The author of this paper, an independent researcher, asked: "Has the AI gotten better at this two years later?"

Here is what they found, explained simply:

1. The "Fake Ingredient" Problem Got Smaller, But Didn't Go Away

The author tested the five smartest AI coding models available in early 2026 (from companies like Anthropic, OpenAI, Google, and DeepSeek).

  • The Good News: The gap between the "best" AI and the "worst" AI has shrunk dramatically. In 2024, some AIs were terrible (22% fake names) while others were okay (5%). In 2026, they are all roughly the same: they all make up fake names about 4.6% to 6.1% of the time. The "spread" of badness has collapsed.
  • The Bad News: The threat is still very real. Even though the rate dropped, 4–6% is still high enough for a thief to make a profit. If an AI makes a fake name 1 in 20 times, a thief can still register that fake name and wait for thousands of programmers to accidentally download it.
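To see why a rate of roughly 1 in 20 still adds up, here is a small back-of-the-envelope sketch. It assumes every package suggestion is an independent draw with the same chance of being fake, which is a simplification and not the paper's methodology. The function names and the sample sizes (1000 and 20 suggestions) are chosen only for illustration.

```python
def expected_fake_names(rate: float, suggestions: int) -> float:
    """Average number of made-up names among `suggestions` package suggestions."""
    return rate * suggestions


def chance_of_at_least_one_fake(rate: float, suggestions: int) -> float:
    """Probability that at least one of `suggestions` names is made up.

    Assumes each suggestion is an independent draw (a simplification).
    """
    return 1.0 - (1.0 - rate) ** suggestions


# 4.6% and 6.1% are the 2026 rates quoted above; the counts are arbitrary examples.
for rate in (0.046, 0.061):
    fakes = expected_fake_names(rate, 1000)
    chance = chance_of_at_least_one_fake(rate, 20)
    print(f"rate={rate:.1%}: about {fakes:.0f} fakes per 1000 suggestions, "
          f"{chance:.0%} chance of at least one fake in 20 suggestions")
```

Under these assumptions, even a rate near 5% means a programmer who accepts a couple of dozen suggestions is more likely than not to run into at least one invented name.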

2. The "Universal Fake" Discovery

This is the paper's biggest surprise. The author found 127 specific fake names that all five of the top AI models invented.

  • The Analogy: Imagine asking five different expert chefs, "What is the secret ingredient in this soup?" and they all independently say, "It's BlueFlavor-7," even though that ingredient doesn't exist.
  • The Danger: If a thief registers "BlueFlavor-7" once, they can attack users of all five AI companies simultaneously. It's a "universal trap" that doesn't depend on which AI you use.
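In code terms, a "universal fake" is simply a name that sits in the intersection of every model's set of invented packages. The sketch below shows that idea with hypothetical data: the model labels and package names are invented for illustration and are not taken from the paper's list of 127 names, and `model_agnostic_names` is a made-up helper.

```python
def model_agnostic_names(hallucinated_by_model: dict[str, set[str]]) -> set[str]:
    """Return the names that appear in every model's set of made-up packages."""
    name_sets = list(hallucinated_by_model.values())
    if not name_sets:
        return set()
    return set.intersection(*name_sets)


# Hypothetical data: model labels and package names are invented for illustration.
example = {
    "model-a": {"blueflavor-7", "superspice-9000", "made-up-name-3"},
    "model-b": {"blueflavor-7", "made-up-name-3"},
    "model-c": {"blueflavor-7", "superspice-9000"},
}
universal = model_agnostic_names(example)
print(sorted(universal))
```

Only names shared by every model survive the intersection, which is why a single registration of such a name can reach users of all the models at once.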

3. A Few Weird Twists

The paper found some patterns that were the opposite of what we expected:

  • Python vs. JavaScript: In 2024, the AI was worse at naming JavaScript ingredients. In 2026, it's actually worse at naming Python ingredients. The AI seems to be getting confused by the messy naming rules of Python.
  • The "Small" vs. "Big" Brother: Usually, smaller, cheaper AI models make more mistakes than big, expensive ones. But here, the "small" model (Claude Haiku) actually made fewer fake names than its "big brother" (Claude Sonnet). It seems the small model was trained to be extra careful with instructions.

4. Why Did the Problem Shrink?

The author suggests three reasons why the AI is slightly better now:

  1. Leveling the Playing Field: The "open-source" models (free to use) have gotten so good that they are now just as smart as the "commercial" models (paid), so the gap between them closed.
  2. Better Training: The companies feeding the AI data seem to have cleaned up their "cookbooks" (training data) to remove more fake ingredient names.
  3. Standardized Training: All the big AI companies are using similar teaching methods now, so they all make similar (and slightly fewer) mistakes.

The Bottom Line

The AI chefs have cleaned up their act a little bit, but they are still inventing fake ingredients often enough to be dangerous. The most worrying part is that they are all inventing the same fake ingredients.

What the paper does NOT say:

  • It does not say this is a solved problem.
  • It does not say you should stop using AI.
  • It does not claim that all AI models are bad (the author only tested the top 5 "frontier" models; smaller, older models might still be much worse).

The author's main message is: The range of errors has shrunk, but the threat remains. Programmers and security teams need to be aware that even the smartest AIs today can still lead you to a fake, dangerous download.
