Imagine the internet as a giant, ever-growing library. For a long time, this library was filled with books written by humans. But now, AI models are writing new books, and humans are publishing them. Soon, these AI-written books will be used to train the next generation of AI.
This creates a loop: AI writes, humans publish, new AI learns from that, and the cycle repeats.
The paper asks a scary but fascinating question: What happens to the library if we keep recycling the same books over and over? Does the library get smarter, or does it start to lose its mind?
The author, Søren Riis, uses a mathematical model to show that two main forces are at play in this loop: Drift and Selection.
Here is the story of the library, explained simply.
1. The Force of Drift: The "Fading Echo"
The Analogy: Imagine a game of "Telephone" (or "Broken Telephone") played in a very large room. One person whispers a story to the next, who whispers it to the next, and so on.
In the real world, if you whisper a story, you might forget a rare word or a specific detail. If you pass that story on, the next person forgets a little more. Eventually, the story becomes very generic. The weird, unique, and rare details disappear first.
In the AI Library:
- The Problem: AI models are trained on a finite amount of text. When they generate new text, they are essentially "whispering" what they learned.
- The Result: Rare words and complex, unique phrases are the first to vanish. They are like rare coins in a jar: if you repeatedly draw a handful at random and refill the jar with copies of only what you drew, the rare coins will eventually vanish by chance, leaving only the common ones behind.
- The Outcome: The library becomes "shallow." It still has words, but it loses the deep, complex, and rare structures that make human language rich. The AI starts repeating the same safe, common patterns because the "rare" options have drifted away into nothingness.
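The coin-jar intuition can be sketched in a few lines of code. This is my own Wright-Fisher-style toy simulation, not the paper's exact model: each "generation" of AI text is a finite resample of the previous corpus, and rare tokens that drop out never come back.

```python
import random

# Toy drift simulation (illustrative, not the paper's construction).
# 1 marks a rare token, 0 a common one. Each generation "retrains" on
# a finite sample of the previous corpus.
random.seed(42)

def rare_survives(corpus_size=200, rare_copies=2, generations=30):
    corpus = [1] * rare_copies + [0] * (corpus_size - rare_copies)
    for _ in range(generations):
        # Resample with replacement: the next corpus is a finite
        # sample of the current one.
        corpus = random.choices(corpus, k=corpus_size)
    return sum(corpus) > 0  # did any rare token survive?

trials = 200
extinct = sum(not rare_survives() for _ in range(trials))
print(f"rare tokens extinct in {extinct}/{trials} runs")
```

In most runs the rare tokens die out within a few dozen generations, even though nothing ever selects against them: pure sampling noise is enough.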
2. The Force of Selection: The "Editor's Filter"
The Analogy: Now, imagine the library has a strict librarian (the "Editor").
- Scenario A (Descriptive Selection): The librarian just copies whatever is written and puts it on the shelf, regardless of quality. This is like the "Drift" scenario above—the library slowly becomes boring and repetitive.
- Scenario B (Normative Selection): The librarian is picky. They only put books on the shelf if they are correct, novel, or high-quality. They throw away the boring, repetitive, or wrong answers.
In the AI Library:
- The Good News: If the AI is forced to pass a "test" (like a math check or a code verification) before its text is published, it keeps the deep structure alive. The "rare" and "complex" ideas survive because they are the only ones that pass the test.
- The Bad News: If the librarian just copies what is popular (the "status quo"), the library collapses into a shallow state where no amount of "thinking ahead" helps.
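The two librarians can be contrasted with a small extension of the same toy model. This is my own sketch: the "quality check" below is a stand-in for the paper's normative verifier (a math check, a code test), simplified here to "the published corpus must still contain at least one rare token."

```python
import random

# Descriptive vs. normative selection (illustrative toy model).
# 1 marks a rare token, 0 a common one.
random.seed(0)

def next_generation(corpus, normative):
    sample = corpus
    for _ in range(100):  # the picky librarian's retry budget
        sample = random.choices(corpus, k=len(corpus))
        if not normative or 1 in sample:  # the quality check
            return sample
    return sample  # publish the last draft if nothing passed

def run(normative, corpus_size=200, rare_copies=2, generations=30):
    corpus = [1] * rare_copies + [0] * (corpus_size - rare_copies)
    for _ in range(generations):
        corpus = next_generation(corpus, normative)
    return sum(corpus)  # rare copies surviving at the end

desc_left = run(normative=False)
norm_left = run(normative=True)
print("descriptive librarian, rare tokens left:", desc_left)
print("normative librarian,   rare tokens left:", norm_left)
```

The descriptive librarian usually ends up with an empty rare shelf; the normative librarian keeps at least one rare token alive indefinitely, because drafts that lose it are simply never published.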
The Big Discovery: Two Different Futures
The paper proves mathematically that the future of AI text depends entirely on how we filter what gets published.
Future 1: The "Model Collapse" (The Shallow Pool)
If we just let AI generate text and feed it back to itself without strict quality checks, the library becomes a shallow pool.
- What it looks like: The AI can still write sentences, but they lack depth. It's like a song that only has a simple, repetitive beat.
- Why it happens: The "Drift" force wins. Rare ideas die out, and the AI gets stuck in a loop of generating the most statistically probable (and boring) words.
- The Catch: Even if you give the AI deeper lookahead (a bigger brain), it can't help. The information it would need to exploit that lookahead has already been deleted from the library.
Future 2: The "Deep Structure" (The Rich Garden)
If we use Normative Selection (checking for truth, logic, or creativity), the library remains a rich garden.
- What it looks like: The AI continues to produce complex, deep, and surprising text.
- Why it happens: The "Selection" force acts like a gardener, pruning the weeds (bad text) and keeping the rare flowers (complex ideas).
- The Result: The AI keeps getting better at "thinking ahead" because the deep structures it needs to learn are preserved in the library.
The "Lookahead" Metaphor
The paper uses a concept called "Lookahead." Imagine you are walking through a maze.
- Shallow AI: Takes one step at a time, looking only at the tile right in front of it. It often walks into dead ends.
- Deep AI: Looks 5 steps ahead. It sees the dead end and chooses a different path.
The paper shows that if the library is "shallow" (due to Drift), looking 5 steps ahead is useless because the map is broken. But if the library is "deep" (due to good Selection), looking ahead is powerful and keeps the system stable.
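The maze comparison can be made concrete with a toy fork-in-the-road example (my own illustration, not the paper's formal definition). One branch pays off immediately but dead-ends; the other looks worse at first but is richer overall.

```python
# Lookahead toy example (illustrative). Each list is the reward the
# AI would see on the next five tiles of a branch.
branches = {
    "A": [5, 0, 0, 0, 0],  # tempting first step, then a dead end
    "B": [1, 4, 4, 4, 4],  # modest first step, deep payoff later
}

def choose(lookahead):
    # Pick the branch with the most total reward visible within
    # `lookahead` steps.
    return max(branches, key=lambda b: sum(branches[b][:lookahead]))

shallow = choose(1)  # sees only the tile in front of it
deep = choose(5)     # sees five tiles ahead
print("shallow AI picks:", shallow)  # -> A (walks into the dead end)
print("deep AI picks:   ", deep)     # -> B

# If drift has already erased B's deep rewards from the map, the same
# 5-step lookahead has nothing left to find:
branches["B"] = [1, 0, 0, 0, 0]
collapsed = choose(5)
print("deep AI on a collapsed map picks:", collapsed)  # -> A
```

The last line is the paper's point in miniature: lookahead only pays off if the deep structure it is looking for still exists in the data.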
Why This Matters for You
This isn't just about math; it's about the future of the internet and AI.
- If we are careless: If we just let AI write everything and feed it back to itself without checking for quality, we risk creating a "Model Collapse." The internet could become a hall of mirrors, reflecting only the most generic, repetitive, and shallow ideas. We might lose the ability to learn from complex human thought.
- If we are careful: If we build systems that verify facts, check for logic, and reward creativity (Normative Selection), we can sustain a rich, deep, and evolving digital culture. The AI can continue to learn from the best parts of human knowledge.
The Bottom Line
The paper is a warning and a guide. It tells us that recycling AI text is dangerous unless we filter it strictly.
- Drift is the natural tendency for things to become simple and boring over time.
- Selection is the human (or AI) effort to keep things complex, true, and interesting.
To keep our digital future rich, we must ensure that the "Editor" in the loop is strict enough to stop the library from becoming a shallow pool of repetitive noise.