Experimental evidence of progressive ChatGPT models self-convergence

This paper presents experimental evidence that recent ChatGPT models exhibit "model self-convergence," a phenomenon where their text outputs become increasingly similar and less diverse over time, likely due to the accumulation of synthetic data generated by LLMs in their training datasets.

Konstantinos F. Xylogiannopoulos, Petros Xanthopoulos, Panagiotis Karampelas, Georgios A. Bakamitsos

Published 2026-03-16

The Big Idea: AI is Getting "Stuck in a Loop"

Imagine a group of chefs (the AI models) who are famous for writing recipes. For years, they learned by reading millions of cookbooks written by real humans. They were creative, varied, and could describe a "chocolate cake" in a thousand different delicious ways.

However, the internet has changed. Now, instead of reading only human cookbooks, these chefs are starting to read recipes written by other AI chefs just like them.

This paper argues that we are seeing a phenomenon called "Model Self-Convergence." In plain English: The AI models are starting to sound more and more like each other, losing their ability to be creative or unique, because they are eating their own tails.

The Experiment: The "Paraphrase Test"

To prove this, the researchers set up a simple experiment:

  1. The Source Material: They took 443 summaries of classic books (like Bleak House) written by real humans. These are the "original recipes."
  2. The Chefs: They asked different versions of ChatGPT (from the old 2022 version to the newest 2025 versions) to rewrite these summaries.
  3. The Twist: They ran the rewrites under two settings:
    • Temperature 0 (The Robot Mode): The AI tries to be as predictable and logical as possible.
    • Temperature 1 (The Creative Mode): The AI is told to be random, creative, and take risks.
  4. The Measurement: They used a special ruler (called the "Similarity Percentage Ratio") to measure how similar the AI's rewrites were to each other. (A code sketch of this setup follows below.)
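For readers who want to see the mechanics, here is a minimal sketch of how such a paraphrase test could be run. It assumes the OpenAI Python SDK and uses Python's `difflib` ratio as a stand-in similarity score; the paper's actual "Similarity Percentage Ratio" metric, exact prompts, and model names are not spelled out in this summary, so treat every name and parameter below as illustrative.

```python
# Sketch of the paraphrase test: rewrite one summary several times at each
# temperature, then score how similar the rewrites are to one another.
# Assumes the OpenAI Python SDK; difflib's ratio is a stand-in for the
# paper's "Similarity Percentage Ratio", which this summary does not define.
from difflib import SequenceMatcher
from itertools import combinations

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rewrite(summary: str, model: str, temperature: float, n_runs: int = 5) -> list[str]:
    """Ask the same model to paraphrase the same summary n_runs times."""
    outputs = []
    for _ in range(n_runs):
        resp = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{
                "role": "user",
                "content": f"Rewrite this book summary in your own words:\n\n{summary}",
            }],
        )
        outputs.append(resp.choices[0].message.content)
    return outputs

def mean_pairwise_similarity(texts: list[str]) -> float:
    """Average similarity over all pairs: 0.0 = all different, 1.0 = identical."""
    pairs = list(combinations(texts, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

summary = "Bleak House follows the endless Chancery suit of Jarndyce and Jarndyce..."
for temp in (0.0, 1.0):
    rewrites = rewrite(summary, model="gpt-4o", temperature=temp)  # model name illustrative
    print(f"temperature {temp}: similarity = {mean_pairwise_similarity(rewrites):.3f}")
```

The paper's finding, restated in these terms: for the older models, the temperature-1 score is much lower than the temperature-0 score, while for the newest models the two scores are nearly identical.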

The Findings: The "Echo Chamber" Effect

Here is what they found, using a simple analogy:

1. The Old AI (2022-2023): The Improvising Jazz Musician
When the older models were asked to be creative (Temperature 1), they were like jazz musicians. If you asked them to play a song, they would improvise. One time they might play it fast, another time slow, with different notes. Even if they were rewriting the same story, the results were very different from each other. They had diversity.

2. The New AI (2024-2025): The Broken Record
When the newest models were asked to be creative, they sounded like a broken record.

  • The Problem: Even when told to be random, the new models kept using the exact same phrases, sentence structures, and word choices.
  • The Result: If you asked the newest AI to rewrite a story five times, the five results were almost identical to each other. They had lost their "stochasticity" (their ability to be random).
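To see why identical outputs at temperature 1 are surprising, it helps to know what temperature actually does. In standard decoding, the model's scores (logits) for each candidate next word are divided by the temperature before being turned into probabilities: temperature 0 collapses to always picking the top word, while temperature 1 samples from the full distribution. Here is a minimal sketch of that standard formula (a generic illustration, not OpenAI's internal decoder):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng) -> int:
    """Standard temperature-scaled sampling: p_i is proportional to exp(logit_i / T)."""
    if temperature == 0:
        return int(np.argmax(logits))       # greedy: always the most likely token
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, 0.1])    # toy scores for a 4-word vocabulary
print([sample_next_token(logits, 0.0, rng) for _ in range(5)])  # always token 0
print([sample_next_token(logits, 1.0, rng) for _ in range(5)])  # should vary run to run
```

The paper's point is that the newest models behave almost as if temperature 1 were temperature 0: the randomness knob is turned all the way up, but the outputs barely change.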

3. The "Self-Convergence" Phenomenon
The paper calls this Model Self-Convergence.

  • Analogy: Imagine a room full of mirrors facing each other. If you stand in the middle, you see infinite reflections of yourself. The newer AI models are like those mirrors. They are trained on data that is increasingly filled with text they themselves generated.
  • Because the internet is now flooded with AI-written text (students using AI for homework, bloggers using AI for articles), the AI is learning from its own output. It's like a student who only studies the answers written by previous students, rather than the original textbook. Eventually, they all start giving the exact same answer.
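A toy simulation makes the feedback loop concrete. Under the (deliberately strong) assumption that each model generation trains only on the previous generation's output, and reproduces it with slightly less spread, the diversity of the data shrinks geometrically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data with genuine diversity (std = 1.0).
data = rng.normal(loc=0.0, scale=1.0, size=1000)

for gen in range(6):
    mu, sigma = data.mean(), data.std()
    print(f"generation {gen}: diversity (std) = {sigma:.3f}")
    # The next generation learns only from the previous generation's output
    # and, like a model that favors its most probable phrasings, reproduces
    # it with slightly less spread. The 0.9 factor is an assumed
    # illustration, not a number from the paper.
    data = rng.normal(loc=mu, scale=0.9 * sigma, size=1000)
```

Each round loses a little of the original spread, and after a few rounds almost all the variety is gone; nothing in the loop ever puts it back.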

Why Does This Matter?

The researchers found something scary: the newer models are getting less creative, not more.

  • The "Gibberish" vs. "Blandness" Distinction: You might have heard of "Model Collapse," where AI starts spitting out total nonsense (gibberish). This paper says we aren't quite there yet. Instead, the AI is becoming bland. It's not making mistakes; it's just becoming a boring echo chamber.
  • The Loss of Innovation: If an AI can't generate diverse ideas, it can't be truly innovative. It will just recycle the same patterns it saw on the internet.
  • The Vicious Cycle: The more we use AI, the more AI text floods the internet. The more AI text floods the internet, the more future AIs get trained on it. The more they get trained on it, the less unique they become.

The Conclusion: A Warning for the Future

The paper concludes that unless we can find a way to train AI only on fresh, human-created data (which is getting harder to find), we risk a future where our digital assistants all sound exactly the same.

The Metaphor:
Think of the internet as a giant library.

  • Before: The library was full of books written by humans. The AI was a student reading all of them, learning to write in many different styles.
  • Now: The library is being filled with photocopies of the AI's own notes. The student is now reading only those photocopies.
  • The Result: The student stops learning new styles and starts repeating the same few sentences over and over, thinking they are the only words that exist.

This paper is a wake-up call: We need to keep the "human voice" in the training data, or the AI will eventually lose its voice entirely.
