Here is an explanation of the paper "N-gram-like Language Models Predict Reading Time Best," translated into simple, everyday language with some creative analogies.
The Big Idea: The "Too Good" Paradox
Imagine you are teaching a robot to read a book. You want the robot to predict how long it takes a human to read a specific word.
For a long time, scientists thought the rule was simple: "The smarter the robot, the better it predicts human reading." They assumed that if you gave a language model (like the AI behind this chat) more data and more brainpower, it would get closer to how humans think.
But recently, researchers noticed something weird. When these AI models became too smart and too good at predicting the next word, they actually started getting worse at predicting how long humans take to read. It's like a student who memorized the entire textbook so perfectly that they forgot how a normal person actually learns and stumbles over new words.
This paper asks: Why does getting "smarter" make the AI worse at mimicking human reading speeds?
The Solution: The "Street Smarts" vs. "Book Smarts" Theory
The authors propose a surprising answer: Humans don't read like super-computers; we read like people relying on simple patterns.
Think of reading a sentence like walking through a crowded city:
- The Super-Computer (Modern AI): It looks at the entire city map, the history of the neighborhood, the weather, and the traffic patterns to predict exactly where you will step next. It's incredibly accurate, but it's too complex for how your brain actually works in the moment.
- The Human Reader: You mostly look at the last few steps you took. You rely on immediate habits. If you just said "The cat sat on the...", your brain is already screaming "MAT!" because that's the most common pattern you've seen a million times. You aren't doing a deep philosophical analysis of the city; you are just reacting to the immediate, simple pattern.
The paper argues that reading time is driven by these simple, immediate patterns (called n-grams), not by the deep, complex understanding that advanced AIs have.
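To make "n-gram" concrete: an n-gram model just counts how often short word sequences occur and turns those counts into probabilities. The surprise a word causes (its "surprisal") is what gets compared against reading times. Here is a minimal toy sketch — the tiny corpus and the function name are invented for illustration, not taken from the paper:

```python
import math
from collections import Counter

# A toy corpus; a real model would use millions of words.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count single words (unigrams) and adjacent pairs (bigrams).
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_surprisal(prev, word):
    """Surprisal in bits: -log2 P(word | prev), from raw counts."""
    return -math.log2(bigrams[(prev, word)] / unigrams[prev])

# "the" appears 4 times; "cat" follows it once, so P(cat|the) = 1/4.
print(bigram_surprisal("the", "cat"))  # → 2.0 bits
```

Low surprisal ("mat" after "The cat sat on the...") means a fast, easy read; high surprisal means the eyes slow down.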
The Experiments: Testing the Theory
The researchers ran three experiments to test this theory, using different tools and datasets. Here is the breakdown:
1. The "Simple Pattern" Test (Experiment 1)
They took simple statistical tools (which just count how often words appear next to each other) and compared them to complex AI models.
- The Finding: The simple tools that used the least context — a word's own frequency (the 1-gram) or just the single preceding word (the 2-gram) — were the best at predicting how fast humans read.
- The Twist: As the models used longer and longer chains of context (3-gram, 4-gram, and 5-gram models), the fit to human reading times got worse.
- The Analogy: It's like guessing what a friend will order for dinner. If you know they usually order "Pizza" after "Friday," you are right 90% of the time. If you try to guess based on their entire life history, the weather, and their mood, you might overthink it and get it wrong.
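The comparison behind Experiment 1 boils down to a regression: compute each model's surprisal for every word, regress human reading times on it, and see which predictor explains the most variance. A hedged sketch with entirely made-up numbers (the data, coefficients, and helper function are illustrative, not the paper's actual analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical per-word surprisal values and simulated reading times
# that depend on them linearly, plus noise (all numbers invented).
surprisal = rng.exponential(2.0, n)
reading_time = 180 + 25 * surprisal + rng.normal(0, 30, n)

def r_squared(x, y):
    """Fraction of reading-time variance explained by one predictor."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1 - residuals.var() / y.var()

print(round(r_squared(surprisal, reading_time), 2))
```

The paper's finding, in these terms: the R² for 1-gram and 2-gram surprisal beat the R² for surprisal from large, fully trained language models.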
2. The "Training Journey" Test (Experiment 2)
They watched AI models (specifically the Pythia family) as they were being trained. They checked the models at different stages: when they were "babies" (early training) and when they were "adults" (fully trained).
- The Finding: The models were best at predicting human reading times when they were "babies"—specifically, when they were just starting to learn simple word pairs (bigrams) and triplets (trigrams).
- The Divergence: As the models kept training and became "super-smart," they stopped mimicking human reading speeds. They started predicting words that were statistically perfect but psychologically unnatural for a human reader.
3. The "Universal Truth" Test (Experiment 3)
They repeated the test with different types of AI models and different reading datasets (including bilingual readers) to make sure the results weren't a fluke.
- The Finding: The pattern held up everywhere. Any AI model that acted more like a simple pattern-counter was better at predicting human reading speeds than a model that acted like a complex genius.
Why Does This Matter?
This paper solves a mystery in the world of AI and psychology.
- For AI Developers: It tells us that making a model "bigger" and "smarter" doesn't always make it more "human-like." Sometimes, to understand human behavior, you actually need to simplify the model's view of the world.
- For Psychologists: It suggests that when we read, our eyes and brains are reacting to local, surface-level statistics (what just happened 1 or 2 words ago) rather than deep, complex context. We are "pattern matchers" first and "meaning makers" second when it comes to eye movements.
The Final Takeaway
Imagine you are trying to predict how a child will react to a magic trick.
- If you use a super-computer that analyzes the magician's muscle tension, the lighting, and the history of magic, you might predict the trick perfectly, but you won't predict the child's surprise.
- If you use a simple rule ("Kids are always surprised when something disappears"), you might miss the details, but you will perfectly predict the child's reaction time.
The authors found that human reading is the child. We react to the simple, immediate patterns. The most advanced AI models are the super-computers; they are so good at the "deep" stuff that they forget to account for the simple, immediate reactions that actually drive our eyes across the page.
In short: To predict how fast humans read, you don't need a genius AI. You need a model that thinks a little bit more like a simple pattern-recognition machine.