How Large Language Models Get Stuck: Early structure with persistent errors

This paper investigates why Large Language Models trained on the BabyLM dataset often fail to learn specific grammatical rules: erroneous biases driven by misleading bigram statistics become entrenched early and persist throughout training, hindering efficient learning.

Alokesh Manna, William Snyder, Whitney Tabor

Published Thu, 12 Ma

Imagine you are teaching a very bright, but incredibly young, child how to speak. You want them to learn the complex rules of grammar, not just by memorizing words, but by understanding the deep structure of sentences.

This paper is like a detailed diary of that teaching process. The researchers watched a Large Language Model (LLM)—a type of AI that predicts the next word in a sentence—as it learned from a "BabyLM" dataset (a smaller, more manageable collection of 100 million words, designed to be more like a child's early exposure to language than the massive libraries used by advanced AIs).

Here is the story of what they found, explained simply:

1. The "Critical Window" of Learning

Think of the AI's training like a construction crew building a skyscraper.

  • The Early Days: In the very beginning, the crew is laying the foundation. They are figuring out the basic shape of the building.
  • The Discovery: The researchers found that the AI makes a crucial decision very early on (around the 5,000th step of its training). At this moment, it decides how to handle specific types of grammar rules.
  • The Problem: For about one-third of the grammar rules, the AI gets it wrong right from the start. It builds a "wrong foundation" for these specific rules. Once that foundation is set, the AI keeps reinforcing that mistake for the rest of its training. It gets "stuck" in an error.

2. The "Bigram Trap" (The Local vs. The Global)

Why does the AI get stuck? The authors propose a theory called the "Bigram Hypothesis."

Imagine the AI is a tourist in a foreign city.

  • The "Bigram" Habit: At first, the tourist only pays attention to the two words right next to each other. They think, "Oh, 'go' is usually followed by 'to', so 'go to' must be a good phrase."
  • The Trap: Sometimes, the two words next to each other look great, but the whole sentence is nonsense.
    • Example: "Patrick is about to talk to." (This sounds smooth locally because "about" and "to" go together often).
    • The Reality: The sentence is actually broken. The correct version is "Patrick is irritating to talk to."
  • The Mistake: Because "about" is a very common word that appears often before "to," the AI's "local map" (the bigram) tells it that the broken sentence is actually the good one. Because the AI learns this early, it builds a habit of choosing the "locally smooth" but "globally wrong" sentence. It's like a student who memorizes that "I seen" sounds better than "I saw" because they hear it often in casual speech, and they never unlearn it.
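The bigram trap above can be sketched in a few lines of code. This is a toy illustration (the corpus and the smoothing floor are made up, not from the paper): a model that scores a sentence only by how often each word follows the previous one can rate the broken sentence higher than the grammatical one, simply because "about to" is such a frequent pair.

```python
from collections import Counter

# Toy corpus in which "about to" is a frequent bigram (illustrative, not the BabyLM data).
corpus = (
    "he is about to leave . she is about to go . "
    "we talked about it . patrick is irritating to talk to ."
).split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_score(sentence):
    """Product of conditional probabilities P(word | previous word),
    with a small floor (0.01) for bigrams never seen in the corpus."""
    words = sentence.lower().split()
    score = 1.0
    for prev, cur in zip(words, words[1:]):
        score *= bigrams.get((prev, cur), 0.01) / unigrams.get(prev, 1)
    return score

broken = "patrick is about to talk to"        # globally ungrammatical
correct = "patrick is irritating to talk to"  # globally grammatical

# The locally smooth but broken sentence wins under these toy counts.
print(bigram_score(broken) > bigram_score(correct))  # prints True
```

The point is not this particular toy model, but the shape of the failure: any learner that leans on two-word statistics will sometimes prefer "locally smooth, globally wrong."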

3. The Two Types of Learners

The researchers sorted the 67 different grammar tests into three groups based on how the AI learned them:

  • The "Correct Early" Group (The Fast Learners): For about half the tests, the AI figured out the rule immediately. It saw the pattern, built the right foundation, and never looked back. These were usually simple rules where the local words matched the global rule (e.g., "The cat" vs. "The cats").
  • The "Stuck Early" Group (The Misled Learners): For about a third of the tests, the AI saw the "local trap" (the bigram) and decided that was the rule. It built a wrong foundation and kept making the same mistake, even after seeing thousands of examples. This includes tricky rules like "Island Constraints" (rules about where you can move words around in a sentence).
  • The "Late Bloomers" (The Slow Learners): A small group of tests where the AI was confused at first, but eventually figured it out later in training.
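The three groups can be pictured as a simple classification of accuracy curves measured at training checkpoints. The sketch below is an illustration of the idea only: the 15% early window echoes the paper's framing, but the 0.6 accuracy threshold and the function name are assumptions, not values from the study.

```python
def classify_curve(accuracies, early_frac=0.15, threshold=0.6):
    """Classify one grammar test's accuracy curve (one value per
    checkpoint) into the three learner types described above.
    The threshold of 0.6 is an illustrative choice."""
    early_end = max(1, int(len(accuracies) * early_frac))
    early_ok = accuracies[early_end - 1] >= threshold  # state at end of early window
    final_ok = accuracies[-1] >= threshold             # state at end of training
    if early_ok and final_ok:
        return "correct early"
    if final_ok:
        return "late bloomer"
    return "stuck early"

# Toy curves: accuracy sampled at 10 evenly spaced checkpoints.
fast  = [0.7, 0.8, 0.85, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
stuck = [0.4, 0.35, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
late  = [0.4, 0.45, 0.5, 0.55, 0.6, 0.7, 0.75, 0.8, 0.8, 0.85]

print(classify_curve(fast))   # prints correct early
print(classify_curve(stuck))  # prints stuck early
print(classify_curve(late))   # prints late bloomer
```

The key observation in the paper is that the "stuck early" curves stay flat: more checkpoints do not move them over the line.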

4. Why This Matters

The paper argues that we shouldn't just keep training these AIs longer and longer, hoping they eventually "get it." If they get stuck on a wrong rule in the first 15% of their training, they might never unlearn it, no matter how much data you feed them.

The Solution?
The authors suggest we need to change how we teach them during that critical early window. Instead of just letting them learn from raw data (where the "bigram traps" are everywhere), we might need to intervene early to steer them away from those local traps and toward the deeper, global rules of language.

The Takeaway

Large Language Models aren't just "stupid" when they fail; they are efficient learners that got tricked early on. They learned a shortcut (looking at just two words) that worked for simple things but led them astray for complex things. To make them smarter, we need to catch them before they build those wrong shortcuts, not just after they've finished the whole course.