Unraveling Syntax: How Language Models Learn Context-Free Grammars

This paper establishes theoretical and empirical evidence that language models learn context-free grammars by decomposing them into parallel "sub-grammars" whose losses sum linearly, revealing that while pretraining improves internal structural alignment, models still struggle with deep recursion regardless of scale.

Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio

Published 2026-03-02

The Big Picture: How Does AI "Learn" Language?

Imagine you are teaching a robot to speak. You know the robot is getting better at talking, but you don't know how it's learning. Is it memorizing every sentence? Is it building a mental map of grammar rules? Or is it just guessing the next word based on patterns?

This paper asks a specific question: Does the AI learn the "easy parts" of a language first, and then move on to the "hard parts," just like a human child?

To find out, the researchers didn't use real human language (which is messy and complicated). Instead, they built a "toy language" using Context-Free Grammars (CFGs). Think of a CFG as a strict set of Lego instructions. You have a box of blocks (words) and a rulebook (grammar) that tells you exactly how to snap them together to build a tower (a sentence).
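To make the Lego analogy concrete, here is a minimal sketch of such a "rulebook" in Python: a tiny toy CFG and a function that snaps the blocks together by expanding rules until only words remain. The rules are hypothetical examples for illustration, not the paper's actual grammar.

```python
import random

# A toy context-free grammar: each nonterminal maps to a list of
# possible expansions (illustrative rules, not the paper's grammar).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "N":  [["cat"], ["dog"]],
    "V":  [["chased"], ["ran"]],
}

def generate(symbol="S"):
    """Expand a symbol by repeatedly applying a randomly chosen rule."""
    if symbol not in GRAMMAR:          # terminal: an actual word
        return [symbol]
    expansion = random.choice(GRAMMAR[symbol])
    words = []
    for sym in expansion:
        words.extend(generate(sym))
    return words

print(" ".join(generate()))  # e.g. "the cat chased the dog"
```

Every string this produces is guaranteed to follow the rulebook, which is exactly why CFGs make a clean laboratory: the "correct" grammar is known in advance.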

The Core Concept: "Sub-grammars" (The Lego Sets)

The authors realized that any complex set of Lego instructions can be broken down into smaller, simpler sets. They call these "sub-grammars."

  • The Analogy: Imagine a giant instruction manual for building a castle.
    • The Whole Grammar: The instructions for the entire castle.
    • The Sub-grammars: The specific instructions for building just the towers, just the walls, or just the drawbridge.
    • Inner Sub-grammar: A tower that is built inside the castle instructions.
    • Outer Sub-grammar: A simplified version of the castle instructions where you only build towers and ignore the walls.

The researchers wanted to see if the AI learns the "towers" and "walls" separately, or if it learns them all at once.
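The castle analogy can be sketched in code. Assuming the definitions above, the inner sub-grammar keeps only the rules reachable from one chosen symbol (just the tower instructions), while the outer sub-grammar treats that symbol as a single pre-built block. The grammar and both helpers below are illustrative assumptions, not the paper's formal construction.

```python
# Toy grammar from the Lego analogy (illustrative, not the paper's).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"]],
    "N":  [["cat"], ["dog"]],
    "V":  [["chased"]],
}

def inner_subgrammar(grammar, start):
    """Keep only rules reachable from `start`: the tower instructions alone."""
    keep, frontier = {}, [start]
    while frontier:
        sym = frontier.pop()
        if sym in grammar and sym not in keep:
            keep[sym] = grammar[sym]
            for rule in grammar[sym]:
                frontier.extend(rule)
    return keep

def outer_subgrammar(grammar, hidden, start="S"):
    """Drop `hidden`'s own rules so it acts as one pre-built block
    (a terminal), then keep whatever is still reachable from the start."""
    pruned = {sym: rules for sym, rules in grammar.items() if sym != hidden}
    return inner_subgrammar(pruned, start)
```

Here, `inner_subgrammar(GRAMMAR, "NP")` yields just the noun-phrase rules, while `outer_subgrammar(GRAMMAR, "NP")` yields the rest of the grammar with "NP" frozen into an opaque block.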

The Big Discovery: The "Parallel Learner"

The Expectation:
Most people think learning happens in steps, like climbing a ladder. You master the bottom rung (simple words) before moving to the top rung (complex sentences). This is how human children learn; they babble simple sounds before forming complex sentences.

The Reality (The Surprise):
The researchers found that small AI models (Transformers) do NOT climb the ladder. Instead, they learn everything in parallel.

  • The Metaphor: Imagine a student taking a math test.
    • A Human Child: Solves the easy addition problems first, gets confident, and then tackles the hard algebra.
    • The AI: Looks at the whole test and tries to solve the addition, the algebra, and the geometry problems all at the same time. It doesn't care about the difficulty order; it just attacks every part of the "sub-grammar" simultaneously.

The paper proves mathematically that the model's total error (its training loss) is simply the sum of the errors on each sub-grammar. Because the loss is additive, the AI is free to shrink all of its mistakes at once.
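A tiny numeric sketch of why additivity matters (the probabilities and tags below are made up for illustration): because the cross-entropy loss of a sentence is a sum over tokens, grouping tokens by the sub-grammar that produced them splits the total loss exactly into per-sub-grammar losses.

```python
import math

# Hypothetical per-token probabilities the model assigns to one sentence,
# each token tagged with the sub-grammar that generated it.
tokens = [
    ("the",    "noun-phrase", 0.9),
    ("cat",    "noun-phrase", 0.5),
    ("chased", "verb-phrase", 0.8),
    ("it",     "verb-phrase", 0.7),
]

# Total cross-entropy loss: sum of -log(p) over all tokens.
total_loss = sum(-math.log(p) for _, _, p in tokens)

# The same sum, bucketed by sub-grammar.
per_subgrammar = {}
for _, sub, p in tokens:
    per_subgrammar[sub] = per_subgrammar.get(sub, 0.0) - math.log(p)

# The whole-sentence loss equals the sum of the sub-grammar losses,
# so gradient descent is free to reduce every piece at the same time.
assert abs(total_loss - sum(per_subgrammar.values())) < 1e-12
```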

The "Recursion" Problem: The Infinite Hallway

The paper also looked at recursion. In grammar, this is when a rule refers back to itself.

  • Example: "The cat that the dog chased ran away." (A clause is nested inside the main sentence.)

  • The AI's Struggle: The researchers found that while the AI is great at short sentences, it gets confused when the sentence gets deep.

  • The Analogy: Imagine a hallway with mirrors on both ends.

    • If you look two reflections deep, you see a clear image.
    • Ten reflections deep, you see a long tunnel.
    • A hundred reflections deep, the AI gets dizzy. It can handle a short tunnel (shallow recursion), but if the tunnel goes too deep, it loses track of where it started.
    • Crucial Finding: The AI struggles with depth, not length. It can handle a very long sentence if the structure is simple, but it fails at a short sentence if the structure is deeply nested (like a Russian nesting doll inside a doll inside a doll).
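The depth-versus-length distinction can be sketched with brackets standing in for clause boundaries (a simplification, not the paper's actual grammars): a long flat string has low nesting depth, while a much shorter string can be nested far deeper.

```python
def max_depth(tokens):
    """Track bracket nesting depth in a flat token stream,
    with '(' / ')' standing in for entering/leaving a clause."""
    depth = peak = 0
    for tok in tokens:
        if tok == "(":
            depth += 1
            peak = max(peak, depth)
        elif tok == ")":
            depth -= 1
    return peak

# Long but flat: many clauses side by side (easy for the model).
flat = list("()" * 50)           # 100 tokens, nesting depth 1
# Short but deep: clauses inside clauses (hard for the model).
deep = list("(" * 5 + ")" * 5)   # 10 tokens, nesting depth 5

assert len(flat) > len(deep)
assert max_depth(flat) < max_depth(deep)
```

The finding is that difficulty tracks `max_depth`, not `len`: the 100-token flat string is the easy case, the 10-token nested one is the hard case.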

The "Pre-training" Experiment: The Training Wheels

The researchers asked: "If we teach the AI just the 'towers' first, will it learn the whole 'castle' faster?" This is called Curriculum Learning (learning from easy to hard).

  • The Result:
    • For tiny AI models, yes! Teaching them the sub-parts first acted like "training wheels." It helped them learn the whole thing better.
    • For bigger AI models, it didn't matter much. They were smart enough to learn the whole castle without the training wheels.
    • The Hidden Benefit: Even when it didn't make the AI smarter, pre-training changed how the AI thought. It made the AI's internal "brain" organize the information in a way that matched the grammar's structure perfectly. It was like organizing a messy closet by color before putting the clothes away; the clothes were the same, but the system was much more logical.
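The "training wheels" schedule can be sketched as a two-phase loop. Everything below (`sample_subgrammar`, `sample_full_grammar`, the `train_step` callback) is a hypothetical stand-in for the paper's setup, not its actual code.

```python
import random

def sample_subgrammar():
    """Phase-1 data: sentences from one sub-grammar (just the 'towers')."""
    return ["the", random.choice(["cat", "dog"])]

def sample_full_grammar():
    """Phase-2 data: sentences from the whole grammar (the full 'castle')."""
    return sample_subgrammar() + [random.choice(["ran", "chased"])]

def train(model, train_step, steps_pretrain=1000, steps_full=5000):
    """Curriculum: sub-grammar batches first, then full-grammar batches."""
    for _ in range(steps_pretrain):      # training wheels on
        train_step(model, sample_subgrammar())
    for _ in range(steps_full):          # training wheels off
        train_step(model, sample_full_grammar())
    return model
```

The baseline the paper compares against would simply skip phase 1 and draw every batch from the full grammar.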

Summary: What Does This Mean for Us?

  1. AI isn't a child: Unlike humans, who learn step-by-step, small AI models learn all the rules of a language system simultaneously.
  2. The "Depth" Limit: Even the smartest AI models today struggle with things that go "deep" (like complex nested sentences), even if they are short. They are great at long, flat lists but bad at deep, nested structures.
  3. Teaching Strategy: If you are training a very small AI, teaching it the simple parts first helps. Big AIs are powerful enough to figure the whole thing out on their own, though teaching them the structure first still helps them "think" more logically.

In a nutshell: The paper shows us that AI is a "parallel processor" that learns the whole puzzle at once, but it still gets dizzy if the puzzle pieces are stacked too deep on top of each other.
