Unraveling Syntax: How Language Models Learn Context-Free Grammars

This paper establishes theoretical and empirical evidence that language models learn context-free grammars by decomposing them into parallel "sub-grammars" whose losses sum linearly, revealing that while pretraining improves internal structural alignment, models still struggle with deep recursion regardless of scale.

Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio

Published 2026-03-02

The Big Picture: How Does AI "Learn" Language?

Imagine you are teaching a robot to speak. You know the robot is getting better at talking, but you don't know how it's learning. Is it memorizing every sentence? Is it building a mental map of grammar rules? Or is it just guessing the next word based on patterns?

This paper asks a specific question: Does the AI learn the "easy parts" of a language first, and then move on to the "hard parts," just like a human child?

To find out, the researchers didn't use real human language (which is messy and complicated). Instead, they built a "toy language" using Context-Free Grammars (CFGs). Think of a CFG as a strict set of Lego instructions. You have a box of blocks (words) and a rulebook (grammar) that tells you exactly how to snap them together to build a tower (a sentence).
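To make the Lego analogy concrete, here is a minimal sketch of such a "rulebook" in Python: a tiny toy CFG and a function that snaps the blocks together by expanding rules until only words remain. The rules are hypothetical examples for illustration, not the paper's actual grammar.

```python
import random

# A toy context-free grammar: each nonterminal maps to a list of
# possible expansions (illustrative rules, not the paper's grammar).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "N":  [["cat"], ["dog"]],
    "V":  [["chased"], ["ran"]],
}

def generate(symbol="S"):
    """Expand a symbol by repeatedly applying a randomly chosen rule."""
    if symbol not in GRAMMAR:          # terminal: an actual word
        return [symbol]
    expansion = random.choice(GRAMMAR[symbol])
    words = []
    for sym in expansion:
        words.extend(generate(sym))
    return words

print(" ".join(generate()))  # e.g. "the cat chased the dog"
```

Every string this produces is guaranteed to follow the rulebook, which is exactly why CFGs make a clean laboratory: the "correct" grammar is known in advance.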

The Core Concept: "Sub-grammars" (The Lego Sets)

The authors realized that any complex set of Lego instructions can be broken down into smaller, simpler sets. They call these "sub-grammars."

  • The Analogy: Imagine a giant instruction manual for building a castle.
    • The Whole Grammar: The instructions for the entire castle.
    • The Sub-grammars: The specific instructions for building just the towers, just the walls, or just the drawbridge.
    • Inner Sub-grammar: A tower that is built inside the castle instructions.
    • Outer Sub-grammar: A simplified version of the castle instructions where you only build towers and ignore the walls.

The researchers wanted to see if the AI learns the "towers" and "walls" separately, or if it learns them all at once.
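The castle analogy can be sketched in code. Assuming the definitions above, the inner sub-grammar keeps only the rules reachable from one chosen symbol (just the tower instructions), while the outer sub-grammar treats that symbol as a single pre-built block. The grammar and both helpers below are illustrative assumptions, not the paper's formal construction.

```python
# Toy grammar from the Lego analogy (illustrative, not the paper's).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"]],
    "N":  [["cat"], ["dog"]],
    "V":  [["chased"]],
}

def inner_subgrammar(grammar, start):
    """Keep only rules reachable from `start`: the tower instructions alone."""
    keep, frontier = {}, [start]
    while frontier:
        sym = frontier.pop()
        if sym in grammar and sym not in keep:
            keep[sym] = grammar[sym]
            for rule in grammar[sym]:
                frontier.extend(rule)
    return keep

def outer_subgrammar(grammar, hidden, start="S"):
    """Drop `hidden`'s own rules so it acts as one pre-built block
    (a terminal), then keep whatever is still reachable from the start."""
    pruned = {sym: rules for sym, rules in grammar.items() if sym != hidden}
    return inner_subgrammar(pruned, start)
```

Here, `inner_subgrammar(GRAMMAR, "NP")` yields just the noun-phrase rules, while `outer_subgrammar(GRAMMAR, "NP")` yields the rest of the grammar with "NP" frozen into an opaque block.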

The Big Discovery: The "Parallel Learner"

The Expectation:
Most people think learning happens in steps, like climbing a ladder. You master the bottom rung (simple words) before moving to the top rung (complex sentences). This is how human children learn; they babble simple sounds before forming complex sentences.

The Reality (The Surprise):
The researchers found that small AI models (Transformers) do NOT climb the ladder. Instead, they learn everything in parallel.

  • The Metaphor: Imagine a student taking a math test.
    • A Human Child: Solves the easy addition problems first, gets confident, and then tackles the hard algebra.
    • The AI: Looks at the whole test and tries to solve the addition, the algebra, and the geometry problems all at the same time. It doesn't care about the difficulty order; it just attacks every part of the "sub-grammar" simultaneously.

The paper proves mathematically that the model's total error (its training loss) is simply the sum of the errors on each sub-grammar. Because the loss is additive, the AI is free to shrink all of its mistakes at once.
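A tiny numeric sketch of why additivity matters (the probabilities and tags below are made up for illustration): because the cross-entropy loss of a sentence is a sum over tokens, grouping tokens by the sub-grammar that produced them splits the total loss exactly into per-sub-grammar losses.

```python
import math

# Hypothetical per-token probabilities the model assigns to one sentence,
# each token tagged with the sub-grammar that generated it.
tokens = [
    ("the",    "noun-phrase", 0.9),
    ("cat",    "noun-phrase", 0.5),
    ("chased", "verb-phrase", 0.8),
    ("it",     "verb-phrase", 0.7),
]

# Total cross-entropy loss: sum of -log(p) over all tokens.
total_loss = sum(-math.log(p) for _, _, p in tokens)

# The same sum, bucketed by sub-grammar.
per_subgrammar = {}
for _, sub, p in tokens:
    per_subgrammar[sub] = per_subgrammar.get(sub, 0.0) - math.log(p)

# The whole-sentence loss equals the sum of the sub-grammar losses,
# so gradient descent is free to reduce every piece at the same time.
assert abs(total_loss - sum(per_subgrammar.values())) < 1e-12
```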

The "Recursion" Problem: The Infinite Hallway

The paper also looked at recursion. In grammar, this is when a rule refers back to itself.

  • Example: "The cat that the dog chased ran away." (A clause is nested inside the main sentence.)

  • The AI's Struggle: The researchers found that while the AI is great at short sentences, it gets confused when the sentence gets deep.

  • The Analogy: Imagine a hallway with mirrors on both ends.

    • If you look two reflections deep, you see a clear image.
    • Ten reflections deep, you see a long tunnel.
    • A hundred reflections deep, the AI gets dizzy. It can handle a short tunnel (shallow recursion), but if the tunnel goes too deep, it loses track of where it started.
    • Crucial Finding: The AI struggles with depth, not length. It can handle a very long sentence if the structure is simple, but it fails at a short sentence if the structure is deeply nested (like a Russian nesting doll inside a doll inside a doll).
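The depth-versus-length distinction can be sketched with brackets standing in for clause boundaries (a simplification, not the paper's actual grammars): a long flat string has low nesting depth, while a much shorter string can be nested far deeper.

```python
def max_depth(tokens):
    """Track bracket nesting depth in a flat token stream,
    with '(' / ')' standing in for entering/leaving a clause."""
    depth = peak = 0
    for tok in tokens:
        if tok == "(":
            depth += 1
            peak = max(peak, depth)
        elif tok == ")":
            depth -= 1
    return peak

# Long but flat: many clauses side by side (easy for the model).
flat = list("()" * 50)           # 100 tokens, nesting depth 1
# Short but deep: clauses inside clauses (hard for the model).
deep = list("(" * 5 + ")" * 5)   # 10 tokens, nesting depth 5

assert len(flat) > len(deep)
assert max_depth(flat) < max_depth(deep)
```

The finding is that difficulty tracks `max_depth`, not `len`: the 100-token flat string is the easy case, the 10-token nested one is the hard case.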

The "Pre-training" Experiment: The Training Wheels

The researchers asked: "If we teach the AI just the 'towers' first, will it learn the whole 'castle' faster?" This is called Curriculum Learning (learning from easy to hard).

  • The Result:
    • For tiny AI models, yes! Teaching them the sub-parts first acted like "training wheels." It helped them learn the whole thing better.
    • For bigger AI models, it didn't matter much. They were smart enough to learn the whole castle without the training wheels.
    • The Hidden Benefit: Even when it didn't make the AI smarter, pre-training changed how the AI thought. It made the AI's internal "brain" organize the information in a way that matched the grammar's structure perfectly. It was like organizing a messy closet by color before putting the clothes away; the clothes were the same, but the system was much more logical.
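The "training wheels" schedule can be sketched as a two-phase loop. Everything below (`sample_subgrammar`, `sample_full_grammar`, the `train_step` callback) is a hypothetical stand-in for the paper's setup, not its actual code.

```python
import random

def sample_subgrammar():
    """Phase-1 data: sentences from one sub-grammar (just the 'towers')."""
    return ["the", random.choice(["cat", "dog"])]

def sample_full_grammar():
    """Phase-2 data: sentences from the whole grammar (the full 'castle')."""
    return sample_subgrammar() + [random.choice(["ran", "chased"])]

def train(model, train_step, steps_pretrain=1000, steps_full=5000):
    """Curriculum: sub-grammar batches first, then full-grammar batches."""
    for _ in range(steps_pretrain):      # training wheels on
        train_step(model, sample_subgrammar())
    for _ in range(steps_full):          # training wheels off
        train_step(model, sample_full_grammar())
    return model
```

The baseline the paper compares against would simply skip phase 1 and draw every batch from the full grammar.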

Summary: What Does This Mean for Us?

  1. AI isn't a child: Unlike humans, who learn step-by-step, small AI models learn all the rules of a language system simultaneously.
  2. The "Depth" Limit: Even the smartest AI models today struggle with things that go "deep" (like complex nested sentences), even if they are short. They are great at long, flat lists but bad at deep, nested structures.
  3. Teaching Strategy: If you are training a very small AI, teaching it the simple parts first helps. Big AIs are powerful enough to figure the whole thing out on their own, though teaching them the structure first still helps them "think" more logically.

In a nutshell: The paper shows us that AI is a "parallel processor" that learns the whole puzzle at once, but it still gets dizzy if the puzzle pieces are stacked too deep on top of each other.
