Manipulating language models' training data to study syntactic constraint learning: the case of English passivization

This study demonstrates that neural network language models can learn English passivization exceptions by leveraging both frequency-based entrenchment and semantic affectedness from their training data, validating the utility of manipulating training corpora to investigate the sources of linguistic evidence in acquisition.

Cara Su-Yi Leong, Tal Linzen

Published 2026-03-05

Imagine you are teaching a robot how to speak English. You give it a massive library of books, articles, and websites to read. The robot is smart; it learns that most sentences follow a pattern. For example, if you can say "The dog chased the cat," the robot learns you can usually flip it around to say "The cat was chased by the dog." This is called the passive voice.

But English is tricky. Sometimes, you can flip the sentence, and sometimes you can't.

  • Good: "The writer defenestrated the editor" → "The editor was defenestrated by the writer." (Weird word, but it works!)
  • Bad: "The meeting lasted one hour" → *"One hour was lasted by the meeting." (This sounds wrong to us.)

The big mystery is: How does a learner (a child or a robot) know which verbs are "flip-able" and which ones aren't? They never get a teacher to say, "Hey, you can't use 'last' in the passive voice." They have to figure it out just by listening.

This paper is like a detective story where the researchers use a neural network (a type of AI) as a test subject to solve this mystery. Here's the breakdown of their investigation using simple analogies:

The Two Suspects: "Frequency" vs. "Meaning"

The researchers had two main theories about how the robot (and humans) figure out these rules:

  1. The "Frequency" Suspect (Entrenchment):

    • The Theory: If you hear a word used in the active voice (e.g., "The meeting lasted...") a million times, but you never hear it in the passive voice ("One hour was lasted..."), your brain eventually says, "Okay, I guess that's just not how we do things." It's like a path in a park. If everyone walks the main path, but no one ever walks the side path, eventually the side path disappears into the grass. You assume it's not a real path.
    • The Metaphor: Imagine a restaurant with no written menu. Day after day, you hear everyone around you order "Chicken" and get it, but you never once hear anyone order "Steak." Eventually you conclude the kitchen simply doesn't make steak, even though nobody ever told you so.
  2. The "Meaning" Suspect (Affectedness):

    • The Theory: The passive voice usually happens when something gets changed or gets hurt by an action. If you "break" a vase, the vase is affected. If you "last" an hour, the hour doesn't really get changed or hurt; it just exists. The theory says the robot learns that if a verb doesn't involve "affecting" something, it can't be passive.
    • The Metaphor: Think of a paintball game. If you get hit (affected), you are the target. If you just stand there and watch the game (not affected), you aren't the target. The passive voice is like saying, "I was hit." You can't say, "I was watched" in the same way if the "watching" didn't change you.

The Experiment: Cooking with Different Ingredients

To test which suspect was guilty, the researchers didn't just watch the robot; they rewrote the robot's library (its training data) to see how it reacted.

Experiment 1: Does the robot sound like a human?
First, they checked if the robot actually understood English grammar. They asked it to judge sentences.

  • Result: Yes! The robot's judgments were almost identical to human judgments. It knew that "The cat was chased" is good but "One hour was lasted" is bad, even though nothing in its library ever said so directly. This confirmed the robot was a fair stand-in for a human learner.
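How do you "ask" a language model whether a sentence sounds good? A standard trick is to compare the probabilities it assigns to a minimal pair of sentences. The paper uses a neural network; as a toy stand-in, here is a tiny add-one-smoothed bigram language model (all corpus sentences and function names are illustrative, not the paper's actual setup) showing how an attested pattern ends up scoring higher than an unattested one:

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def logprob(sentence, uni, bi, vocab_size):
    """Add-one-smoothed bigram log-probability of a sentence."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bi[(a, b)] + 1) / (uni[a] + vocab_size))
        for a, b in zip(toks, toks[1:])
    )

# A toy "library": 'chased' appears in both voices, 'lasted' only in the active.
corpus = [
    "the dog chased the cat",
    "the cat was chased by the dog",
    "the meeting lasted one hour",
    "the party lasted one hour",
]
uni, bi = train_bigram(corpus)
V = len(uni)

good = logprob("the cat was chased by the dog", uni, bi, V)
bad = logprob("one hour was lasted by the meeting", uni, bi, V)
```

Because bigrams like "was lasted" never occur in the training data, the unattested passive receives a much lower score; the paper's judgments work on the same principle, only with a neural model instead of counts.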

Experiment 2A: The "Frequency" Test
They took a verb that humans can use in the passive voice (like "drop") and edited the library so that the robot saw "drop" in the active voice 100 times for every 1 time it saw it in the passive voice (mimicking the pattern of "last").

  • Result: The robot started thinking "drop" was weird in the passive voice, just like "last."
  • Conclusion: Frequency matters. Hearing a verb constantly in the active voice while never hearing its passive is itself evidence that the passive is off-limits.
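"Editing the library" here means filtering the training corpus so the target verb's passive uses are drastically downsampled relative to its active uses. A minimal sketch of that manipulation (the function name, the 100:1 ratio from the text above, and the string cues for detecting voice are all illustrative; the paper's pipeline would use proper syntactic annotation):

```python
import random

def enforce_ratio(sentences, passive_cue, active_cue, ratio=100, seed=0):
    """Downsample sentences matching `passive_cue` so that sentences
    matching `active_cue` outnumber them at least `ratio`:1.
    Cues are naive substrings; a real pipeline would parse for voice."""
    actives = [s for s in sentences if active_cue in s]
    passives = [s for s in sentences if passive_cue in s]
    others = [s for s in sentences if active_cue not in s and passive_cue not in s]
    n_keep = min(len(actives) // ratio, len(passives))
    rng = random.Random(seed)
    kept_passives = rng.sample(passives, n_keep)
    return others + actives + kept_passives

# Make "drop" look like "last": 200 actives, passives cut down to 2.
corpus = (["the boy dropped the ball"] * 200
          + ["the ball was dropped by the boy"] * 50)
filtered = enforce_ratio(corpus, "was dropped", "dropped the")
```

Training on `filtered` instead of `corpus` is what pushes the model toward treating "drop" as a passive-resistant verb.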

Experiment 2B: The "Meaning" Test
They took a verb that humans can't use in the passive voice (like "last") and edited the library so that "last" appeared in sentences where it was "affecting" things (like "The storm lasted the house down"—a made-up sentence where the house gets damaged).

  • Result: The robot started thinking "last" was more acceptable in the passive voice when it was used in these "damaging" contexts.
  • Conclusion: Meaning matters too. If the context suggests the object is being affected, the robot is more willing to use the passive voice.

Experiment 3: The "New Word" Test
To be super sure, they invented a brand-new verb (let's call it "Zorp") that didn't exist in the library at all, then varied two things about the sentences it appeared in:

  1. Frequency: How often the robot saw "Zorp" in active sentences (up to 1,000 times), with zero passive uses.
  2. Meaning: Whether "Zorp" appeared in sentences where it "hurt" things (High Affectedness) or just "existed" near things (Low Affectedness).
  • Result:
    • The more often the robot saw "Zorp" in the active voice, the less acceptable it found "Zorp" in the passive (Frequency wins).
    • The more often "Zorp" appeared "hurting" things, the more acceptable the robot found its passive (Meaning wins).
    • Crucially: these two factors worked independently. They didn't cancel each other out; their effects simply added up.
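Building training data for a made-up verb comes down to generating active-voice sentences from templates that either frame the object as changed (high affectedness) or as merely present (low affectedness). A hypothetical sketch ("zorp" is this post's own placeholder; the subjects, objects, and sentence frames are invented for illustration):

```python
import random

SUBJECTS = ["the storm", "the machine", "the crowd"]
OBJECTS = ["the fence", "the window", "the bridge"]

def make_novel_verb_data(verb, n_active, affected, seed=0):
    """Generate `n_active` active-voice sentences for a novel verb.
    `affected=True` frames the object as damaged by the event (high
    affectedness); False frames it as merely nearby (low affectedness)."""
    rng = random.Random(seed)
    frame = ("{s} {v}ed {o} to pieces" if affected
             else "{s} {v}ed near {o}")
    return [frame.format(s=rng.choice(SUBJECTS), v=verb, o=rng.choice(OBJECTS))
            for _ in range(n_active)]

# One high-frequency, high-affectedness condition; one low/low condition.
high = make_novel_verb_data("zorp", 1000, affected=True)
low = make_novel_verb_data("zorp", 10, affected=False)
```

Crossing the frequency counts with the two frame types yields the experimental conditions; after training on each mix, the model's willingness to accept "was zorped" can be measured as in Experiment 1.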

The Big Takeaway

The researchers found that both suspects are guilty.

  • Frequency (how often you see a pattern) helps you learn the rules.
  • Meaning (whether the action changes the object) helps you understand why the rule exists.

It's not just one or the other. It's like learning to drive: you learn the rules because you see other cars following them (frequency), but you also understand the logic behind the rules (safety/meaning).

Why Does This Matter?

This study is a big deal because it shows that neural networks (AI) can learn complex language rules from the same kinds of "clues" in the environment that humans have access to. It also gives scientists a new way to study human learning: instead of trying to control what a human child hears (which is impossible), we can control what a robot hears, see how it learns, and use that to understand how our own brains might be working.

In short: We learn language by counting patterns and understanding meanings, and AI is finally smart enough to show us how that works.