Modelling the Diachronic Emergence of Phoneme Frequency Distributions

Imagine a language as a giant, bustling marketplace where every sound (like "b," "k," or "sh") is a different type of fruit. Some fruits are everywhere (like apples), while others are rare (like durians).

For a long time, linguists have noticed two strange rules about how these "fruits" are distributed in languages around the world:

The "Rich Get Richer" Pattern: A few sounds are used constantly, while most are used very rarely. This creates a specific curve on a graph.
The "Balance" Mystery: Languages with many different sounds (a huge fruit market) tend to have a very predictable, low-energy distribution. Languages with few sounds have a more chaotic, high-energy distribution. It's as if the universe is trying to "balance the books" between the number of sounds and how they are used.

The big question this paper asks is: Did these patterns happen because languages are smart and trying to optimize themselves? Or did they just happen by accident over thousands of years of history?

The authors, Fermín Moscoso del Prado Martín and Suchir Salhan, decided to find out by building a time machine for sounds.

The Time Machine Experiment

They created a computer simulation that acts like a "sound evolution" video game. They started with a bunch of imaginary languages, each with a standard set of sounds. Then, they let time run forward, introducing random "accidents" to the sounds, just like real history:

Splitting: One sound accidentally breaks into two (like a cell dividing).
Merging: Two sounds accidentally crash into one (like two cars merging into a single lane).
Shifting: A sound changes slightly into another existing sound.

They ran three different versions of this game to see which one matched real human languages.

Level 1: The Naïve Chaos (The "Roll the Dice" Model)

The Setup: They let the sounds change completely at random. Every sound had an equal chance of splitting or merging, like rolling a die to decide what happens next.
The Result: The "fruit market" looked okay at first glance: it had that "rich get richer" curve. But it failed the second test. In this simulation, languages with more sounds ended up being more chaotic, not less. It was the opposite of real life. Also, the number of sounds in these languages kept growing or shrinking wildly, with no limit. It was like a market that kept adding new exotic fruits forever or losing them all until only two remained.

Level 2: The "Functional Load" Twist (The "Popular vs. Obscure" Model)

The Setup: The researchers added a rule based on reality: Rare sounds are more fragile. If a sound is rarely used, it's more likely to disappear or merge with another. Common sounds are "sturdy" and stick around.
The Result: This made the "fruit market" even more skewed. The common fruits became super common, and the rare ones vanished. But it still failed the second test. The relationship between the number of sounds and the "chaos" of the distribution was still wrong, and the languages were still drifting apart in size without any limit.

Level 3: The "Goldilocks" Stabilizer (The Winning Model)

The Setup: The researchers realized that real languages don't have infinite sounds or just two sounds. They tend to hover around a "sweet spot" (about 30–40 sounds for most languages). They added a rule: If a language gets too big, it's harder to add new sounds. If it gets too small, it's harder to lose sounds. It's like a thermostat that tries to keep the room at a comfortable temperature.
The Result: Bingo! This simple rule fixed everything.

The languages settled into a realistic size range (no more infinite markets).
The "fruit distribution" closely resembled that of real languages.
Crucially: It recreated the "Balance Mystery." Languages with more sounds naturally became more predictable, and languages with fewer sounds were more chaotic.

The Big Takeaway: Accidental Order

The most exciting part of this discovery is why it happened.

The authors found that they didn't need to program the languages to be "smart" or "efficient." They didn't need a rule saying, "Hey, if you have too many sounds, you must organize them better!"

Instead, the "balance" emerged naturally as a side effect. It's like a crowded dance floor:

If the room is small (few sounds), people bump into each other constantly (high chaos).
If the room is huge (many sounds), people spread out and move more predictably (low chaos).

The "Compensation" we see in languages isn't necessarily a conscious effort by speakers to optimize their speech. It's just the natural result of sounds evolving randomly over time, while being gently nudged back toward a comfortable, average size.

In short: The universe didn't need a master planner to organize the sounds of human language. It just needed a little bit of random history and a gentle "thermostat" to keep things from getting too wild. The patterns we see are the natural footprints of time.

Here is a detailed technical summary of the paper "Modelling the Diachronic Emergence of Phoneme Frequency Distributions" by Moscoso del Prado Martín and Salhan.

1. Problem Statement

The paper addresses the origin of robust statistical regularities observed in phoneme frequency distributions across human languages. Two primary empirical patterns are noted:

Exponential-Tailed Distributions: Phoneme rank-frequency plots exhibit exponential tails rather than the power-law tails often seen in word frequency distributions.
Negative Correlation (Compensation Hypothesis): There is a negative correlation between a language's Phonemic Inventory Size (PIS) and the relative entropy of its phoneme distribution. This suggests that as the number of phonemes increases, phoneme usage becomes more skewed and therefore more predictable (a phenomenon often interpreted as "compensation" or functional optimization).
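
To make the second pattern concrete, here is a minimal Python sketch of how relative entropy could be computed for a phoneme distribution. It assumes relative entropy means Shannon entropy divided by its maximum, $\log_2 V$, for an inventory of $V$ phonemes; the paper's exact normalization is not stated in this summary, and the function name `relative_entropy` is illustrative only.

```python
import math

def relative_entropy(probs):
    """Shannon entropy of `probs` divided by log2(V), its maximum for V outcomes.

    Assumption: this normalization is one common reading of "relative entropy";
    it has not been checked against the paper's own definition.
    """
    v = len(probs)
    if v < 2:
        return 0.0  # a one-phoneme inventory carries no uncertainty
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return h / math.log2(v)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(relative_entropy(uniform))  # 1.0: a uniform distribution is maximally unpredictable
print(relative_entropy(skewed))   # below 1.0: a skewed distribution is more predictable
```

Under this reading, the empirical pattern says that languages with larger $V$ tend to score lower on this measure.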

The Core Question: Do these patterns arise from explicit evolutionary optimization mechanisms (e.g., languages actively balancing complexity), or are they emergent properties (epiphenomena) resulting from the stochastic historical processes of sound change?

2. Methodology

The authors propose a stochastic model of phonological change based on Hoenigswald's (1965) typology of sound changes. The model simulates the diachronic evolution of phoneme inventories over discrete time steps.

The General Model

State: At time $\tau$, a language has $V_\tau$ phonemes with a probability distribution vector $p_\tau$.
Events: At each step, one of three change types occurs, sampled from an alphabet $\Sigma = \{p, s, m\}$:
1. Primary Split (Conditioned Merger): A portion of a phoneme's probability mass shifts to an existing phoneme. Inventory size ($V$) remains constant (unless the source is fully absorbed).
2. Secondary Split (Phonemic Split): A portion of a phoneme's mass creates a new phoneme. $V$ increases by 1.
3. Unconditioned Merger: Two phonemes collapse into one. $V$ decreases by 1.
Parameters: A proportion parameter $\alpha_\tau \in (0, 1]$ determines the magnitude of probability mass transfer.
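
The three events above can be sketched in Python as operations on a list of probabilities. This is an illustration under stated assumptions, not the authors' code: the function names, the uniform draw of $\alpha$ on $(0, 1]$, the removal of zero-mass phonemes, and the `step` driver (which uses equal event probabilities and uniform phoneme choice) are all hypothetical choices.

```python
import random

def _drop_empty(q):
    # Remove phonemes whose probability mass has been fully transferred away.
    return [x for x in q if x > 0]

def primary_split(p, src, dst, alpha):
    """Shift a fraction alpha of p[src] onto the existing phoneme dst."""
    q = list(p)
    moved = alpha * q[src]
    q[src] -= moved
    q[dst] += moved
    return _drop_empty(q)  # V shrinks only when alpha == 1

def secondary_split(p, src, alpha):
    """Shift a fraction alpha of p[src] into a new phoneme."""
    q = list(p)
    moved = alpha * q[src]
    q[src] -= moved
    q.append(moved)
    return _drop_empty(q)  # V grows by 1 unless alpha == 1 (a pure relabelling)

def unconditioned_merger(p, keep, lose):
    """Collapse phoneme `lose` into phoneme `keep`; V decreases by 1."""
    q = list(p)
    q[keep] += q[lose]
    del q[lose]
    return q

def step(p, rng):
    """One time step: pick an event from {p, s, m}, then apply it."""
    event = rng.choice("psm") if len(p) >= 2 else "s"  # p and m need two phonemes
    alpha = 1.0 - rng.random()  # assumed: alpha drawn uniformly from (0, 1]
    if event == "s":
        return secondary_split(p, rng.randrange(len(p)), alpha)
    src, dst = rng.sample(range(len(p)), 2)
    if event == "p":
        return primary_split(p, src, dst, alpha)
    return unconditioned_merger(p, dst, src)

rng = random.Random(0)
p = [0.2] * 5
for _ in range(100):
    p = step(p, rng)
print(len(p), round(sum(p), 6))  # inventory size varies; total mass stays at 1
```

Every operation only moves probability mass between phonemes, so the vector remains a valid distribution after any number of steps.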

Simulation Progression

The authors tested three incremental versions of the model to determine which assumptions are necessary to replicate empirical data:

Simulation 1 (Naïve Baseline):
- Assumptions: Equal probabilities for all change types ($P(p)=P(s)=P(m)=1/3$). Phonemes are chosen uniformly at random.
- Goal: Establish a baseline for stochastic evolution without bias.
Simulation 2 (Functional Load Bias):
- Assumptions: Incorporates the Functional Load Hypothesis. High-frequency phonemes (proxy for high functional load) are less likely to be split or merged.
- Mechanism: Phonemes to be reduced (split/merged) are sampled with probability proportional to their surprisal (negative log probability, which is larger for rarer phonemes), making rare phonemes more likely to be lost. High-frequency phonemes are sampled uniformly for receiving mass.
Simulation 3 (Central Tendency):
- Assumptions: Adds a stabilizing tendency toward a preferred inventory size ($\mu$).
- Mechanism: The probabilities of splits and mergers become dependent on the current inventory size $V_\tau$.
  - If $V_\tau > \mu$, the probability of splits ($P(s)$) decreases, and the probability of mergers ($P(m)$) increases.
  - If $V_\tau < \mu$, the reverse occurs.
- Implementation: Uses exponential functions to create a smooth bias toward $\mu$ (set to 34, the global mean PIS).
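
The two biases added in Simulations 2 and 3 can be sketched as follows. The surprisal weighting follows the description above. The exponential form of the size-dependent event probabilities is an assumption made for illustration, since the summary only says that exponential functions are used; the strength parameter `k` and all function names are hypothetical.

```python
import math
import random

MU = 34  # preferred inventory size, as given in the summary

def surprisal_weights(p):
    """Surprisal -log2(p_i) of each phoneme: rarer phonemes get larger weights."""
    return [-math.log2(x) for x in p]

def pick_phoneme_to_reduce(p, rng):
    """Sample the phoneme that loses mass, proportionally to its surprisal."""
    w = surprisal_weights(p)
    if sum(w) == 0:  # single phoneme with probability 1: nothing to weigh
        return 0
    return rng.choices(range(len(p)), weights=w, k=1)[0]

def event_probabilities(v, mu=MU, k=0.1):
    """Return P(p), P(s), P(m) for the current inventory size v.

    Assumed form: splits are weighted by exp(-k (v - mu)) and mergers by
    exp(+k (v - mu)), then normalized. At v == mu all three events are
    equally likely, which matches the naive baseline.
    """
    d = v - mu
    w_p, w_s, w_m = 1.0, math.exp(-k * d), math.exp(k * d)
    z = w_p + w_s + w_m
    return {"p": w_p / z, "s": w_s / z, "m": w_m / z}

rng = random.Random(0)
print(event_probabilities(34))  # all three events equally likely at v == mu
print(event_probabilities(45))  # mergers favoured over splits above mu
print(event_probabilities(20))  # splits favoured over mergers below mu
print(pick_phoneme_to_reduce([0.9, 0.1], rng))
```

A larger `k` would pull inventories back toward $\mu$ more strongly, while `k = 0` removes the central tendency altogether.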

3. Key Results

Simulation	Rank-Frequency Shape	PIS vs. Relative Entropy Correlation	Inventory Size Dynamics
Simulation 1 (Naïve)	Matches empirical shape (exponential tail).	Positive ($r = .47$). Contradicts reality.	Unbounded random walk; variance increases indefinitely.
Simulation 2 (Functional Load)	Matches shape but with high variance; "rich-get-richer" effect.	Positive ($r = .68$). Contradicts reality.	Unbounded random walk; variance increases indefinitely.
Simulation 3 (Central Tendency)	Matches empirical shape; reduced variance in low ranks.	Negative ($r = -.12$). Matches reality.	Converges to a stationary distribution around $\mu$.
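
The contrast in the last column can be illustrated with a small Python sketch that tracks inventory size only and ignores the phoneme distribution itself. It is not the authors' simulation: the exponential bias, its strength `k`, the floor of two phonemes, and the run counts are assumptions chosen for illustration.

```python
import math
import random
import statistics

def split_merge_probs(v, mu, k):
    """P(split), P(merger) at size v; k = 0 gives the unbiased walk (assumed form)."""
    d = v - mu
    w_s, w_m = math.exp(-k * d), math.exp(k * d)
    z = 1.0 + w_s + w_m  # the constant 1.0 is the weight of size-preserving events
    return w_s / z, w_m / z

def final_size(steps, mu, k, rng):
    """Run one lineage starting at size mu and return its final inventory size."""
    v = mu
    for _ in range(steps):
        p_s, p_m = split_merge_probs(v, mu, k)
        u = rng.random()
        if u < p_s:
            v += 1  # secondary split adds a phoneme
        elif u < p_s + p_m and v > 2:
            v -= 1  # merger removes a phoneme (assumed floor of two phonemes)
    return v

rng = random.Random(1)
walk = [final_size(1000, 34, 0.0, rng) for _ in range(200)]    # no central tendency
pulled = [final_size(1000, 34, 0.1, rng) for _ in range(200)]  # pulled toward mu
print(statistics.pstdev(walk), statistics.pstdev(pulled))
```

Without the bias, the spread of final sizes keeps growing with the number of steps. With it, sizes stay clustered around the preferred value, which is the behaviour the table reports for Simulation 3.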

Detailed Findings:

Exponential Tails: Even the naïve model (Simulation 1) successfully generates exponential-tailed rank-frequency distributions, suggesting this pattern is a natural consequence of stochastic redistribution of probability mass.
The Entropy Correlation: The negative correlation between PIS and relative entropy (the "compensation" effect) only emerged in Simulation 3.
- In Simulations 1 and 2, the inventory size acted as a random walk with no upper bound, leading to a positive correlation.
- In Simulation 3, the constraint that inventory sizes tend to revert to a mean ($\mu$) created a dynamic where larger inventories naturally develop more predictable (lower entropy) distributions, and smaller inventories have higher entropy.
Epiphenomenal Nature: The results indicate that the observed "compensation" is not necessarily the result of an active optimization mechanism but a statistical side effect of stochastic change constrained by a preferred inventory size.

4. Key Contributions

Generative Explanation: The paper provides a generative model demonstrating that complex macroscopic statistical regularities in phonology can arise from simple, local diachronic processes without assuming explicit optimization.
Refutation of Necessity for Compensation: It challenges the view that the negative PIS-entropy correlation must be caused by a compensatory mechanism. Instead, it shows this correlation can be an epiphenomenon of stochastic dynamics combined with a stabilizing tendency.
Role of Central Tendency: It identifies the "central tendency" (stabilization of inventory size) as a critical, previously under-modeled factor required to replicate the full suite of empirical phonological statistics.
Methodological Framework: It establishes a flexible stochastic framework for simulating phonological evolution that can be extended to test other hypotheses regarding sound change.

5. Significance

Theoretical Impact: The findings suggest that what linguists often interpret as evidence of "functional optimization" or "compensatory organization" in language systems may actually be the natural outcome of historical drift constrained by typological limits. This shifts the burden of proof for optimization hypotheses.
Typological Consistency: The model aligns with typological surveys (e.g., Maddieson, 1984) which show that phoneme inventories are concentrated within a narrow range, supporting the idea of a "preferred" size in human language evolution.
Future Research: The paper opens avenues for investigating whether other linguistic regularities (e.g., in syntax or morphology) are similarly emergent properties of stochastic processes rather than results of direct design or optimization.

In conclusion, the authors demonstrate that the statistical structure of phoneme frequencies is likely a natural consequence of diachronic sound change operating under a stabilizing pressure on inventory size, rather than a result of explicit compensatory mechanisms.
