Phonological distances for linguistic typology and the origin of Indo-European languages

This paper demonstrates that an information-theoretic analysis of short-range phoneme dependencies in 67 modern languages can effectively reconstruct major language families, detect contact-induced convergence, and support the Steppe hypothesis for the Indo-European homeland by revealing a strong correlation between phonological and geographic distances.

Original authors: Marius Mavridis, Juan De Gregorio, Raul Toral, David Sanchez

Published 2026-04-14

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to solve a mystery that is thousands of years old: Where did the Indo-European language family (which includes English, Spanish, Hindi, Russian, and many others) originally come from?

Usually, linguists solve this by looking at ancient words, like comparing the word for "mother" or "fire" across different languages. But this new paper takes a different approach. Instead of looking at what words mean, the authors looked at how the sounds of each language are made in the mouth and how they follow one another.

Here is the story of their investigation, explained simply.

1. The "Sound Fingerprint"

Imagine every language has a unique "sound fingerprint." This isn't just about which letters are used, but how the sounds flow together.

The researchers took the Bible (because it's available in almost every language and is roughly the same length in each translation) and turned the text into a string of pure sounds (phonemes). They didn't just look at single sounds; they looked at groups of three sounds in a row (like "b-u-k" in "book").

Think of it like listening to a song. If you only listen to one note, you don't know the melody. But if you listen to three notes in a row, you start to hear the rhythm and the style. The researchers found that these "three-sound groups" capture the unique musical style of a language's sound system.
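To make the "three-sound groups" idea concrete, here is a minimal sketch of counting them in a phoneme sequence. It is an illustration only, not the authors' pipeline: the function names and the tiny hand-made transcription are assumptions, and the paper's actual information-theoretic measure is not reproduced here.

```python
from collections import Counter


def phoneme_trigrams(phonemes):
    """Count every run of three consecutive phonemes in a sequence."""
    return Counter(zip(phonemes, phonemes[1:], phonemes[2:]))


def relative_frequencies(counts):
    """Turn raw counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


# Hypothetical hand-made transcription of "book book" (not data from the paper).
sample = ["b", "ʊ", "k", "b", "ʊ", "k"]
counts = phoneme_trigrams(sample)
freqs = relative_frequencies(counts)

for gram, share in freqs.items():
    print("-".join(gram), round(share, 2))
```

The table of proportions is one simple way to picture a "sound fingerprint": two languages with similar tables favor similar sound sequences.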

2. The "Mouth Gym" Map

To measure how different two languages are, they didn't just count how many sounds were different. They looked at how hard your mouth has to work to switch from one sound to another.

  • Analogy: Imagine your mouth is a gym.
    • Moving from a sound made with your lips (like "b") to another lip sound (like "p") is a tiny, easy step.
    • Moving from a lip sound to a sound made at the back of your mouth (like "k") is a giant, exhausting leap.

They built a map where languages that use similar "mouth movements" are close together, and languages that require very different gymnastics are far apart.
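The gym analogy can be sketched as a distance between sounds based on how they are produced. The feature table below is a made-up simplification (a place-of-articulation index plus a voicing flag), and the Manhattan distance is an assumed choice for illustration; the paper's real feature set and metric may differ.

```python
# Hypothetical feature table: (place index, voiced flag).
# Place runs from 0 (lips) to 3 (back of the mouth); values are illustrative only.
FEATURES = {
    "p": (0, 0),
    "b": (0, 1),
    "t": (1, 0),
    "d": (1, 1),
    "k": (3, 0),
    "g": (3, 1),
}


def phoneme_distance(a, b):
    """Sum of absolute feature differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(FEATURES[a], FEATURES[b]))


small_step = phoneme_distance("b", "p")  # same place, only voicing differs
big_leap = phoneme_distance("b", "k")    # lips to back of mouth, plus voicing

print(small_step, big_leap)
```

With these toy numbers, "b" to "p" costs less than "b" to "k", which is the intuition behind the small step and the giant leap.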

3. The Big Discovery: Distance Equals Time

When they plotted all 67 languages on this map, a clear pattern emerged: the farther apart two languages are geographically, the more their sound fingerprints tend to differ.

  • The Logic: Imagine a group of people leaving a home village. As they walk further away, they drift apart. Over centuries, their accents change. The people who stayed home sound most like the original group. The people who walked the furthest developed the most unique accents.
  • The Result: The researchers found that the "sound distance" between languages correlated strongly with their "geographic distance."
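A check of this kind can be sketched with two ingredients: a great-circle distance between two points on the globe and a correlation coefficient between two lists of pairwise distances. This is a minimal illustration under stated assumptions, not the authors' analysis: the helper names are hypothetical, the numbers are invented, and because pairwise distances are not independent of one another, a permutation test (such as a Mantel test) would normally be needed to judge significance.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, assuming a spherical Earth of radius 6371 km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented pairwise values for illustration only (not results from the paper).
geo_km = [100.0, 500.0, 1000.0, 2000.0]
sound = [0.10, 0.20, 0.35, 0.50]

print(round(pearson(geo_km, sound), 3))
```

A coefficient near 1 would mean that sound distance rises steadily with geographic distance; a value near 0 would mean no linear relationship.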

4. Solving the Mystery of the "Homeland"

Now, they applied this rule to the Indo-European family. They asked: "If we treat the average sound of all these languages as a 'center of gravity,' where on the map would that center be?"

  • They calculated the "sound distance" from every modern Indo-European language (like English, Hindi, Greek) back to a theoretical "average ancestor."
  • Then, they translated that sound distance back into miles.
  • The Answer: The spot that minimized the error was north of the Black Sea, in the grassy plains known as the Pontic-Caspian Steppe.
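The "spot that minimized the error" step can be pictured as a search over candidate locations. The sketch below assumes each language already has a location and an estimated distance to the origin (the sound distance converted to kilometers), then scans a coarse grid for the point whose real distances best match those estimates. The function names, the squared-error criterion, and all coordinates are assumptions for illustration; the check uses a synthetic origin, not the paper's data or result.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, assuming a spherical Earth of radius 6371 km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def locate_origin(sites, radii_km, lats, lons):
    """Return the grid point whose distances to the sites best match radii_km."""

    def error(candidate):
        lat, lon = candidate
        return sum(
            (haversine_km(lat, lon, s_lat, s_lon) - radius) ** 2
            for (s_lat, s_lon), radius in zip(sites, radii_km)
        )

    candidates = ((lat, lon) for lat in lats for lon in lons)
    return min(candidates, key=error)


# Synthetic check: choose an arbitrary "true" origin, derive exact radii from it,
# and confirm the grid search recovers it. None of these points come from the paper.
true_origin = (10.0, 10.0)
sites = [(0.0, 0.0), (20.0, 5.0), (15.0, 30.0), (-5.0, 20.0)]
radii = [haversine_km(*true_origin, *site) for site in sites]

estimate = locate_origin(sites, radii, range(0, 21), range(0, 21))
print(estimate)
```

With real data the estimated distances would be noisy, so the best grid point would only approximate the origin, and a finer grid or a proper optimizer would likely be preferable.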

Why This Matters

This supports the "Steppe Hypothesis," which suggests that the ancestors of Indo-European speakers were nomads on horseback who started in the steppes (modern-day Ukraine/Russia) and spread out, carrying their language with them.

This runs against the competing "Anatolian Hypothesis," which suggested the origin was in modern-day Turkey (where farming began earlier). The "sound math" of this paper favors the Steppe.

The Takeaway

The authors didn't need to dig up ancient bones or translate old tablets. They used math and the mechanics of how our mouths move to argue that languages, like people, leave a trail of "sound footprints" that hint at where they started and how far they traveled.

It's like finding a family's origin story just by listening to how they all laugh.
