Semantic Chunking and the Entropy of Natural Language

This paper introduces a statistical model based on self-similar semantic chunking that explains the high redundancy and one-bit-per-character entropy rate of natural language, while predicting that this entropy rate systematically increases with the semantic complexity of the text.

Original authors: Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks

Published 2026-02-19

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Idea: Why is English So Predictable?

Imagine you are playing a game where you have to guess the next letter in a sentence.

  • Random Text: If the text were just random letters (like "XQZJ..."), you would have to guess blindly. There are 26 letters, so your "uncertainty" is high.
  • Real English: If the text is "The quick brown fox jumps over the...", you can almost certainly guess that the next word is "lazy" and the one after is "dog."

Since Claude Shannon's guessing-game experiments in the 1950s, scientists have known that English is roughly 80% redundant. This means that if you know the context, you only need about 1 bit of information (a simple yes/no choice) to predict the next character, instead of the ~4.7 bits you'd need to pick one of 26 letters at random.
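
The arithmetic behind that figure fits in a few lines of Python (a simple check, using 26 letters and the classic ~1 bit-per-character estimate):

```python
import math

# Uncertainty of one character drawn uniformly from 26 letters:
bits_random = math.log2(26)   # ~4.70 bits per character

# Classic Shannon-style estimate for English read in context:
bits_english = 1.0            # ~1 bit per character

redundancy = 1 - bits_english / bits_random
print(f"random text: {bits_random:.2f} bits/char")
print(f"redundancy of English: {redundancy:.0%}")  # ~79%, i.e. roughly 80%
```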

The Question: Why is English so predictable? Is it just because of grammar rules? Or is there a deeper structure?

The Answer: The authors argue that English is predictable because it is built like a Russian nesting doll (or a set of folders inside folders). We don't just read word by word; we understand big ideas, which contain smaller ideas, which contain sentences, which contain words. This paper builds a mathematical model to show that this hierarchical structure is exactly what creates the redundancy we see.


The Core Metaphor: The "Semantic Tree"

Imagine you are reading a story. To understand it, your brain doesn't just scan left to right. It builds a tree in your mind:

  1. The Trunk (Root): The whole story (e.g., "A story about a lost dog").
  2. Big Branches: Major sections (e.g., "The dog gets lost," "The dog searches," "The dog is found").
  3. Small Branches: Paragraphs or scenes.
  4. Leaves: Individual words.

The authors call this a "Semantic Tree." They propose that every text is essentially a tree where the big chunks are split into smaller, meaningful chunks, all the way down to single words.
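
As a minimal sketch (assuming only what the description above says: each node is a chunk of text with smaller child chunks, down to single words), such a tree might be represented like this in Python:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticChunk:
    """One node of the semantic tree: a span of text plus its sub-chunks."""
    text: str
    children: list["SemanticChunk"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # In the paper's construction, the leaves are single words.
        return not self.children

# The whole story is the root; sections, scenes, and words hang below it.
root = SemanticChunk("A story about a lost dog", children=[
    SemanticChunk("The dog gets lost"),
    SemanticChunk("The dog searches"),
    SemanticChunk("The dog is found"),
])
```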

The Experiment: Teaching a Robot to "Chunk"

To test this, the researchers used a modern AI (a Large Language Model) not to write, but to cut up text.

  1. They gave the AI a story.
  2. They asked: "Split this story into up to K meaningful pieces." (For example, split it into 4 main parts).
  3. Then, they took those 4 parts and asked the AI to split each of them into 4 smaller parts.
  4. They kept doing this until they reached single words.

The result was a digital "tree" of the text.
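
Here is a rough sketch of that recursive procedure. The paper uses an LLM to find semantically meaningful boundaries; the `split_into_chunks` stand-in below just cuts text into roughly equal word groups, so only the recursive structure (not the splitting itself) reflects the actual experiment:

```python
def split_into_chunks(text: str, k: int) -> list[str]:
    """Stand-in for the LLM prompt 'split this text into up to k meaningful
    pieces'. Here we simply cut the text into k roughly equal word groups."""
    words = text.split()
    size = max(1, -(-len(words) // k))  # ceiling division
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_tree(text: str, k: int):
    """Recursively chunk `text` until every leaf is a single word."""
    if len(text.split()) <= 1:
        return text  # leaf: a single word
    return [build_tree(piece, k) for piece in split_into_chunks(text, k)]

tree = build_tree("the quick brown fox jumps over the lazy dog", k=4)
```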

The Discovery: The "Magic Number" (K)

The researchers found that if they assumed a specific rule for how these trees grow, their math closely matched the entropy (predictability) measured by the AI.

The rule depends on a single number: K.

  • K represents the maximum number of branches a parent chunk can have.
  • Think of K as the limit of your working memory. How many main ideas can you hold in your head at once while reading?

The Results:

  • Children's Books: These are simple. The AI found they work best with a low K (around 2). You only need to hold two ideas in your head at a time.
  • Regular Text (News, Novels): These are standard. They fit a K of 4. This matches the famous historical estimate that English has an entropy of ~1 bit per character.
  • Poetry: This is complex and abstract. The AI found it needs a high K (around 6). Poetry forces you to juggle many conflicting or dense ideas simultaneously, making it harder to predict the next word.

The "Entropy" Connection

In physics and information theory, Entropy is a measure of surprise or uncertainty.

  • High Entropy: Total chaos (hard to predict).
  • Low Entropy: Highly structured (easy to predict).

The paper derives a direct link: the complexity of the text's tree structure (how many branches it has) determines how predictable the text is.

  • If a text has a simple tree (low K), it is very predictable (low entropy).
  • If a text has a complex tree (high K), it is harder to predict (higher entropy).
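
As a toy illustration (this is not the paper's derivation, just the basic information-theoretic intuition): choosing one of K branches carries at most log2(K) bits, so a larger K means more uncertainty at every branching decision:

```python
import math

for k, genre in [(2, "children's books"), (4, "typical prose"), (6, "poetry")]:
    print(f"K = {k} ({genre}): up to {math.log2(k):.2f} bits per branching choice")
```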

Why This Matters

  1. It explains the "1 bit" mystery: For decades, we knew English was ~1 bit per character, but we didn't know why. This paper says: "It's because our brains naturally organize language into trees with about 4 branches at each level."
  2. It measures difficulty: You can now use this math to measure how "hard" a text is. A poem isn't just "artsy"; it mathematically requires more cognitive load (a higher K) to process than a children's story.
  3. It bridges AI and Humans: The model suggests that the way AI predicts words and the way humans understand stories are governed by the same structural rules. Both rely on this hierarchical "chunking" of meaning.

Summary Analogy

Imagine language is a map of the world.

  • Random text is like terrain with no structure: houses scattered at random, streets leading nowhere in particular. You have no idea where you are going.
  • Natural language is a world with a clear hierarchy: Continents > Countries > Cities > Neighborhoods > Streets > Houses.

The authors discovered that the "traffic" (information flow) in this world moves so smoothly because it is built in layers. If you know the Continent, you can guess the Country. If you know the Country, you can guess the City.

The "Entropy" of the language is simply a measure of how many layers deep you have to go, and how many options you face at each layer, before you reach the specific house (the word). The paper shows that for English, the world is built to a specific blueprint (about 4 branches per level) that makes it surprisingly easy to navigate.
