The complexity of finite smooth words over binary alphabets

Here is an explanation of the paper "The complexity of finite smooth words over binary alphabets" by Julien Cassaigne and Raphaël Henry, translated into everyday language with creative analogies.

The Big Picture: The "Infinite Puzzle"

Imagine you have a magical machine that takes a long string of numbers (like a secret code) and compresses it. The machine looks at how many times each number repeats in a row and replaces that whole group with the count of repeats.

Example: If you feed the machine 221112, it sees "two 2s" and "three 1s" and "one 2". It outputs 231.
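The Magic: If you feed that result (231) back into the machine, it compresses it again. If an infinite string can go through the machine forever and every output still uses only the numbers of the original alphabet, the string is called a "Smooth Word." (The toy string above only illustrates the counting step: its output contains a 3, so it would not qualify over the alphabet {1, 2}.)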

The best-known of these is the Oldenburger-Kolakoski word, a long-standing mathematical mystery. We know it exists, but we don't fully understand its structure. It's like a fractal: if you zoom in, it looks complex, but it follows a hidden rule.
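The "machine" is ordinary run-length encoding, so it can be sketched in a few lines. The following is a minimal illustration, not the paper's construction: the function names `run_lengths` and `kolakoski_prefix` are made up for this example, and it assumes the common convention that the Oldenburger-Kolakoski word over {1, 2} starts 1, 2, 2.

```python
from itertools import groupby


def run_lengths(word):
    """Run-length encode `word`: one count per block of equal symbols."""
    return [len(list(block)) for _, block in groupby(word)]


def kolakoski_prefix(n):
    """First n terms of the Oldenburger-Kolakoski word over {1, 2},
    assuming the convention that it starts 1, 2, 2."""
    seq = [1, 2, 2]
    i = 2  # seq[i] is the length of the next block to append
    while len(seq) < n:
        symbol = 1 if seq[-1] == 2 else 2  # blocks alternate between 1 and 2
        seq.extend([symbol] * seq[i])
        i += 1
    return seq[:n]


print(run_lengths("221112"))   # the toy example from the text: [2, 3, 1]
prefix = kolakoski_prefix(20)
print(prefix)
print(run_lengths(prefix))     # the counts spell out the start of the same word
```

The last count may come from a block that was cut off by the prefix, which is why comparisons on finite pieces should ignore it. Handling such boundary blocks correctly is exactly the kind of detail the finite derivative has to get right.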

The Problem: Too Big to Handle

Mathematicians want to know: How complex is this word?
If you take a piece of the word that is 100 letters long, how many different 100-letter pieces can you find inside the infinite stream?

If the answer is small (like 100), the word is simple and predictable.
If the answer is huge (like a million), the word is chaotic and complex.
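To make "counting pieces" concrete, here is a small sketch that counts the distinct pieces of a given length inside a finite string. The helper name `count_factors` is hypothetical, and the sample string is assumed to be the first 20 letters of the Oldenburger-Kolakoski word (starting 1, 2, 2). A finite sample can only undercount the pieces of the infinite word, so these numbers are lower bounds at best.

```python
def count_factors(word, n):
    """Number of distinct length-n contiguous pieces (factors) of a finite word."""
    return len({word[i:i + n] for i in range(len(word) - n + 1)})


periodic = "12" * 10                  # a very predictable word
sample = "12211212212211211221"       # assumed 20-letter prefix of the Oldenburger-Kolakoski word

for n in (1, 2, 3, 4):
    print(n, count_factors(periodic, n), count_factors(sample, n))
```

The periodic word never has more than two pieces of any length, while the sample already shows more variety at length 3.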

The problem is that the "Smooth Word" is infinite. You can't count the pieces of an infinite thing directly. It's like trying to count every single grain of sand on a beach that keeps growing forever.

The Solution: The "Finite Shadow"

To solve this, the authors work with a concept called "f-smooth words" (finite smooth words).

Think of the infinite Smooth Word as a giant, endless mountain. You can't climb the whole thing. But, the authors realized that every single rock on that mountain is made of the same material as a specific set of small, finite stones (the f-smooth words).

The Paper's First Big Discovery (The "Shadow" Theorem):
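It was already known that every finite piece of an infinite Smooth Word is one of these finite stones. The authors proved the converse: every finite stone really does appear inside some infinite Smooth Word, so the two collections match exactly.

Analogy: Imagine the infinite words are giant, endless tapestries. Every pattern you can find in a tapestry is also found in a specific box of small, finite fabric swatches, and the authors proved that every swatch in the box shows up in some tapestry.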
Why this matters: Instead of trying to count patterns in the infinite mountain, we can just count the patterns in the finite box of stones. It turns an impossible task into a manageable one.

The Mystery of Complexity: The "Growth Rate"

Now that they have the finite box, they asked: How fast does the number of patterns grow as the patterns get longer?

There was a famous guess (conjecture) by a mathematician named Sing. He guessed that the complexity grows at a specific speed, determined by the numbers used in the alphabet (like 1 and 2, or 3 and 5).

The formula looks scary: $\Theta(n^{\rho})$.

Think of $n$ as the length of the word.
Think of $\rho$ (rho) as the "steepness" of the growth hill.
Sing guessed the hill has a specific steepness based on the numbers you use.

What the Authors Did

The paper is a detective story where they tested Sing's guess under different conditions. They split the alphabet into two teams: Even Teams and Odd Teams.

1. The "Even" Team (e.g., {2, 4} or {2, 6})

When the numbers in the alphabet are both even, the world is very orderly.

The Result: The authors proved Sing's guess is 100% correct for these alphabets.
The Analogy: It's like a perfectly symmetrical tree. If you know the rule for one branch, you know the rule for the whole tree. The complexity grows exactly as predicted.

2. The "Odd" Team (e.g., {1, 3} or {3, 5})

When the numbers are odd, things get messy. The "mountain" has weird, jagged peaks.

The Result:
- They proved the minimum complexity (the floor): The word is at least as complex as Sing guessed.
- They improved the maximum complexity (the ceiling): They found a new, tighter limit on how complex it can get. It's lower than earlier bounds, but it still sits above the steepness Sing guessed, so a gap remains between the floor and the ceiling.
The Analogy: Imagine a chaotic jungle. The authors couldn't map every single path, but they proved the jungle is at least as big as a certain size, and they built a fence around it to show it's not infinitely huge.

The "Mistake" They Fixed

The paper also acts as a fact-checker. Another researcher (Huang) had published a paper claiming to solve this problem for all alphabets.

The Issue: Huang used a slightly different definition of the "machine" (the derivative) that didn't quite match the real rules of the game. It was like measuring a room with a ruler that was slightly too short.
The Fix: The authors showed that Huang's results were based on a "fake" set of words that included patterns that couldn't actually exist in the real Smooth Word. They corrected the math to show the true complexity.

Summary: Why Should You Care?

The paper solves a piece of a 60-year-old puzzle: The Oldenburger-Kolakoski word has been a mystery since the 1960s. This paper proves that the "finite pieces" (f-smooth words) are exactly the building blocks of the infinite word.
We know the "speed limit" of complexity: We now know exactly how fast the patterns grow for even numbers, and we have a much better estimate for odd numbers.
It's about order in chaos: Even though these words look random, they follow strict mathematical laws. The authors showed us how to measure that order.

In a nutshell: The authors took a giant, infinite, confusing math problem, broke it down into a finite box of manageable pieces, proved that the box contains all the secrets of the infinite word, and then measured exactly how fast those secrets multiply. They confirmed the theory for "even" numbers and sharpened the theory for "odd" numbers, while also correcting a previous error in the field.

Here is a detailed technical summary of the paper "The complexity of finite smooth words over binary alphabets" by Julien Cassaigne and Raphaël Henry.

1. Problem Statement and Context

The paper addresses the factor complexity (the number of distinct factors of length $n$) of smooth words over binary alphabets $\mathcal{A} = \{a, b\}$ with $1 \le a < b$.

Smooth Words: Infinite words that are infinitely derivable under the run-length encoding operation. The most famous example is the Oldenburger-Kolakoski word ($\kappa_{2,1}$) over $\{1, 2\}$.
The Challenge: Determining the exact asymptotic growth rate of the factor complexity $p(n)$ for these words has been an open problem since the 1970s.
The Tool: Researchers study $f$-smooth words (finite words that remain valid under repeated finite derivation). It is known that factors of smooth words are $f$-smooth.
The Conjecture: Sing (generalizing Dekking) conjectured that the complexity of $f$-smooth words over $\{a, b\}$ grows as:
$p_{C^\infty_f}(n) = \Theta\left(n^\rho\right) \quad \text{where} \quad \rho = \frac{\log(a+b)}{\log\left(\frac{a+b}{2}\right)}$
Previous Issues:
- Only polynomial bounds were known for general alphabets.
- A recent claim by Huang (2023) attempting to generalize bounds to all alphabets was found to contain a fundamental error regarding the definition of the derivative operation.
- The behavior differs based on the parity of the alphabet:
  - Mixed alphabets ($a+b$ is odd): Smooth words seem to contain all $f$-smooth words.
  - Even alphabets ($a, b$ both even): Smooth words have linear complexity and do not contain all $f$-smooth words.
  - Odd alphabets ($a, b$ both odd): Smooth words have linear complexity and miss many $f$-smooth words.
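The conjectured exponent $\rho$ from the formula above is easy to evaluate for concrete alphabets. The sketch below only plugs numbers into that formula. The function name `sing_exponent` is an illustrative choice, not something from the paper.

```python
import math


def sing_exponent(a, b):
    """Conjectured exponent rho = log(a + b) / log((a + b) / 2) for the alphabet {a, b}."""
    if not (1 <= a < b):
        raise ValueError("expected integers with 1 <= a < b")
    return math.log(a + b) / math.log((a + b) / 2)


for a, b in [(1, 2), (2, 4), (1, 3), (3, 5)]:
    print(f"{{{a}, {b}}}: rho = {sing_exponent(a, b):.4f}")
```

For $\{1, 3\}$ the sum is 4, so $\rho = \log 4 / \log 2 = 2$. For $\{3, 5\}$ it is $\log 8 / \log 4 = 1.5$. Larger letters give a smaller conjectured exponent.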

2. Methodology

The authors employ a combination of combinatorial word theory, structural analysis of bispecial factors, and linear algebra (spectral radius analysis).

A. Correction of Previous Errors

The authors first identify a mistake in Huang's work [11]. Huang used an incorrect definition of the finite derivative ($D$) instead of the correct one ($D_f$). This led to an overestimation of the set of $f$-smooth words and incorrect complexity bounds. The paper clarifies the correct definition of $D_f$ and the resulting set $C^\infty_f$.

B. Structural Equivalence (Theorem 1.31)

The authors prove that the set of factors of infinite smooth words is exactly the set of $f$-smooth words ($L(C^\infty) = C^\infty_f$).

Method: They introduce $r$-smooth words (right-derivable words) as a bridge. They show that every $f$-smooth word can be extended to an $r$-smooth word, and every $r$-smooth word is a prefix of an infinite smooth word. This establishes that studying the complexity of $C^\infty_f$ is equivalent to studying the complexity of the factors of smooth words.

C. Bispecial Factor Analysis

To compute complexity, the authors utilize the theory of bispecial factors (factors that can be extended both to the left and right in multiple ways).

Tree Construction: They define finite primitives ($P_{f,a}, P_{f,b}$) which generate new bispecial words from existing ones.
Tree Families: They categorize bispecial $f$-smooth words into infinite binary trees rooted at specific "short" words (e.g., $\epsilon, aa, ba$).
Complexity Formula: Using the second finite difference of the complexity function, they express $p_{C^\infty_f}(n)$ as a sum involving the lengths of words in these trees (specifically the generations $T_i$).
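
For readers who want the formula behind that step, the following is the standard bispecial-factor identity from combinatorics on words, stated for a factorial language $L$ in which every word can be extended on both sides. The symbols $s$, $b$, $m$, $e$, $e_\ell$, $e_r$ and $BS_n$ are notation assumed here for illustration and may differ from the paper's:

$s(n) = p(n+1) - p(n), \qquad b(n) = s(n+1) - s(n) = \sum_{w \in BS_n(L)} m(w), \qquad m(w) = e(w) - e_\ell(w) - e_r(w) + 1$

Here $BS_n(L)$ is the set of bispecial words of length $n$, $e_\ell(w)$ and $e_r(w)$ count the letters that extend $w$ to the left and to the right inside $L$, and $e(w)$ counts the pairs of letters $(x, y)$ with $xwy \in L$. Summing $b$ twice recovers $p$, which is why the lengths of bispecial words control the growth rate.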

D. Asymptotic Analysis

Lower Bound: They calculate the average length of words in the $i$ -th generation of the tree. By analyzing the recurrence relation of these lengths, they derive the lower bound for the complexity.
Upper Bound: They analyze the minimal length ($l_i$) of words in the $i$-th generation.
- Even Alphabets: They prove $l_i$ grows exactly as $(\frac{a+b}{2})^i$, matching the conjecture.
- Odd Alphabets: They use linear algebra (matrices $M$ and $P$) to model the growth of word lengths. They determine the spectral radius ($\lambda$) of these matrices to bound the growth of $l_i$, leading to a new upper bound exponent $\zeta$.

3. Key Contributions and Results

1. Equivalence of Languages (Theorem 1.31)

Result: Over any binary alphabet, the language of factors of smooth words is exactly the set of $f$-smooth words: $L(C^\infty) = C^\infty_f$.
Significance: This validates the study of $f$ -smooth words as the primary method for understanding smooth word complexity, regardless of the alphabet parity.

2. Proof of the Lower Bound (Theorem 1.32)

Result: For any binary alphabet $\{a, b\}$, the complexity satisfies $p_{C^\infty_f}(n) = \Omega(n^\rho)$ with $\rho = \frac{\log(a+b)}{\log(\frac{a+b}{2})}$.
Significance: This confirms the conjectured lower bound for all binary alphabets, resolving a long-standing open question.

3. Proof for Even Alphabets (Theorem 1.32)

Result: If $a$ and $b$ are both even, $p_{C^\infty_f}(n) = \Theta(n^\rho)$.
Significance: This fully proves Sing's conjecture for the class of even alphabets. The complexity is tightly bounded by the conjectured exponent.

4. Improved Upper Bound for Odd Alphabets (Theorem 1.33)

Result: For odd alphabets, the authors provide a new upper bound $p_{C^\infty_f}(n) = O(n^\zeta)$, where $\zeta = \frac{\log(2\lambda)}{\log(\lambda)}$.
- $\lambda$ is the dominant root of a specific cubic polynomial (or a quadratic if $a=1$).
Significance:
- This improves upon previous polynomial bounds (Theorem 1.21).
- It does not yet prove the conjecture for odd alphabets, since the exponent $\zeta$ is still larger than $\rho$, but it narrows the gap significantly.
- The paper notes that for odd alphabets, the minimal length $l_i$ and maximal length $L_i$ of bispecial words grow at different rates, making the exact determination of complexity more difficult than in the even case.
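
The relation between the two exponents can be made explicit with a short calculation. This is only algebra on the two formulas quoted in this summary, not a statement taken from the paper:

$\zeta = \frac{\log(2\lambda)}{\log\lambda} = 1 + \frac{\log 2}{\log\lambda}, \qquad \rho = \frac{\log(a+b)}{\log\left(\frac{a+b}{2}\right)} = 1 + \frac{\log 2}{\log\left(\frac{a+b}{2}\right)}$

So $\zeta$ has the same form as $\rho$ with $\frac{a+b}{2}$ replaced by $\lambda$, and for $\lambda > 1$ the exponent decreases as $\lambda$ increases. Under these formulas, $\zeta = \rho$ exactly when $\lambda = \frac{a+b}{2}$, and any $\lambda$ below that value gives $\zeta > \rho$.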

5. Refutation of Huang's Claims

The paper explicitly demonstrates why Huang's generalized bounds were incorrect due to the misuse of the derivative operator, preventing future researchers from following that flawed path.

4. Significance and Future Directions

The "Mixed" Case: For mixed alphabets (where $a+b$ is odd), smooth words seem to contain all $f$-smooth words, so their complexity is expected to match that of $C^\infty_f$, but the behavior of individual smooth words (recurrence, frequencies) is still not well understood.
Parity Dichotomy: The work highlights a fundamental difference between even and odd alphabets. Even alphabets behave "regularly" (the conjecture holds), while odd alphabets are less regular: smooth words miss many $f$-smooth words, the minimal and maximal lengths of bispecial words grow at different rates, and the complexity bounds are harder to pin down.
Methodological Advance: The use of matrix spectral radii to bound the growth of minimal word lengths in the bispecial tree provides a powerful new tool for analyzing similar combinatorial structures.
Open Problems:
- Proving the conjecture $p(n) = \Theta(n^\rho)$ for odd alphabets (i.e., lowering the upper-bound exponent from $\zeta$ to $\rho$).
- Determining the exact factor frequencies for smooth words over general alphabets (generalizing Keane's conjecture).

In summary, this paper represents a major step forward in combinatorics on words: it settles the complexity conjecture for even alphabets, establishes the lower bound for all alphabets, and corrects previous errors in the literature, while providing a refined framework for tackling the remaining odd-alphabet case.