Non-Zipfian Distribution of Stopwords and Subset Selection Models

Imagine you are organizing a massive library containing millions of books. You want to understand the "vibe" of the library by looking at which words appear most often.

In the world of language, there's a famous rule called Zipf's Law. It's like a strict hierarchy: the most common word appears twice as often as the second most common, three times as often as the third, and so on. If you plot this on a graph with logarithmic axes, it looks like a straight slide going down.

But this paper asks a funny question: What happens if we only look at the "boring" words?

The "Stopwords" (The Library's Background Noise)

In computer science, we call words like "the," "and," "is," "of," and "to" stopwords. They are the glue of a sentence. If you remove them, the sentence might look weird, but you can still guess the meaning.

Example: "The quick brown fox jumps over the lazy dog."
Without stopwords: "Quick brown fox jumps over lazy dog." (You still get it!)
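
For readers who like to see it in code, here is a minimal sketch of stopword removal. The five-word `STOPWORDS` set is a hypothetical mini list built from the examples named above, not a real stopword list; fuller lists such as NLTK's also include words like "over", so they would trim the sentence further.

```python
# Hypothetical mini stopword list for illustration only; real lists
# contain well over a hundred entries.
STOPWORDS = {"the", "and", "is", "of", "to"}


def remove_stopwords(sentence: str) -> list[str]:
    """Return the lowercased words of `sentence` that are not in STOPWORDS."""
    words = sentence.lower().rstrip(".").split()
    return [w for w in words if w not in STOPWORDS]


kept = remove_stopwords("The quick brown fox jumps over the lazy dog.")
print(" ".join(kept))
```

With this tiny list only the two occurrences of "the" disappear, which matches the example sentence above.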

Usually, these stopwords are the kings of the library. They sit at the very top of the popularity list.

The Big Discovery: The Slide Gets Curved

The authors of this paper did something clever. They took large collections of text (Moby Dick and the Brown Corpus, a collection of roughly a million words of English), filtered out all the "meaningful" words, and looked only at the stopwords.

You might expect the stopwords to follow the same straight-line rule (Zipf's Law) as the rest of the library. They don't.

Instead of a straight slide, the stopwords formed a curved slide. It starts out like the usual slide, but then it bends downward and drops off sharply toward the end of the list. In math terms, this curve is described by a Beta Rank Function (BRF).

The Analogy: The VIP List vs. The General Admission

To understand why this happens, imagine a concert.

The Full Crowd (All Words): Imagine a concert where everyone follows a strict rule: The most famous singer gets 1,000 fans, the second gets 500, the third gets 333, etc. This is a straight line.
The Stopwords (The VIPs): Now, imagine you only let the "VIPs" (the stopwords) into a special room.
- The top VIPs (like "the" and "and") are still there.
- But as you go down the list, the rules change. The "less popular" stopwords (like "upon" or "hence") get filtered out much more aggressively than the popular ones.
- It's like a bouncer at the door who plays the odds rather than using a fixed cutoff: "If you are in the top 10, you're almost certainly in. If you are rank 1,000, you probably aren't. If you are rank 10,000, you almost certainly can't come in."

Because the bouncer (the selection process) is stricter on the lower-ranked words, the crowd in the VIP room doesn't look like a straight line anymore. It curves. The paper shows that this "bouncer rule" (mathematically called a Hill's Function) is enough to produce the curve seen in real stopwords.

What About the "Meaningful" Words?

The paper also looked at the words that weren't stopwords (the nouns, verbs, and adjectives).

If you take the stopwords out of the library, the remaining words don't form a straight line either.
Instead, on a log-log plot they form a curved line that looks like a parabola (like the path of a thrown ball).
The authors found that a simple "quadratic" formula (a math equation with a squared term) fits these "meaningful" words better than the other formulas they tested.

Why Does This Matter?

Think of it like this:

Zipf's Law is the rule for the whole library.
Beta Rank Function is the rule for the "boring" background noise.
Quadratic Curves are the rule for the "interesting" content.

The authors built a computer model to simulate this. They started with a perfect straight line (Zipf's Law) and applied their "bouncer rule" to pick out the stopwords. The result? The computer naturally produced the curved "Beta Rank" shape that they saw in real life.

The Takeaway

Language isn't just one simple rule. It's a layered system:

The Whole: Follows a straight, predictable power law.
The Noise (Stopwords): Follows a curved rule because they are a specific "subset" of the whole, filtered by how useful they are.
The Content: Follows a different curved rule because they are what's left over after the noise is removed.

By understanding these different shapes, we can build better AI, search engines, and tools to analyze how humans write and speak. It turns out that even the "boring" words have a very specific, mathematical personality of their own!

Here is a detailed technical summary of the paper "Non-Zipfian Distribution of Stopwords and Subset Selection Models" by Wentian Li and Oscar Fontanelli.

1. Problem Statement

The paper addresses a fundamental discrepancy in quantitative linguistics regarding the statistical distribution of word frequencies.

The Context: It is well established that the rank-frequency distribution of all words in a natural language text generally follows Zipf's Law (an inverse power-law distribution, $T(r) \propto r^{-\alpha}$ with $\alpha \approx 1$).
The Gap: While stopwords (function words like "the," "is," "of") dominate the high-frequency "head" of the full word list, their internal rank-frequency distribution (when stopwords are isolated from the rest of the text) does not follow Zipf's Law.
The Question: What is the specific functional form of the rank-frequency plot for stopwords? Furthermore, how does the process of selecting a subset (stopwords) from a Zipfian parent distribution alter the statistical properties of that subset? The authors also investigate the distribution of the remaining "non-stopwords."
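
As a concrete illustration of what "isolating" a subset means here, the sketch below builds a rank-frequency table and then re-ranks only the words that pass a filter. The function names `rank_frequency` and `rerank`, the toy sentence, and the two-word stopword set are assumptions made for this example; they are not taken from the paper.

```python
import re
from collections import Counter


def rank_frequency(text: str) -> list[tuple[int, str, int]]:
    """Return (rank, word, count) triples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    return [(rank, w, n) for rank, (w, n) in enumerate(counts, start=1)]


def rerank(table, keep):
    """Keep only words accepted by `keep` and assign new ranks 1, 2, ..."""
    subset = [(w, n) for _, w, n in table if keep(w)]
    return [(rank, w, n) for rank, (w, n) in enumerate(subset, start=1)]


# Toy text and toy stopword set, for illustration only.
sample = "the cat and the dog and the bird saw the cat"
toy_stopwords = {"the", "and"}

full = rank_frequency(sample)
stop = rerank(full, lambda w: w in toy_stopwords)
rest = rerank(full, lambda w: w not in toy_stopwords)
print(full[:3], stop, rest[:2], sep="\n")
```

The point of `rerank` is that a word's rank inside the subset generally differs from its rank in the full list, and it is the subset's own rank-frequency curve that the paper studies.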

2. Methodology

The authors employed a combination of empirical data analysis, statistical modeling, and analytic derivation.

Data Sources:
- Texts: The Brown Corpus (~1.1M tokens) and Moby Dick (~210k tokens) for initial analysis. A validation set of 30 Project Gutenberg books was used for independent model testing.
- Stopword Lists: Three distinct lists were utilized:
  1. NLTK: 198 entries (123 non-contracted).
  2. spaCy: 305 entries.
  3. Snowball: 175 entries (used for validation).
Fitting Functions: The authors tested four mathematical models against the rank-frequency data:
1. Zipf's Law: $T = c/r^\alpha$.
2. Quadratic Correction: $\log(T) = c' - \alpha \log(r) - \kappa(\log(r))^2$.
3. Beta Rank Function (BRF): $T = c(r_{max} + 1 - r)^\beta / r^\alpha$.
4. Mandelbrot Function: $T = c/(r+B)^\alpha$.
Data Sampling: To avoid bias toward the "tail" (rare words) in log-log plots, the authors used log-evenly sampled data points rather than using every single rank, ensuring a balanced visual and statistical fit.
Subset Selection Model: The authors proposed a probabilistic model where the probability of a word at rank $r$ being selected as a stopword follows a decreasing Hill's function:
$P(\text{stopword})_r = \frac{1}{1 + (r/r_{mid})^\gamma}$
Conversely, the probability of not being selected (remaining a non-stopword) follows an increasing Hill's function.
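
The subset selection model can be sketched as a small simulation. This is a minimal illustration rather than the authors' code: it assumes an ideal Zipf parent with $c = 1$ and $\alpha = 1$, borrows $r_{mid} = 75$ and $\gamma = 1.78$ from the validation estimates reported in this summary, and uses a fixed random seed. The function names are hypothetical.

```python
import random


def hill_select_prob(r: float, r_mid: float, gamma: float) -> float:
    """Decreasing Hill function: probability that rank r is a stopword."""
    return 1.0 / (1.0 + (r / r_mid) ** gamma)


def simulate_subset(n_ranks, alpha, r_mid, gamma, seed=0):
    """Split an ideal Zipf list into selected and unselected frequencies."""
    rng = random.Random(seed)
    selected, remainder = [], []
    for r in range(1, n_ranks + 1):
        freq = r ** (-alpha)  # Zipf parent with constant c = 1 (assumption)
        if rng.random() < hill_select_prob(r, r_mid, gamma):
            selected.append(freq)
        else:
            remainder.append(freq)
    return selected, remainder


stop, rest = simulate_subset(10_000, alpha=1.0, r_mid=75, gamma=1.78)

# Frequencies are already in descending order, so position + 1 is the
# new rank within each subset.
print("selected:", len(stop), "unselected:", len(rest))
for new_rank in (1, 10, 50, len(stop)):
    print(new_rank, stop[new_rank - 1])
```

Plotting `stop` against its new ranks on log-log axes is one way to check visually whether the simulated subset bends away from a straight line near its tail.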

3. Key Contributions

A. Discovery of the Beta Rank Function (BRF) for Stopwords

The primary empirical finding is that the rank-frequency distribution of stopwords is not Zipfian. Instead, it is best fitted by the Beta Rank Function (BRF).

When stopwords are isolated, their plot curves significantly in log-log space.
The BRF provides a near-perfect fit ($R^2$ values close to 1) for stopwords across different text sources and stopword lists, whereas Zipf's law fails to capture the curvature.
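
One common way to fit a BRF is to note that it is linear in its parameters after taking logarithms: $\log T = \log c + \beta \log(r_{max} + 1 - r) - \alpha \log r$. The sketch below fits that form by ordinary least squares using only the standard library. It is an assumption that this resembles the authors' procedure; it also fits every rank rather than log-evenly sampled points, and the test data are synthetic.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            f = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= f * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(rows, y):
    """Ordinary least squares via the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


def fit_brf(freqs):
    """Fit T = c (r_max + 1 - r)^beta / r^alpha; return (c, alpha, beta)."""
    r_max = len(freqs)
    ranks = range(1, r_max + 1)
    rows = [[1.0, math.log(r_max + 1 - r), -math.log(r)] for r in ranks]
    y = [math.log(t) for t in freqs]
    log_c, beta, alpha = least_squares(rows, y)
    return math.exp(log_c), alpha, beta


# Synthetic data generated from a known BRF, to check parameter recovery.
r_max = 200
freqs = [1000.0 * (r_max + 1 - r) ** 1.3 / r ** 0.9 for r in range(1, r_max + 1)]
c, alpha, beta = fit_brf(freqs)
print(round(c, 3), round(alpha, 3), round(beta, 3))
```

On real counts the recovered parameters will of course not be exact, and the quality of fit would be judged by $R^2$ on the log scale.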

B. The Subset Selection Mechanism

The paper provides a theoretical mechanism explaining why stopwords follow a BRF.

Mechanism: Stopwords are a subset of the full vocabulary. Selection is not uniform across ranks; it is rank-dependent. Top-ranked words (small $r$, very frequent) have a high probability of being stopwords, while words further down the list (large $r$, rare) have a low probability.
Mathematical Derivation: The authors analytically prove that if a parent dataset follows Zipf's Law ($T \propto r^{-\alpha}$) and a subset is selected via a decreasing Hill's function probability, the resulting rank-frequency distribution of the subset converges to the BRF form:
$T(r_{new}) \sim \frac{(R - r_{new})^{\alpha/(\gamma-1)}}{r_{new}^\alpha}$
This ties the BRF parameters to the underlying Zipfian structure and the selection parameters ($r_{mid}, \gamma$): comparing with the BRF definition, $\alpha$ is the parent Zipf exponent, $\beta = \alpha/(\gamma-1)$, and $R$ plays the role of $r_{max}+1$.
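
A heuristic sketch of where the exponent $\alpha/(\gamma-1)$ comes from is given below. It is a continuum approximation written for this summary, not a reproduction of the authors' derivation: the new rank $r_{new}$ is taken to be the expected number of selected words up to parent rank $r$, and only the two limiting regimes are treated.

```latex
\begin{align*}
r_{new}(r) &\approx \int_0^{r} \frac{dr'}{1 + (r'/r_{mid})^{\gamma}},
  \qquad R = \lim_{r \to \infty} r_{new}(r) < \infty \quad (\gamma > 1) \\
R - r_{new}(r) &= \int_r^{\infty} \frac{dr'}{1 + (r'/r_{mid})^{\gamma}}
  \approx \frac{r_{mid}^{\gamma}}{\gamma - 1}\, r^{-(\gamma - 1)}
  \quad (r \gg r_{mid}) \\
\Rightarrow \quad r &\propto (R - r_{new})^{-1/(\gamma - 1)}, \qquad
  T \propto r^{-\alpha} \propto (R - r_{new})^{\alpha/(\gamma - 1)} \\
r \ll r_{mid}: \quad r_{new} &\approx r, \qquad T \propto r_{new}^{-\alpha}
\end{align*}
```

The head behaves like the parent power law and the tail like a power of $R - r_{new}$, which are the two factors of the BRF. As a rough numerical illustration, taking $\alpha \approx 1$ and the validation estimate $\gamma \approx 1.78$ gives $\beta = \alpha/(\gamma - 1) \approx 1.28$.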

C. Non-Stopwords Follow a Quadratic Law

Contrary to the assumption that removing stopwords leaves a "clean" Zipfian distribution of content words, the authors found that non-stopwords deviate significantly from Zipf's Law.

Finding: The rank-frequency plot of non-stopwords is best fitted by a function that is quadratic in log-log scale: $\log(T) \approx c' - \alpha \log(r) - \kappa(\log(r))^2$.
Reasoning: This deviation arises because the removal of stopwords (which are heavily concentrated at the top ranks) distorts the rank ordering of the remaining words. The "head" of the non-stopword list is compressed, creating a curvature that a simple power law cannot capture.
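
To make the comparison between a straight-line (Zipf) fit and a quadratic fit concrete, the sketch below fits polynomials of degree 1 and 2 to $\log T$ as a function of $\log r$ and reports $R^2$ for each. The data are synthetic, generated from a known quadratic with made-up coefficients, so the numbers say nothing about the paper's corpora; the helper names are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            f = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= f * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_loglog(freqs, degree):
    """Fit log T as a polynomial in log r; return (coefficients, R^2)."""
    x = [math.log(r) for r in range(1, len(freqs) + 1)]
    y = [math.log(t) for t in freqs]
    rows = [[xi ** d for d in range(degree + 1)] for xi in x]
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve_linear(xtx, xty)
    y_hat = [sum(c * v for c, v in zip(coef, row)) for row in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return coef, 1.0 - ss_res / ss_tot


# Synthetic curve: log T = 10 - 0.5 log r - 0.05 (log r)^2 (made-up values).
freqs = [math.exp(10 - 0.5 * math.log(r) - 0.05 * math.log(r) ** 2)
         for r in range(1, 1001)]

zipf_coef, r2_zipf = fit_loglog(freqs, degree=1)
quad_coef, r2_quad = fit_loglog(freqs, degree=2)
alpha, kappa = -quad_coef[1], -quad_coef[2]
print("Zipf R^2:", r2_zipf, "quadratic R^2:", r2_quad)
print("alpha:", round(alpha, 4), "kappa:", round(kappa, 4))
```

A positive fitted $\kappa$ means the curve bends downward relative to a straight line, which is the direction of curvature the quadratic formula above encodes.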

D. Validation

The subset selection model was validated using an independent dataset (30 books) and a different stopword list (Snowball). The direct estimation of selection probabilities from this independent data yielded parameters ($r_{mid} \approx 75$, $\gamma \approx 1.78$) consistent with the model derived from the initial texts, confirming the robustness of the Hill's function approach.

4. Results Summary

| Feature | Full Word List | Stopwords (Subset) | Non-Stopwords (Remainder) |
| --- | --- | --- | --- |
| Distribution | Zipf's Law (Power Law) | Beta Rank Function (BRF) | Quadratic Function (in log-log) |
| Fit Quality | High (Linear in log-log) | Excellent (Curved in log-log) | Poor for Zipf; Excellent for Quadratic |
| Key Parameters | Exponent $\alpha \approx 1$ | Exponents $\alpha, \beta$ derived from selection | Linear term $\alpha$, Quadratic term $\kappa$ |
| Mechanism | Natural language generation | Subset selection via decreasing Hill's function | Residual distribution after subset removal |

Stopwords: 92–100% of the top 50 words in the analyzed texts were stopwords.
Fitting Performance: For non-stopwords, the quadratic function achieved $R^2 > 0.99$, significantly outperforming Zipf's law ($R^2 \approx 0.96$) and the Mandelbrot function.

5. Significance and Implications

Refinement of Linguistic Laws: The paper challenges the universality of Zipf's Law, demonstrating that it applies to the aggregate of words but breaks down for specific subsets like stopwords. It identifies the BRF as the best-fitting descriptor for stopwords among the functions tested.
Theoretical Insight into Subset Dynamics: The work provides a general mathematical framework for understanding how selecting a subset from a power-law distribution alters the resulting distribution. This has implications beyond linguistics, potentially applicable to network science, economics, and biology where subsets of power-law systems are analyzed.
NLP and Text Processing:
- Stopword Removal: The findings suggest that simply removing stopwords does not leave a "pure" Zipfian distribution of content words. Algorithms relying on Zipfian assumptions for content words (e.g., in topic modeling or keyword extraction) may need to account for the quadratic deviation.
- Modeling: The subset selection model offers a generative explanation for observed linguistic patterns, moving beyond descriptive fitting to mechanistic understanding.
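Methodological Contribution: The paper highlights the importance of log-even sampling in rank-frequency analysis. Because a log-log plot crowds most ranks into the tail, fitting every rank lets rare words dominate the fit; sampling ranks evenly in $\log(r)$ avoids this bias and improves the accuracy of distributional modeling in quantitative linguistics.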
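
Log-even sampling can be sketched in a few lines. The summary does not spell out the exact scheme the authors used, so the version below is an assumption: it picks ranks whose logarithms are equally spaced between $\log 1$ and $\log r_{max}$, rounds them to integers, and drops duplicates. The function name `log_even_ranks` is hypothetical.

```python
import math


def log_even_ranks(r_max: int, n_points: int) -> list[int]:
    """Pick ranks roughly evenly spaced in log(r) between 1 and r_max."""
    if r_max < 1 or n_points < 2:
        raise ValueError("need r_max >= 1 and n_points >= 2")
    picked = set()
    for i in range(n_points):
        r = round(math.exp(math.log(r_max) * i / (n_points - 1)))
        picked.add(min(max(r, 1), r_max))  # clamp against rounding drift
    return sorted(picked)


ranks = log_even_ranks(10_000, 9)
print(ranks)  # [1, 3, 10, 32, 100, 316, 1000, 3162, 10000]
```

A fit would then use only the frequencies at these ranks, so that each decade of rank contributes about the same number of points instead of the last decade contributing 90% of them.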

In conclusion, Li and Fontanelli demonstrate that stopwords are not merely "noise" but a distinct statistical entity governed by a Beta Rank Function, resulting from a specific rank-dependent selection process from a Zipfian parent distribution. This suggests that quantitative linguists should model subsets of the vocabulary separately rather than assuming that Zipf's Law carries over to them.