Imagine you are trying to teach a computer what words mean. In the old days, we did this by building a giant spreadsheet (a matrix) that counted how often every word appeared next to every other word in a massive library of books. If "king" and "queen" always appeared near each other, the computer learned they are related.
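For the curious, the counting step is simple enough to sketch in a few lines of Python. This is a toy version with a made-up corpus and window size (the real papers use huge corpora and tuned windows):

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each word appears within `window` tokens of another."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

corpus = "the king and the queen rule the land".split()
counts = cooccurrence_counts(corpus, window=2)
print(counts[("king", "the")])  # "king" and "the" appear near each other twice
```

Each cell of the "giant spreadsheet" is just one of these pair counts; note that the table comes out symmetric, since if "king" is near "the", then "the" is near "king".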
This paper is a friendly competition between two different ways of processing that spreadsheet to teach the computer.
The Contenders
1. The Popular Stars (PMI-based methods like Word2Vec and GloVe)
Think of these methods as high-powered calculators. They look at the spreadsheet and use a specific mathematical trick called "Pointwise Mutual Information" (PMI). They ask: "How much more likely are these two words to appear together than if they were just random strangers?"
- The Flaw: These calculators get very excited by rare, extreme events. If a weird, specific phrase appears just once in a billion words, the calculator might think it's the most important thing in the universe. This "noise" can mess up the final result.
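The PMI "trick" itself fits in a few lines. Here is a sketch from a raw count matrix (variable names are ours, and real systems add smoothing and clipping to tame the flaw described above):

```python
import numpy as np

def pmi(counts):
    """Pointwise Mutual Information from a co-occurrence count matrix.

    PMI(i, j) = log( P(i, j) / (P(i) * P(j)) ):
    how much more often i and j co-occur than "random strangers" would.
    """
    total = counts.sum()
    p_ij = counts / total                  # joint probabilities
    p_i = p_ij.sum(axis=1, keepdims=True)  # row marginals
    p_j = p_ij.sum(axis=0, keepdims=True)  # column marginals
    with np.errstate(divide="ignore"):     # log(0) = -inf for unseen pairs
        return np.log(p_ij / (p_i * p_j))

counts = np.array([[10.0, 1.0],
                   [1.0, 10.0]])
M = pmi(counts)  # positive on the diagonal (pairs seen often), negative off it
```

The flaw is visible in the formula: a rare pair divides by two tiny marginals, so a single chance co-occurrence of two rare words can produce an enormous PMI score.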
2. The Old School Statistician (Correspondence Analysis or CA)
Think of this method as a wise, steady librarian. It also looks at the same spreadsheet, but it uses a different mathematical lens called "Correspondence Analysis." Instead of just calculating probabilities, it looks at the structure of the data, asking how the words deviate from a random pattern.
- The Discovery: The authors realized that the "Wise Librarian" (CA) and the "High-Powered Calculator" (PMI) are actually doing very similar things, just speaking slightly different mathematical dialects.
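Under the hood, the Librarian's "different lens" is a singular value decomposition of the table's standardized residuals, i.e., how each cell deviates from the random-strangers pattern. A minimal sketch in NumPy (our own simplified version, not the paper's implementation):

```python
import numpy as np

def correspondence_analysis(counts, dim=2):
    """Correspondence Analysis: SVD of the table's standardized residuals."""
    P = counts / counts.sum()
    r = P.sum(axis=1)  # row masses
    c = P.sum(axis=0)  # column masses
    # (observed - expected) / sqrt(expected), where "expected" is the
    # random pattern r * c^T predicted by the margins alone
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    # Row (word) coordinates, rescaled by the row masses
    rows = (U * sigma) / np.sqrt(r)[:, None]
    return rows[:, :dim]
```

One sanity check makes the idea concrete: if the table shows no pattern at all (every cell is exactly what the margins predict), the residuals are zero and every word lands at the origin.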
The New Twist: The "Root" Variants
The authors noticed that the "Wise Librarian" (standard CA) was still getting distracted by those extreme, noisy numbers in the spreadsheet. So, they invented two new ways to clean up the data before the librarian looked at it:
- ROOT-CA (The Square Root): Imagine you have a pile of rocks, and some are tiny pebbles while others are massive boulders. The boulders are so heavy they crush the pebbles. This method takes the "square root" of the weight of every rock. Suddenly, the massive boulders aren't quite so massive, and the pebbles aren't quite so tiny. It levels the playing field.
- ROOTROOT-CA (The Fourth Root): This is even gentler: it takes the square root twice, which is the same as taking the fourth root. Now, the boulders are barely bigger than the pebbles. It's like putting a soft filter over the data so the extreme outliers don't scream so loud.
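The "filters" themselves are one line each; the pipeline then runs standard CA on the transformed table. A sketch of the leveling effect on a pebble, a rock, and a boulder:

```python
import numpy as np

def root_transform(counts, power=0.5):
    """Shrink extreme counts before running CA.

    power=0.5  -> ROOT-CA (square root)
    power=0.25 -> ROOTROOT-CA (fourth root)
    """
    return np.asarray(counts, dtype=float) ** power

counts = np.array([1.0, 100.0, 10000.0])   # pebble, rock, boulder
print(root_transform(counts, 0.5))         # [1, 10, 100]: a 10000x gap becomes 100x
print(root_transform(counts, 0.25))        # roughly [1, 3.2, 10]: gentler still
```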
The Race: Who Won?
The authors ran a marathon using three different "libraries" (corpora: Text8, British National Corpus, and Wikipedia) and tested the methods on four different "obstacle courses" (word similarity tests like "How similar is a tiger to a cat?").
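These obstacle courses typically score a method by checking whether its cosine similarities rank word pairs in the same order humans do, measured with Spearman's rank correlation. A toy sketch (the ratings and vectors below are invented for illustration):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def spearman(x, y):
    """Spearman rank correlation (toy version; assumes no tied values)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # convert scores to ranks
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]              # Pearson correlation of ranks

# Human similarity ratings vs. a model's cosine scores for three word pairs,
# e.g. tiger/cat (high), tiger/dog (middling), tiger/car (low)
human = np.array([9.0, 7.5, 1.2])
model = np.array([0.8, 0.6, 0.1])
print(spearman(human, model))  # 1.0: the model ranks the pairs exactly like humans
```

A score of 1.0 means the model orders the pairs exactly as people do; the benchmarks report this correlation over hundreds of pairs.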
Here is what happened:
- The Calculator vs. The Librarian: The standard "Wise Librarian" (CA) did okay, but the "High-Powered Calculators" (PMI methods) were generally better.
- The Problem with the Calculator: The authors found that the calculators were getting tripped up by those extreme, noisy numbers. When they tried to fix this by adding a weighting step before the factorization (a variant called PMI-GSVD), performance actually got worse: the weighting amplified the noise instead of quieting it.
- The Victory of the "Root" Methods: When the authors applied the Square Root and Fourth Root filters to the Librarian's data, the Librarian suddenly became a superstar.
- ROOT-CA and ROOTROOT-CA performed slightly better than the popular PMI calculators.
- They were so good that they could compete with BERT, a modern, super-complex AI model that uses massive neural networks (think of BERT as a super-intelligent robot that reads the whole book at once).
Why Does This Matter?
You might think, "Why bother with old-school math when we have giant AI models like BERT?"
- Simplicity is Power: The "Root" methods are much simpler and faster. They don't need a supercomputer to run.
- Noisy Data: Real-world data is messy. These new methods show that by simply "turning down the volume" on the extreme outliers (using the root transformations), you can get better results than complex algorithms that try to over-analyze every single detail.
- The Lesson: Sometimes, you don't need a bigger, more complex engine; you just need to smooth out the bumps in the road.
The Bottom Line
This paper tells us that Correspondence Analysis, an old statistical technique, is actually a hidden gem for understanding language. By applying a simple mathematical "filter" (taking the root of the numbers) to calm down the noisy data, these methods can outperform the popular "calculator" methods and even hold their own against the giant AI models. It's a reminder that in the world of data, sometimes the best solution is to simplify, not complicate.