Disentangling Similarity and Relatedness in Topic Models

This paper introduces a neural scoring function trained on an LLM-annotated synthetic benchmark to disentangle taxonomic similarity and thematic relatedness in topic models, demonstrating that these distinct semantic dimensions not only characterize differences across model families but also predict downstream task performance.

Hanlin Xiao, Mauricio A. Álvarez, Rainer Breitling

Published Thu, 12 Ma

Imagine you are trying to organize a massive, chaotic library. You want to group books into "topics" so people can find what they need. This is what Topic Models do for text data: they look at thousands of documents and try to figure out what the main themes are.

For a long time, computers did this by looking at co-occurrence: "If the word 'coffee' often appears right next to 'mug' and 'cup,' let's group them together." This is like organizing books by who sits next to whom at a dinner party.

But recently, computers got smarter. They started using Large Language Models (LLMs)—the same tech behind chatbots—to understand words based on their meaning rather than just who they hang out with. Now, if you ask a computer about "coffee," it knows it's similar to "tea" because they are both hot drinks, even if they rarely appear in the same sentence.
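The contrast between the two views can be sketched in a few lines. This is a toy illustration with an invented four-document corpus and hand-made 3-d "embedding" vectors (real models use corpora of thousands of documents and vectors with hundreds of dimensions); the point is only that co-occurrence counts pull "coffee" toward "mug," while meaning-based vectors pull it toward "tea."

```python
from collections import Counter
from math import sqrt

# Toy corpus, invented for illustration
docs = [
    "coffee mug on the desk",
    "coffee cup and mug",
    "pour the coffee into the mug",
    "tea kettle on the stove",
]

# Co-occurrence view: which words share a document with "coffee"?
cooc = Counter()
for d in docs:
    words = set(d.split())
    if "coffee" in words:
        cooc.update(words - {"coffee"})
print(cooc["mug"])  # → 3 ("mug" sits next to "coffee" a lot)
print(cooc["tea"])  # → 0 ("tea" never does)

# Embedding view: hand-made vectors where hot drinks share a direction
emb = {"coffee": [0.9, 0.1, 0.0], "tea": [0.8, 0.2, 0.0], "mug": [0.1, 0.9, 0.1]}

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

print(round(cos(emb["coffee"], emb["tea"]), 2))  # → 0.99 (similar meaning)
print(round(cos(emb["coffee"], emb["mug"]), 2))  # → 0.22 (related, not similar)
```

The same word pair can score high on one view and near zero on the other, which is exactly the tension the paper sets out to measure.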

The Problem:
The authors of this paper realized that these two methods (the old "dinner party" way and the new "smart meaning" way) are doing two very different things, but we didn't have a good way to tell them apart.

  • Method A finds Relatedness: Words that go together in a story (e.g., Doctor and Hospital). They are different things, but they work together.
  • Method B finds Similarity: Words that are basically the same thing (e.g., Doctor and Physician). They are interchangeable.

The paper argues that we need to stop treating these as the same thing. Depending on what you want the computer to do, you might want one type of grouping or the other.

The Solution: A New "Taste Tester"

To fix this, the researchers built a Neural Scorer (think of it as a super-smart taste tester).

  1. Training the Taste Tester: They couldn't find enough human experts to label millions of word pairs, so they used a powerful AI (DeepSeek) to annotate a massive dataset of 51,000 word pairs, and trained the scorer on those labels to distinguish "These words are synonyms" (Similarity) from "These words are partners in crime" (Relatedness).
  2. The Test: They took 13 different topic models (the library organizers) and ran them on 6 different types of text (news, scientific papers, forum posts).
  3. The Result: They plotted every model on a map.
    • Old-school models (like LDA) clustered near the "Relatedness" side. They are great at finding words that belong in the same story.
    • New-school models (using BERT/LLMs) clustered near the "Similarity" side. They are great at finding words that mean the same thing.
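The "map" step above can be sketched as follows. Everything here is a stand-in: the top-words for the two toy topics are invented, and the SIM/REL lookup tables play the role of the paper's trained neural scorer (this is not the authors' code). Each model's topics get an average Similarity score and an average Relatedness score over all top-word pairs, giving that model an (x, y) position on the map.

```python
from itertools import combinations

# Invented top-words: one story-like topic, one synonym-like topic
lda_topic = ["doctor", "hospital", "nurse", "ambulance"]
bert_topic = ["doctor", "physician", "medic", "clinician"]

# Stand-in pair scores; in the paper these come from the neural scorer
SIM = {frozenset(p): s for p, s in [
    (("doctor", "physician"), 0.95), (("doctor", "medic"), 0.85),
    (("physician", "medic"), 0.90), (("doctor", "clinician"), 0.90),
    (("physician", "clinician"), 0.90), (("medic", "clinician"), 0.85),
]}
REL = {frozenset(p): s for p, s in [
    (("doctor", "hospital"), 0.90), (("doctor", "nurse"), 0.85),
    (("doctor", "ambulance"), 0.70), (("hospital", "nurse"), 0.90),
    (("hospital", "ambulance"), 0.85), (("nurse", "ambulance"), 0.70),
]}

def topic_score(words, table, default=0.1):
    """Average pairwise score over a topic's top words."""
    pairs = [frozenset(p) for p in combinations(words, 2)]
    return sum(table.get(p, default) for p in pairs) / len(pairs)

# Each model becomes a point: (avg Similarity, avg Relatedness)
print("LDA-like: ", topic_score(lda_topic, SIM), topic_score(lda_topic, REL))
print("BERT-like:", topic_score(bert_topic, SIM), topic_score(bert_topic, REL))
```

With these toy numbers, the LDA-like topic lands high on the Relatedness axis and low on Similarity, while the BERT-like topic does the reverse, mirroring the clusters the paper reports.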

Why Does This Matter? (The "Right Tool for the Job" Analogy)

The paper shows that knowing which side a model leans toward helps you predict how well it will do at specific tasks.

  • Task 1: Event Monitoring (The "Detective" Job)

    • Goal: You want to know if a news article is about a "bank robbery." You need to see words like police, money, vault, alarm appearing together.
    • Winner: The Relatedness models. They understand that these different words form a coherent story.
    • Loser: The Similarity models might group synonyms tightly (money with cash) but miss the story: police, vault, and alarm are different words that only signal a robbery when they appear together.
  • Task 2: Finding Synonyms (The "Translator" Job)

    • Goal: You want to find all documents that mention "rise" or "increase."
    • Winner: The Similarity models. They know these words are interchangeable and group them tightly.
    • Loser: The Relatedness models might treat them as just two different words that happen to appear in similar contexts, missing the direct link.

The Takeaway

Think of Topic Models like different types of cameras:

  • Some have Wide-Angle Lenses. They capture the whole scene and how objects relate to each other (Relatedness). Great for landscape photos (understanding a story).
  • Some have Zoom Lenses. They focus on specific details and how similar they look (Similarity). Great for macro photos (finding synonyms).

For years, we just asked, "Which camera takes the sharpest picture?" This paper says, "Wait, you need to ask, 'What kind of picture are you trying to take?'"

By measuring Similarity and Relatedness separately, we can finally choose the right "camera" for the job, rather than just guessing which model is "better." This makes AI much more useful for real-world tasks, from organizing news feeds to helping doctors find relevant medical research.