Disentangling Similarity and Relatedness in Topic Models

This paper introduces a neural scoring function trained on an LLM-annotated synthetic benchmark to disentangle taxonomic similarity and thematic relatedness in topic models, demonstrating that these distinct semantic dimensions not only characterize differences across model families but also predict downstream task performance.

Hanlin Xiao, Mauricio A. Álvarez, Rainer Breitling

Published Thu, 12 Ma

Imagine you are trying to organize a massive, chaotic library. You want to group books into "topics" so people can find what they need. This is what Topic Models do for text data: they look at thousands of documents and try to figure out what the main themes are.

For a long time, computers did this by looking at co-occurrence: "If the word 'coffee' often appears right next to 'mug' and 'cup,' let's group them together." This is like organizing books by who sits next to whom at a dinner party.

But recently, computers got smarter. They started using Large Language Models (LLMs)—the same tech behind chatbots—to understand words based on their meaning rather than just who they hang out with. Now, if you ask a computer about "coffee," it knows it's similar to "tea" because they are both hot drinks, even if they rarely appear in the same sentence.
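The contrast between the two views can be sketched in a few lines. This is a toy illustration with an invented four-document corpus and hand-made 3-d "embedding" vectors (real models use corpora of thousands of documents and vectors with hundreds of dimensions); the point is only that co-occurrence counts pull "coffee" toward "mug," while meaning-based vectors pull it toward "tea."

```python
from collections import Counter
from math import sqrt

# Toy corpus, invented for illustration
docs = [
    "coffee mug on the desk",
    "coffee cup and mug",
    "pour the coffee into the mug",
    "tea kettle on the stove",
]

# Co-occurrence view: which words share a document with "coffee"?
cooc = Counter()
for d in docs:
    words = set(d.split())
    if "coffee" in words:
        cooc.update(words - {"coffee"})
print(cooc["mug"])  # → 3 ("mug" sits next to "coffee" a lot)
print(cooc["tea"])  # → 0 ("tea" never does)

# Embedding view: hand-made vectors where hot drinks share a direction
emb = {"coffee": [0.9, 0.1, 0.0], "tea": [0.8, 0.2, 0.0], "mug": [0.1, 0.9, 0.1]}

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

print(round(cos(emb["coffee"], emb["tea"]), 2))  # → 0.99 (similar meaning)
print(round(cos(emb["coffee"], emb["mug"]), 2))  # → 0.22 (related, not similar)
```

The same word pair can score high on one view and near zero on the other, which is exactly the tension the paper sets out to measure.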

The Problem:
The authors of this paper realized that these two methods (the old "dinner party" way and the new "smart meaning" way) are doing two very different things, but we didn't have a good way to tell them apart.

  • Method A finds Relatedness: Words that go together in a story (e.g., Doctor and Hospital). They are different things, but they work together.
  • Method B finds Similarity: Words that are basically the same thing (e.g., Doctor and Physician). They are interchangeable.

The paper argues that we need to stop treating these as the same thing. Depending on what you want the computer to do, you might want one type of grouping or the other.

The Solution: A New "Taste Tester"

To fix this, the researchers built a Neural Scorer (think of it as a super-smart taste tester).

  1. Training the Taste Tester: They couldn't find enough human experts to label millions of word pairs, so they used a powerful AI (DeepSeek) to annotate a massive dataset of 51,000 word pairs, and trained the scorer on those labels to distinguish "These words are synonyms" (Similarity) from "These words are partners in crime" (Relatedness).
  2. The Test: They took 13 different topic models (the library organizers) and ran them on 6 different types of text (news, scientific papers, forum posts).
  3. The Result: They plotted every model on a map.
    • Old-school models (like LDA) clustered near the "Relatedness" side. They are great at finding words that belong in the same story.
    • New-school models (using BERT/LLMs) clustered near the "Similarity" side. They are great at finding words that mean the same thing.
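The "map" step above can be sketched as follows. Everything here is a stand-in: the top-words for the two toy topics are invented, and the SIM/REL lookup tables play the role of the paper's trained neural scorer (this is not the authors' code). Each model's topics get an average Similarity score and an average Relatedness score over all top-word pairs, giving that model an (x, y) position on the map.

```python
from itertools import combinations

# Invented top-words: one story-like topic, one synonym-like topic
lda_topic = ["doctor", "hospital", "nurse", "ambulance"]
bert_topic = ["doctor", "physician", "medic", "clinician"]

# Stand-in pair scores; in the paper these come from the neural scorer
SIM = {frozenset(p): s for p, s in [
    (("doctor", "physician"), 0.95), (("doctor", "medic"), 0.85),
    (("physician", "medic"), 0.90), (("doctor", "clinician"), 0.90),
    (("physician", "clinician"), 0.90), (("medic", "clinician"), 0.85),
]}
REL = {frozenset(p): s for p, s in [
    (("doctor", "hospital"), 0.90), (("doctor", "nurse"), 0.85),
    (("doctor", "ambulance"), 0.70), (("hospital", "nurse"), 0.90),
    (("hospital", "ambulance"), 0.85), (("nurse", "ambulance"), 0.70),
]}

def topic_score(words, table, default=0.1):
    """Average pairwise score over a topic's top words."""
    pairs = [frozenset(p) for p in combinations(words, 2)]
    return sum(table.get(p, default) for p in pairs) / len(pairs)

# Each model becomes a point: (avg Similarity, avg Relatedness)
print("LDA-like: ", topic_score(lda_topic, SIM), topic_score(lda_topic, REL))
print("BERT-like:", topic_score(bert_topic, SIM), topic_score(bert_topic, REL))
```

With these toy numbers, the LDA-like topic lands high on the Relatedness axis and low on Similarity, while the BERT-like topic does the reverse, mirroring the clusters the paper reports.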

Why Does This Matter? (The "Right Tool for the Job" Analogy)

The paper shows that knowing which side a model leans toward helps you predict how well it will do at specific tasks.

  • Task 1: Event Monitoring (The "Detective" Job)

    • Goal: You want to know if a news article is about a "bank robbery." You need to see words like police, money, vault, alarm appearing together.
    • Winner: The Relatedness models. They understand that these different words form a coherent story.
    • Loser: The Similarity models might group synonyms tightly (money with cash) but miss the story: police, vault, and alarm are different words that only signal a robbery when they appear together.
  • Task 2: Finding Synonyms (The "Translator" Job)

    • Goal: You want to find all documents that mention "rise" or "increase."
    • Winner: The Similarity models. They know these words are interchangeable and group them tightly.
    • Loser: The Relatedness models might treat them as just two different words that happen to appear in similar contexts, missing the direct link.

The Takeaway

Think of Topic Models like different types of cameras:

  • Some have Wide-Angle Lenses. They capture the whole scene and how objects relate to each other (Relatedness). Great for landscape photos (understanding a story).
  • Some have Zoom Lenses. They focus on specific details and how similar they look (Similarity). Great for macro photos (finding synonyms).

For years, we just asked, "Which camera takes the sharpest picture?" This paper says, "Wait, you need to ask, 'What kind of picture are you trying to take?'"

By measuring Similarity and Relatedness separately, we can finally choose the right "camera" for the job, rather than just guessing which model is "better." This makes AI much more useful for real-world tasks, from organizing news feeds to helping doctors find relevant medical research.