Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering

Imagine you are a librarian in a massive, chaotic library where books are constantly being rewritten, translated into different languages, and thrown onto the shelves by people who don't know how to organize them. This library is the internet, and the "books" are claims (statements like "The moon is made of cheese" or "This new vaccine cures the flu").

The problem? The same claim keeps showing up over and over again, sometimes in English, sometimes in Spanish, sometimes with slightly different wording. If a fact-checker tries to verify every single version of this claim individually, they will burn out. They need a way to say, "Wait, these 50 different sentences are all talking about the exact same thing. Let's check them all at once."

This is where the paper "Claim2Vec" comes in.

The Problem: The "Lost in Translation" Mess

The authors explain that current computer tools are like librarians who are very literal. If one book is titled "Heart Attack" and another is titled "Heart Poison", a standard computer might file them as two completely different stories because the words are different, even when both describe the same claim.

Even worse, if the computer sees "Heart Attack" in English and "Heart Attack" in Serbian, it might still think they are different because the computer doesn't "feel" that they mean the same thing across languages. This leads to the computer creating too many tiny, separate piles for things that should be one big pile.

The Solution: Claim2Vec (The "Universal Translator" Librarian)

The researchers created a new tool called Claim2Vec. Think of this as training a super-smart librarian who doesn't just read words, but understands the soul of the sentence.

Here is how they built it:

The Training Camp: They took a pre-existing smart librarian (a model called BGE-M3) and gave it a special training camp.
The Exercise: They showed the librarian pairs of sentences that meant the same thing but were written in different languages or with different words (e.g., "The sky is blue" and "El cielo es azul").
The Lesson: They told the librarian, "These two are twins! Move them closer together on the shelf." They also told it, "These two are strangers! Push them far apart."

This process is called Contrastive Learning. Imagine a dance floor where the goal is to find your dance partner: partners move toward each other while strangers drift apart. In the same way, the computer learns to pull similar claims together and push different ones apart, creating a map where claims that belong together end up in the same neighborhood.

The Results: A Neatly Organized Library

The team tested this new librarian against 14 other existing tools. Here is what happened:

Better Grouping: When they asked the computer to sort thousands of claims into groups, Claim2Vec did a much better job. It stopped splitting up twins and started grouping them correctly.
The "Shape" of the Data: The researchers looked at the "shape" of the data. Before, the claims were scattered like marbles in a bowl. After using Claim2Vec, the similar claims clumped together like magnets, making it much easier for a computer to see the groups.
Cross-Language Magic: The most impressive part? The gains were largest when the groups contained claims in multiple languages. This suggests the librarian learned to match the meaning, not just the words, and that what it learned in one language carried over to others.

Why Does This Matter?

In the real world, misinformation spreads like wildfire. A fake story might start in one country, get translated, and spread to ten others.

Without Claim2Vec: Fact-checkers have to check the story 10 times in 10 different languages.
With Claim2Vec: The system sees all 10 versions as one single "cluster." The fact-checker verifies it once, and the system automatically flags all 10 versions as "Checked."

The Bottom Line

The paper introduces Claim2Vec, a specialized tool that teaches computers to understand that "a heart attack" and "un infarto al corazón" are the same story. By doing this, it helps fact-checkers work faster, smarter, and across all languages, turning a chaotic pile of misinformation into organized, verifiable groups.

In short: It's the difference between a librarian who sorts books by the color of the cover, and one who sorts them by the story inside.

1. Problem Statement

Automated fact-checking systems face a significant challenge due to the rapid recirculation of misinformation, where the same claims reappear in different forms and languages. While existing tasks like claim matching (pairwise similarity) and claim retrieval address this, they are inefficient at scale. The more effective approach is claim clustering, which groups semantically similar claims that can be resolved by a single fact-check.

However, claim clustering remains underexplored, particularly in multilingual settings. Current state-of-the-art multilingual text embedding models (e.g., BGE-M3, LaBSE) are designed for general-purpose semantic similarity or retrieval. They often fail to capture the specific nuances of fact-check claims, leading to:

Lexical Sensitivity: Failing to recognize semantic equivalence when phrasing differs (e.g., "heart attack" vs. "heart poison").
Cross-lingual Variation: Incorrectly separating semantically identical claims written in different languages due to embedding distance.
Suboptimal Clustering: Producing fragmented clusters (over-segmentation) or incorrect merges, hindering automated verification.

2. Methodology

The authors propose Claim2Vec, a specialized multilingual embedding model optimized for fact-check claim clustering. The methodology consists of two main stages:

A. Data Preparation

Datasets: The study utilizes the MultiClaimNet suite (ClaimCheck, ClaimMatch, and MultiClaim), containing over 85k fact-checked claims in 78 languages.
Training/Testing Split: To ensure robust evaluation, the largest dataset (MultiClaim) is partitioned by topic using Latent Dirichlet Allocation (LDA):
- MultiClaim-Train: Global political issues (used for fine-tuning).
- MultiClaim-Test: COVID-19 discussions (used for evaluation).
Positive Pairs: The model is trained on 28,000 pairs of semantically similar claims derived from ground-truth clusters.
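
Deriving positive pairs from ground-truth clusters can be sketched as follows. This is only an illustration: the summary does not say how the 28,000 pairs were sampled, so the exhaustive pairing below, the function name positive_pairs, and the toy claims are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def positive_pairs(claims, cluster_ids):
    """Return every unordered pair of claims that share a cluster id."""
    by_cluster = defaultdict(list)
    for claim, cid in zip(claims, cluster_ids):
        by_cluster[cid].append(claim)

    pairs = []
    for members in by_cluster.values():
        # A cluster of size n yields n*(n-1)/2 pairs; singletons yield none.
        pairs.extend(combinations(members, 2))
    return pairs


# Toy data (hypothetical): three versions of one claim plus an unrelated claim.
claims = [
    "The sky is blue",
    "El cielo es azul",
    "Le ciel est bleu",
    "The moon is made of cheese",
]
cluster_ids = [0, 0, 0, 1]
pairs = positive_pairs(claims, cluster_ids)
```

Because large clusters produce quadratically many pairs, a real pipeline would probably cap or subsample the pairs per cluster; that choice is not described here.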

B. Model Architecture & Training (Claim2Vec)

Base Model: The authors start with BGE-M3, a retrieval-oriented multilingual embedding model.
Fine-tuning Strategy: They employ Contrastive Learning using Multiple Negatives Ranking Loss (MNRL).
- Objective: Pull semantically similar claims closer together in the embedding space while pushing dissimilar claims apart.
- Negative Sampling: Uses in-batch negatives (other claims in the same batch) as implicit negative samples.
- Hyperparameters: 1 epoch, batch size 32, learning rate $1 \times 10^{-5}$.
Clustering Algorithm: Claims are encoded into vectors and clustered using Agglomerative Clustering.
- Threshold Selection: Instead of manual thresholding, the optimal distance threshold is automatically selected by maximizing the Silhouette Score (balancing intra-cluster cohesion and inter-cluster separation).
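
The fine-tuning objective with in-batch negatives, described under Fine-tuning Strategy, can be illustrated with a small dependency-free sketch. It treats each anchor's paired claim as the correct "class" among all positives in the batch and applies cross-entropy over scaled cosine similarities. The scale of 20 and the toy 2-D vectors are assumptions for illustration, not values reported in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnrl_loss(anchors, positives, scale=20.0):
    """Multiple Negatives Ranking Loss for a batch of (anchor, positive) pairs.

    For anchor i, positives[i] is the true match and every other
    positives[j] in the batch acts as an implicit negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(anchors)


# Toy embeddings (hypothetical): matched pairs vs. deliberately swapped pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = mnrl_loss(anchors, aligned)
loss_swapped = mnrl_loss(anchors, swapped)
```

The loss is near zero when each anchor is closest to its own positive and grows when another claim in the batch is closer, which is the pull-together, push-apart behaviour the objective is meant to encourage.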

3. Key Contributions

Claim2Vec Model: The first multilingual embedding model specifically optimized for fact-check claims, enabling better representation of semantically similar claims across languages.
Contrastive Learning Application: Demonstrates that fine-tuning general multilingual encoders with claim-specific pairs significantly enhances semantic alignment and facilitates cross-lingual knowledge transfer.
Comprehensive Evaluation: A rigorous benchmark involving 3 datasets, 7 clustering algorithms, and 14 baseline multilingual embedding models (ranging from 135M to 9B parameters).

4. Experimental Results

The study evaluates performance using Adjusted Rand Index (ARI), Adjusted Mutual Information (AMI), Homogeneity, Completeness, V-Measure, and Silhouette Score (SS).
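
For readers unfamiliar with ARI, the following is a small sketch of the standard pair-counting formula. It is meant to show what the metric measures, not to reproduce the paper's evaluation code; the label lists are toy data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # nothing to disagree about

    # Count item pairs that fall together in both, in truth, and in prediction.
    contingency = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every item is its own cluster in both
    return (sum_cells - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1]
perfect = [5, 5, 5, 9, 9, 9]  # same grouping, different label names
split = [0, 0, 2, 1, 1, 1]    # one claim split off from its true cluster

ari_perfect = adjusted_rand_index(truth, perfect)
ari_split = adjusted_rand_index(truth, split)
```

Label names do not matter, only which items are grouped together. Splitting a single claim away from its cluster lowers the score from 1.0 to about 0.71 in this toy case.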

Superior Performance: Claim2Vec consistently outperforms all 14 baseline models (including BGE-M3, GTE, and LaBSE) across all three datasets.
- ClaimCheck: ARI improved from 0.845 (BGE-M3) to 0.906.
- ClaimMatch: ARI improved from 0.644 to 0.762.
- MultiClaim-Test: ARI improved from 0.610 to 0.626, with a significant boost in Silhouette Score (0.229 $\to$ 0.284).
Error Reduction:
- Split Errors: Claim2Vec significantly reduces "split errors" (where one ground-truth cluster is incorrectly divided). For example, on MultiClaim-Test, split errors dropped from 8,474 (BGE-M3) to 6,383.
- Mismerge Errors: While split errors were the primary target, mismerge errors also decreased, though less dramatically.
Robustness: Claim2Vec maintains superior performance across varying cluster configurations (different numbers of clusters), indicating that the geometric structure of its embedding space is more robust than that of the base model.
Multilingual Analysis:
- Cross-lingual Transfer: Clusters containing claims in multiple languages showed the highest performance gains (in ARI and AMI), indicating effective cross-lingual knowledge transfer.
- Language Disparity: High-resource languages (English, Spanish, French) showed lower error rates, while low-resource languages (e.g., Macedonian, Finnish) still exhibited higher error rates, suggesting a need for more diverse pretraining data.

5. Significance and Conclusion

Efficiency: By accurately clustering claims, fact-checkers can verify a group of claims with a single check, drastically reducing the workload in combating misinformation.
Domain Specificity: The results indicate that general-purpose embedding models fall short for fact-check claim clustering, and that domain-specific fine-tuning helps handle lexical variations and cross-lingual nuances.
Scalability: The automated threshold selection strategy (Silhouette Score) makes the system practical for real-world deployment without requiring ground-truth labels for tuning.
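
A minimal sketch of label-free threshold selection: cluster the embeddings at several candidate distance thresholds and keep the one with the highest Silhouette Score. Cosine distance, average linkage, the candidate thresholds, and the toy 2-D vectors are assumptions for illustration; the summary does not state which linkage or metric the authors used.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def agglomerative(dist, threshold):
    """Average linkage: merge the closest clusters until none is closer than threshold."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, x, y = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if best >= threshold:
            break
        merged = clusters.pop(y)  # y > x, so index x is unaffected
        clusters[x].extend(merged)

    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def silhouette(dist, labels):
    """Mean silhouette coefficient (requires at least two clusters)."""
    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(label, []).append(i)
    scores = []
    for i, label in enumerate(labels):
        own = groups[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist[i][j] for j in other) / len(other)
            for other_label, other in groups.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def select_threshold(vectors, candidates):
    """Return (threshold, score, labels) with the highest silhouette, or None."""
    dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]
    best = None
    for t in candidates:
        labels = agglomerative(dist, t)
        k = len(set(labels))
        if k < 2 or k >= len(vectors):
            continue  # silhouette is undefined for one cluster or all singletons
        score = silhouette(dist, labels)
        if best is None or score > best[1]:
            best = (t, score, labels)
    return best


# Toy embeddings (hypothetical): two tight groups pointing in different directions.
vectors = [[1.0, 0.05], [1.0, 0.0], [0.95, 0.1], [0.0, 1.0], [0.05, 1.0], [0.1, 0.9]]
threshold, score, labels = select_threshold(vectors, [0.001, 0.01, 0.1, 0.5, 1.5])
```

No ground-truth labels appear anywhere in the selection loop, which is what makes this kind of strategy usable on unlabeled production data. The quadratic-to-cubic cost of this naive implementation would not scale to tens of thousands of claims; a library implementation would be used in practice.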

Limitations: The current work assumes each claim expresses a single factual statement (multi-fact claims are not handled), relies on BGE-M3 as the sole base encoder, and is limited by the availability of multilingual datasets for certain low-resource languages. Future work will explore applying Claim2Vec to claim retrieval tasks.
