Imagine you have a giant, super-smart robot that can write poems, generate code, and chat like a human. You want to know: Does this robot actually understand what it's saying, or is it just guessing based on patterns it saw in its training data?
This is the problem the paper "SemBench" tries to solve.
The Old Way: The "Human Teacher" Test
Traditionally, to test if a robot understands words, researchers use a method called WiC (Word-in-Context).
Think of this like a teacher giving a student a quiz. The teacher writes two sentences:
- "I went to a party to dance."
- "The party lost the election."
The student (the AI) has to answer: "Do these two 'parties' mean the same thing?" (The answer is no; one is a celebration, the other is a political group).
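A WiC quiz item is really just a fixed-format question posed to the model. Here is a minimal sketch of what one item might look like as an LLM prompt; the exact wording and format are our illustration, not the paper's.

```python
def wic_prompt(word: str, sent1: str, sent2: str) -> str:
    """Build one Word-in-Context quiz item as a plain-text prompt.

    This template is illustrative; real WiC datasets store the two
    sentences and a gold yes/no label, and the prompt wording varies.
    """
    return (
        f'Sentence 1: "{sent1}"\n'
        f'Sentence 2: "{sent2}"\n'
        f'Question: Does the word "{word}" mean the same thing in both '
        f"sentences? Answer yes or no."
    )

print(wic_prompt("party",
                 "I went to a party to dance.",
                 "The party lost the election."))
```

Every one of those sentence pairs has to be written and labeled by a person, which is exactly the cost the paper is trying to avoid.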
The Problem: Creating these quizzes is hard work.
- It's expensive: You need human experts to write thousands of unique sentences.
- It's slow: It takes forever to make a quiz for a new language.
- It's limited: If you want to test a rare language (like Basque), you might not have enough human-written quizzes to do it.
The New Way: SemBench (The "Dictionary Detective")
The authors of this paper, Mikel and his team, invented SemBench. Instead of asking humans to write quizzes, they built a machine that makes its own quizzes using only a dictionary and a sentence encoder (a tool that measures how similar two sentences are).
Here is how SemBench works, using a creative analogy:
The Analogy: The "Shape-Shifting Translator"
Imagine you have a magic dictionary. You pick a word, say "Bank".
- Pick a Sense: The dictionary says "Bank" has two meanings: (A) A place to keep money, and (B) The side of a river.
- The AI's Job: The AI is asked to act like a translator.
- Step 1: The AI sees the definition of "River Bank" and must invent a sentence using it. (e.g., "We sat on the muddy bank.")
- Step 2: The AI then takes that new sentence and must write a definition for it.
- The Test: The AI's new definition is compared to the original dictionary definition.
- If the AI wrote a definition that sounds like "River Bank," it passed.
- If it wrote something that sounds like "Money Bank," it failed.
If the AI can smoothly switch back and forth between definitions and examples without getting confused, that's strong evidence it truly understands the word's meaning.
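The round-trip idea above can be sketched in a few lines of Python. Everything here is a stand-in: the bag-of-words `embed` plays the role of a real sentence encoder, and `model_definition` is a canned string standing in for what the LLM would actually generate. The scoring rule (the generated definition must be closest to the sense it started from) is the core of the test.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real run would use a sentence encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def round_trip_pass(target_sense: str, senses: dict, generated_def: str) -> bool:
    """Pass if the model's regenerated definition is most similar to the
    sense definition it was originally shown, out of all the word's senses."""
    sims = {name: cosine(embed(generated_def), embed(definition))
            for name, definition in senses.items()}
    return max(sims, key=sims.get) == target_sense

# Two dictionary senses of "bank" (paraphrased for illustration).
senses = {
    "river": "the land alongside or sloping down to a river",
    "money": "a financial establishment that keeps money for customers",
}

# Pretend the model, shown the "river" definition, wrote an example
# sentence and then produced this definition back from it:
model_definition = "the sloping ground at the edge of a river"
print(round_trip_pass("river", senses, model_definition))  # True
```

If the model had drifted to the wrong sense ("a place that stores money"), the nearest definition would be the "money" one and the item would fail, which is exactly the Money Bank vs. River Bank check described above.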
Why is this a Big Deal?
1. It's "Language-Agnostic" (Universal)
You don't need a team of linguists in every country. As long as you have a dictionary (even a simple one) for a language, you can run this test. The authors tested it on English (rich resources), Spanish (medium resources), and Basque (very few resources). It worked for all of them.
2. It's a "Lightweight" Test
You don't need a massive dataset. The paper shows that you need only about 250 to 500 examples to get a clear, reliable ranking of which AI is smarter. It's like testing a runner with a short sprint instead of a full marathon.
3. It's Harder to Cheat
Standard tests often have patterns that AIs can memorize. SemBench is dynamic. Because the AI has to generate the sentence and then reverse-engineer the definition, it's much harder for the AI to just "guess" the right answer. It has to actually understand the logic.
The Results: Did it Work?
The team compared their new "Dictionary Detective" test against the old "Human Teacher" test (WiC).
- The Verdict: The results matched almost perfectly! If an AI did well on the old test, it did well on the new one.
- The Bonus: SemBench was actually better at telling the difference between a "good" AI and a "great" AI. It could spot subtle differences in intelligence that the old tests missed.
- The Low-Resource Win: In Basque (a language with very few digital resources), the new test could still tell which AI was specialized for that language and which was just guessing. The old test struggled here.
The Takeaway
SemBench is like a universal translator's toolkit. Instead of waiting for humans to write thousands of difficult quizzes for every language in the world, we can now use a dictionary and a computer to automatically generate a test that checks if an AI truly understands the meaning of words.
It's cheaper, faster, works for almost any language, and gives us a clearer picture of how "smart" our AI assistants really are.