NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating

This paper presents the NCL-UoR system for SemEval-2026 Task 5, demonstrating that a structured prompting strategy with explicit decision rules for Large Language Models outperforms both embedding-based methods and fine-tuned transformers in rating word sense plausibility.

Tong Wu, Thanet Markchom, Huizhi Liang

Published Tue, 10 Ma

Imagine you are a judge at a talent show, but instead of singing or dancing, the contestants are words.

Specifically, these are words that have double meanings (like "bank," which could be a place to keep money or the side of a river). Your job is to read a short, five-sentence story and decide: "How likely is it that the word is being used in this specific meaning?"

You have to give a score from 1 to 5:

  • 1: "No way! That makes zero sense here."
  • 5: "Absolutely! That is exactly what the story means."

This is the challenge of SemEval-2026 Task 5, and a team of researchers (NCL-UoR) built three different types of "AI Judges" to solve it. Here is how they did it, explained simply.


The Three Contenders

The team tried three different ways to teach their AI how to judge these stories.

1. The "Mathematical Matchmaker" (Embedding-Based Methods)

The Analogy: Imagine you have a library of books. You take the story and the word meaning, turn them both into a single "fingerprint" (a list of numbers), and then measure how close those fingerprints are. If the fingerprints are close, the AI thinks the meaning fits.

  • How it worked: They used a ruler to measure the distance between the story's "fingerprint" and the word's "fingerprint."
  • The Result: It was like trying to solve a complex mystery by only looking at the color of the suspect's shoes. It failed miserably. The AI couldn't understand the story; it only saw that the words were vaguely similar. It got a very low score.
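The "ruler" in this approach is typically a similarity measure between two vectors. Here is a minimal sketch of the idea using cosine similarity and toy hand-made "fingerprints" (real systems would use sentence embeddings from a model; the linear mapping to a 1-5 score is an illustrative assumption, not the team's exact method):

```python
import math

def cosine_similarity(a, b):
    """Measure how close two 'fingerprints' (vectors) point: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def similarity_to_score(sim):
    """Naively map a similarity in [-1, 1] onto the 1-5 plausibility scale."""
    return round(1 + 4 * (sim + 1) / 2)

# Toy vectors standing in for real sentence embeddings
story = [0.9, 0.1, 0.3]
money_sense = [0.8, 0.2, 0.4]    # points roughly the same way as the story
river_sense = [-0.5, 0.9, -0.2]  # points a different way

print(similarity_to_score(cosine_similarity(story, money_sense)))  # 5
print(similarity_to_score(cosine_similarity(story, river_sense)))  # 2
```

The weakness is visible right in the code: the score depends only on vector geometry, so a story whose ending reverses the meaning can still land "close" to the wrong sense.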

2. The "Student Who Memorized the Textbook" (Fine-Tuning)

The Analogy: This is like taking a very smart student (a pre-trained AI model) and having them study thousands of these specific stories until they have memorized the patterns. The team gave them special tools (a technique called LoRA, short for Low-Rank Adaptation) to help them learn faster without forgetting everything else they know.

  • How it worked: The team showed the AI thousands of examples and said, "When you see this setup, give a score of 3. When you see that ending, give a 5."
  • The Result: This student did much better than the Matchmaker. They understood the context well. However, when they faced a new type of story they hadn't seen before, they got a bit confused. They were too rigid, relying on what they memorized rather than thinking flexibly.
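The trick behind LoRA is arithmetic: instead of retraining a huge weight matrix, you train two small low-rank matrices whose product stands in for the update. A quick back-of-the-envelope sketch (the hidden size and rank below are hypothetical, chosen just to show the scale of the savings):

```python
# LoRA idea: rather than updating a full d x d weight matrix during fine-tuning,
# learn two small matrices A (r x d) and B (d x r) whose product B @ A is the update.
d, r = 768, 8  # hypothetical hidden size and LoRA rank

full_update_params = d * d        # what ordinary fine-tuning would train per matrix
lora_params = (r * d) + (d * r)   # what LoRA trains instead

print(full_update_params)                                   # 589824
print(lora_params)                                          # 12288
print(round(100 * lora_params / full_update_params, 2))     # 2.08 (percent)
```

Training roughly 2% of the parameters per adapted matrix is why the "student" can specialize quickly without overwriting its general knowledge.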

3. The "Structured Detective" (LLM Prompting)

The Analogy: This is the winner. Instead of forcing the AI to memorize, the team gave it a checklist and a rulebook. They told the AI: "Don't just guess. Act like a detective. Break the story into three parts: the Setup, the Clue, and the Conclusion. Check each part against the rulebook, then give your verdict."

  • The Strategy:
    • Step 1: Look at the beginning (Setup). Does it make this meaning likely?
    • Step 2: Look at the middle (The Clue). Does the word usage fit?
    • Step 3: Look at the end (The Conclusion). This is the most important part! Does the ending confirm or deny the meaning?
    • The Rules: "If the ending clearly contradicts the meaning, you must give a 1 or 2." "If the evidence is mixed, lean toward the lower score."
  • The Result: This approach was the champion. By giving the AI a clear logic path and strict rules, it outperformed the student who memorized the textbook.
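In code, this strategy is just careful prompt assembly. The sketch below is a hypothetical reconstruction of the prompt style described above, not the team's exact wording, and the story, word, and gloss are invented examples:

```python
def build_judge_prompt(story, target_word, sense_gloss):
    """Assemble a structured judging prompt with explicit decision rules
    (hypothetical wording; the actual NCL-UoR prompt may differ)."""
    return f"""You are rating how plausible a word sense is in a short story.

Story: {story}
Target word: {target_word}
Candidate meaning: {sense_gloss}

Work step by step:
1. Setup: does the opening make this meaning likely?
2. Clue: does the way the word is used fit this meaning?
3. Conclusion: does the ending confirm or deny it? Weigh this step most heavily.

Rules:
- If the ending clearly contradicts the meaning, give a 1 or 2.
- If the evidence is mixed, lean toward the lower score.

Answer with a single integer from 1 (implausible) to 5 (certain)."""

prompt = build_judge_prompt(
    "Sam walked along the bank. A fisherman waved from the water's edge.",
    "bank",
    "the sloping land beside a river",
)
print(prompt)
```

The resulting string would then be sent to a chat model; the model's single-integer reply is the plausibility rating.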

The Big Takeaways

1. Rules Beat Rulers
The "Mathematical Matchmaker" (measuring distance) failed because language isn't just about how close words are; it's about the story. You can't measure a plot twist with a ruler.

2. Thinking > Memorizing
The "Student" (Fine-tuning) was good, but the "Detective" (Prompting) was better. The AI didn't need to be retrained on millions of new examples; it just needed to be told how to think.

  • The Surprise: A slightly older, smaller AI model (GPT-4o) actually did better than a newer, massive model (GPT-5) because the instructions (the prompt) were so good. It proved that how you ask the question matters more than how big the brain is.

3. The "Ending" is King
The researchers found that the last sentence of the story is the most important clue. If the beginning sets you up one way, but the ending flips the script, the AI needs to trust the ending. The "Structured Detective" was explicitly told to prioritize the ending, which helped it avoid getting tricked by misleading beginnings.

The Final Score

The best system (The Structured Detective) achieved a score of 0.731 (out of 1.0) in matching human judgment. This means it was very good at understanding the nuance of human language, proving that sometimes, the best way to teach an AI isn't to feed it more data, but to give it a better set of instructions.

In short: Don't just give the AI a dictionary; give it a logic puzzle with clear rules, and it will solve it better than you expect.