NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating

This paper presents the NCL-UoR system for SemEval-2026 Task 5, demonstrating that a structured prompting strategy with explicit decision rules for Large Language Models outperforms both embedding-based methods and fine-tuned transformers in rating word sense plausibility.

Tong Wu, Thanet Markchom, Huizhi Liang

Published Tue, 10 Ma

Imagine you are a judge at a talent show, but instead of singing or dancing, the contestants are words.

Specifically, these are words that have double meanings (like "bank," which could be a place to keep money or the side of a river). Your job is to read a short, five-sentence story and decide: "How likely is it that the word is being used in this specific meaning?"

You have to give a score from 1 to 5:

  • 1: "No way! That makes zero sense here."
  • 5: "Absolutely! That is exactly what the story means."

This is the challenge of SemEval-2026 Task 5, and a team of researchers (NCL-UoR) built three different types of "AI Judges" to solve it. Here is how they did it, explained simply.


The Three Contenders

The team tried three different ways to teach their AI how to judge these stories.

1. The "Mathematical Matchmaker" (Embedding-Based Methods)

The Analogy: Imagine you have a library of books. You take the story and the word meaning, turn them both into a single "fingerprint" (a list of numbers), and then measure how close those fingerprints are. If the fingerprints are close, the AI thinks the meaning fits.

  • How it worked: They used a ruler to measure the distance between the story's "fingerprint" and the word's "fingerprint."
  • The Result: It was like trying to solve a complex mystery by only looking at the color of the suspect's shoes. It failed miserably. The AI couldn't understand the story; it only saw that the words were vaguely similar. It got a very low score.
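The "ruler" in this approach is typically a similarity measure between two vectors. Here is a minimal sketch of the idea using cosine similarity and toy hand-made "fingerprints" (real systems would use sentence embeddings from a model; the linear mapping to a 1-5 score is an illustrative assumption, not the team's exact method):

```python
import math

def cosine_similarity(a, b):
    """Measure how close two 'fingerprints' (vectors) point: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def similarity_to_score(sim):
    """Naively map a similarity in [-1, 1] onto the 1-5 plausibility scale."""
    return round(1 + 4 * (sim + 1) / 2)

# Toy vectors standing in for real sentence embeddings
story = [0.9, 0.1, 0.3]
money_sense = [0.8, 0.2, 0.4]    # points roughly the same way as the story
river_sense = [-0.5, 0.9, -0.2]  # points a different way

print(similarity_to_score(cosine_similarity(story, money_sense)))  # 5
print(similarity_to_score(cosine_similarity(story, river_sense)))  # 2
```

The weakness is visible right in the code: the score depends only on vector geometry, so a story whose ending reverses the meaning can still land "close" to the wrong sense.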

2. The "Student Who Memorized the Textbook" (Fine-Tuning)

The Analogy: This is like taking a very smart student (a pre-trained AI model) and having them study thousands of these specific stories until they have memorized the patterns. The team gave them special tools (a technique called LoRA, short for Low-Rank Adaptation) to help them learn faster without forgetting everything else they know.

  • How it worked: The team showed the AI thousands of examples and said, "When you see this setup, give a score of 3. When you see that ending, give a 5."
  • The Result: This student did much better than the Matchmaker. They understood the context well. However, when they faced a new type of story they hadn't seen before, they got a bit confused. They were too rigid, relying on what they memorized rather than thinking flexibly.
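The trick behind LoRA is arithmetic: instead of retraining a huge weight matrix, you train two small low-rank matrices whose product stands in for the update. A quick back-of-the-envelope sketch (the hidden size and rank below are hypothetical, chosen just to show the scale of the savings):

```python
# LoRA idea: rather than updating a full d x d weight matrix during fine-tuning,
# learn two small matrices A (r x d) and B (d x r) whose product B @ A is the update.
d, r = 768, 8  # hypothetical hidden size and LoRA rank

full_update_params = d * d        # what ordinary fine-tuning would train per matrix
lora_params = (r * d) + (d * r)   # what LoRA trains instead

print(full_update_params)                                   # 589824
print(lora_params)                                          # 12288
print(round(100 * lora_params / full_update_params, 2))     # 2.08 (percent)
```

Training roughly 2% of the parameters per adapted matrix is why the "student" can specialize quickly without overwriting its general knowledge.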

3. The "Structured Detective" (LLM Prompting)

The Analogy: This is the winner. Instead of forcing the AI to memorize, the team gave it a checklist and a rulebook. They told the AI: "Don't just guess. Act like a detective. Break the story into three parts: the Setup, the Clue, and the Conclusion. Check each part against the rulebook, then give your verdict."

  • The Strategy:
    • Step 1: Look at the beginning (Setup). Does it make this meaning likely?
    • Step 2: Look at the middle (The Clue). Does the word usage fit?
    • Step 3: Look at the end (The Conclusion). This is the most important part! Does the ending confirm or deny the meaning?
    • The Rules: "If the ending clearly contradicts the meaning, you must give a 1 or 2." "If the evidence is mixed, lean toward the lower score."
  • The Result: This approach was the champion. By giving the AI a clear logic path and strict rules, it outperformed the student who memorized the textbook.
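In code, this strategy is just careful prompt assembly. The sketch below is a hypothetical reconstruction of the prompt style described above, not the team's exact wording, and the story, word, and gloss are invented examples:

```python
def build_judge_prompt(story, target_word, sense_gloss):
    """Assemble a structured judging prompt with explicit decision rules
    (hypothetical wording; the actual NCL-UoR prompt may differ)."""
    return f"""You are rating how plausible a word sense is in a short story.

Story: {story}
Target word: {target_word}
Candidate meaning: {sense_gloss}

Work step by step:
1. Setup: does the opening make this meaning likely?
2. Clue: does the way the word is used fit this meaning?
3. Conclusion: does the ending confirm or deny it? Weigh this step most heavily.

Rules:
- If the ending clearly contradicts the meaning, give a 1 or 2.
- If the evidence is mixed, lean toward the lower score.

Answer with a single integer from 1 (implausible) to 5 (certain)."""

prompt = build_judge_prompt(
    "Sam walked along the bank. A fisherman waved from the water's edge.",
    "bank",
    "the sloping land beside a river",
)
print(prompt)
```

The resulting string would then be sent to a chat model; the model's single-integer reply is the plausibility rating.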

The Big Takeaways

1. Rules Beat Rulers
The "Mathematical Matchmaker" (measuring distance) failed because language isn't just about how close words are; it's about the story. You can't measure a plot twist with a ruler.

2. Thinking > Memorizing
The "Student" (Fine-tuning) was good, but the "Detective" (Prompting) was better. The AI didn't need to be retrained on millions of new examples; it just needed to be told how to think.

  • The Surprise: A slightly older, smaller AI model (GPT-4o) actually did better than a newer, massive model (GPT-5) because the instructions (the prompt) were so good. It proved that how you ask the question matters more than how big the brain is.

3. The "Ending" is King
The researchers found that the last sentence of the story is the most important clue. If the beginning sets you up one way, but the ending flips the script, the AI needs to trust the ending. The "Structured Detective" was explicitly told to prioritize the ending, which helped it avoid getting tricked by misleading beginnings.

The Final Score

The best system (The Structured Detective) achieved a score of 0.731 (out of 1.0) in matching human judgment. This means it was very good at understanding the nuance of human language, proving that sometimes, the best way to teach an AI isn't to feed it more data, but to give it a better set of instructions.

In short: Don't just give the AI a dictionary; give it a logic puzzle with clear rules, and it will solve it better than you expect.