LLMs Struggle with Abstract Meaning Comprehension More Than Expected

This paper shows that, despite their general capabilities, large language models lag well behind fine-tuned models at abstract meaning comprehension. It proposes a bidirectional attention classifier, inspired by how humans read back and forth between a text and a question, that improves fine-tuned models' performance on the SemEval-2021 Task 4 (ReCAM) benchmark.

Hamoud Alhazmi, Jiachen Jiang

Published 2026-04-15

🧠 The Big Problem: Why AI Gets "Abstract" Concepts Wrong

Imagine you are teaching a very smart robot how to read. You show it a story about a cat chasing a mouse. The robot has no trouble. It knows what a cat and a mouse look like; they are concrete things.

But then, you ask the robot to read a story about freedom, justice, or the economy. Suddenly, the robot gets confused. These are "abstract" ideas. You can't touch them, see them, or hold them in your hand. They are like invisible clouds of meaning.

The Paper's Main Discovery:
The researchers found that even the most famous, super-smart AI models (like GPT-4o, the "brain" behind many chatbots) are actually terrible at understanding these invisible clouds.

  • The Test: They used a game called "ReCAM." It's like a fill-in-the-blank quiz where you have to pick the right abstract word to finish a sentence.
  • The Result: The super-smart AIs (LLMs) got about 65% to 73% right. That sounds okay, but in the world of AI, that's a failing grade. The best human-designed models got 95% right.
  • The Takeaway: Just because an AI can write a poem or chat about your day doesn't mean it truly understands deep, abstract concepts. It's like a parrot that can repeat the word "freedom" perfectly but doesn't know what it actually means.
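To make the quiz format concrete, here is a hypothetical ReCAM-style item, invented purely for illustration (not drawn from the actual dataset): a passage, a question with the abstract word masked by an `@placeholder` token, and a list of candidate answers.

```python
# A made-up example in the spirit of a ReCAM item (illustrative only).
item = {
    "passage": "The company reported strong earnings, and investors grew optimistic.",
    "question": "The report fueled a sense of @placeholder in the market.",
    "options": ["confidence", "justice", "fatigue", "silence", "freedom"],
    "label": 0,  # index of the correct option
}

def fill(question: str, option: str) -> str:
    """Substitute a candidate answer into the masked question."""
    return question.replace("@placeholder", option)

# A model's job is to pick the option whose filled-in sentence
# best matches the passage, e.g.:
candidates = [fill(item["question"], opt) for opt in item["options"]]
```

Notice that every option is grammatically fine; only the abstract meaning of the passage rules the wrong ones out, which is exactly what makes the task hard.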

🛠️ The Solution: Teaching the AI to "Look Back and Forth"

Since the big, expensive AI models were struggling, the researchers decided to try something different. Instead of using the giant "foundation" models, they took a slightly smaller, more focused model (called ELECTRA) and gave it a special new brain upgrade.

They called this upgrade the Bi-Directional Attention Classifier.

The Human Analogy: The Detective's Two-Step Process

Imagine you are a detective trying to solve a mystery. You have a Clue (the passage) and a List of Suspects (the answer options).

  1. Step 1 (Passage → Question): You look at the Clue first. You think, "Okay, this clue mentions 'money' and 'growth.' Which suspect fits that?" You scan your list of suspects to find the match.
  2. Step 2 (Question → Passage): Now, you look at the Suspects. You think, "Suspect A is 'Security.' Does the Clue mention security?" You scan the Clue again to see if it supports this suspect.

Humans do this back-and-forth naturally. We look at the story, then the question, then the story again, until the pieces fit.

The AI's New Superpower:
The researchers programmed the AI to do exactly this.

  • Old Way: The AI looked at the story once, then guessed the answer.
  • New Way (Bi-Directional): The AI looks at the story while looking at the options, and then looks at the options while looking at the story. It creates a "conversation" between the text and the choices.
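That two-way "conversation" can be sketched as two attention passes over one shared similarity matrix. The following is a minimal NumPy sketch under assumed shapes and function names, not the authors' exact architecture:

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_attention(passage, options):
    """Score each answer option via two attention passes.

    passage: (p_len, hidden) vectors, one per passage/question token.
    options: (n_opts, hidden) vectors, one per answer option.
    Returns a score per option (higher = better fit).
    """
    sim = passage @ options.T                 # (p_len, n_opts) similarity matrix
    # Pass 1 (passage -> options): each passage token asks
    # "which option fits this part of the clue?"
    p2o = softmax(sim, axis=1) @ options      # option-aware passage, (p_len, hidden)
    # Pass 2 (options -> passage): each option asks
    # "which parts of the clue support me?"
    o2p = softmax(sim.T, axis=1) @ passage    # passage-aware options, (n_opts, hidden)
    # Fuse the two views: score each option against the pooled passage view.
    pooled = p2o.mean(axis=0)                 # (hidden,)
    return o2p @ pooled                       # (n_opts,)
```

The design point is that the same passage-option similarity matrix is read in both directions: a softmax over the options gives the passage an option-aware view, while a softmax over the passage tokens gives each option a passage-aware view, mirroring the detective's two steps.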

The Result

When they gave the AI this "two-way looking" ability, its score jumped up significantly!

  • Task 1 (Understanding invisible concepts): Improved by 4%.
  • Task 2 (Understanding broad categories): Improved by 3.4%.

It wasn't just a tiny bump; it was enough to make their system one of the top 3 in the world for this specific test.


🏆 The Winners and Losers

Here is a quick summary of who did what in the race:

| The Contender | The Strategy | The Result |
| --- | --- | --- |
| Giant AI Models (GPT-4o, etc.) | Tried to guess the answer by reading the text once. | Struggled. About 70% right. Great at chatting, but bad at deep abstract logic. |
| Standard AI (RoBERTa, ELECTRA) | Learned the task by studying many examples (fine-tuning). | Better. About 85-89% right. |
| The Researchers' AI (ELECTRA + Bi-Directional) | Studied examples AND used the "Two-Step Detective" method. | Champion. Over 90% right, showing that a smart, focused approach beats a giant, unfocused one. |

💡 The "So What?" for You

Why does this matter?

  1. AI isn't magic yet: Even the most advanced AIs have blind spots. They can mimic human language, but they still struggle with the "invisible" parts of meaning that humans grasp easily.
  2. Small changes, big wins: You don't always need a super-computer to solve a problem. Sometimes, just changing how the AI looks at information (like teaching it to look back and forth) is more effective than making the AI bigger.
  3. The Future: To make AI truly smart, we need to teach it not just to memorize facts, but to understand the relationships between abstract ideas, just like a human detective does.

In a nutshell: The paper says, "Hey, the big AI models are actually bad at abstract thinking. But if we teach a smaller model to look at the problem from two different angles at the same time, it becomes a genius."
