Semantic Invariance in Agentic AI

This paper introduces a metamorphic testing framework for evaluating the semantic invariance of LLM agents across a set of meaning-preserving transformations and a range of models. The results show that reasoning stability does not correlate with model scale: smaller models like Qwen3-30B-A3B can outperform larger counterparts in robustness.

I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate

Published 2026-03-16

Imagine you have hired a team of brilliant, hyper-intelligent consultants (these are the AI models) to solve complex problems for your company. You ask them a question, and they give you a perfect answer. You are thrilled.

But then, you ask the exact same question again, just phrasing it slightly differently. Maybe you swap a few words, tell the story in a different order, or add a little extra background context.

The Shocking Discovery:
In many cases, your brilliant consultant suddenly gives you a completely different, wrong, or confused answer. It's as if they forgot what you asked the first time, even though the meaning hasn't changed at all.

This paper is about testing exactly how "jittery" these AI consultants are when you change the wording of their instructions. The authors call this "Semantic Invariance"—a fancy way of saying: "If the meaning is the same, the answer should be the same."

The Experiment: The "Shape-Shifting" Test

The researchers didn't just ask the AI the same question twice. They used a clever testing method called Metamorphic Testing. Think of it like a "shape-shifting" challenge for the AI.

They took 19 difficult problems (like math puzzles, physics scenarios, and logic riddles) and gave them to 7 different AI models. But before the AI could answer, the researchers applied 8 different "filters" to the questions:

  1. The Mirror (Identity): Just asking the exact same question (to check if the AI is consistent with itself).
  2. The Translator (Paraphrase): Rewording the question using different synonyms.
  3. The Shuffler (Reorder): Mixing up the order of the facts. (e.g., "The ball is red" then "The ball is heavy" vs. "The ball is heavy" then "The ball is red").
  4. The Storyteller (Expand): Adding extra, unnecessary details to the story.
  5. The Editor (Contract): Cutting out all the fluff and getting straight to the point.
  6. The Professor (Academic Context): Framing the question like a textbook exam.
  7. The CEO (Business Context): Framing the question like a corporate meeting.
  8. The Devil's Advocate (Contrastive): Adding a confusing "what if" scenario or a common mistake to distract the AI.
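In code, the core metamorphic loop is simple: transform the prompt, ask the model, and check whether the answer survives. Here is a minimal Python sketch of that idea; the function names and the toy transformations are illustrative assumptions, not the paper's actual implementation:

```python
# Minimal sketch of a metamorphic test loop (illustrative only).
# `ask_model` is any callable that takes a prompt and returns an answer.

def paraphrase(q):
    # Stand-in for a real paraphraser (e.g., another LLM rewording the prompt).
    return q.replace("What is", "Tell me")

def reorder(q):
    # Stand-in for fact shuffling: reverse the order of the sentences.
    facts = q.split(". ")
    return ". ".join(reversed(facts))

TRANSFORMS = {
    "identity": lambda q: q,   # ask the exact same question
    "paraphrase": paraphrase,  # reword with synonyms
    "reorder": reorder,        # shuffle the stated facts
    # ...expand, contract, academic, business, and contrastive would go here
}

def metamorphic_check(ask_model, question, expected):
    """Apply each transformation and record whether the answer survives it."""
    results = {}
    for name, transform in TRANSFORMS.items():
        answer = ask_model(transform(question))
        results[name] = (answer == expected)
    return results
```

A semantically invariant model would return `True` for every transformation; a "jittery" one flips to `False` on some of them even though the question's meaning never changed.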

The Big Surprises

The results turned the common belief that "Bigger is Better" on its head.

1. The "Giant vs. The Sprinter" (Scale vs. Stability)

Usually, people assume a bigger AI (with more "brain power" or parameters) is smarter and more reliable.

  • The Reality: The researchers found that the smaller models were often more stable than the giants.
  • The Analogy: Imagine a massive cargo ship (the huge AI) and a nimble speedboat (the smaller AI). In calm water, the ship is impressive. But if the water gets choppy (the question gets reworded), the ship starts rolling wildly and loses its balance. The speedboat, however, cuts through the waves smoothly and stays on course.
  • The Winner: A smaller model called Qwen3-30B-A3B was the most reliable. It gave the same correct answer 80% of the time, even when the question was twisted. The massive models often got confused and changed their answers.
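A figure like "80% of the time" can be read as a simple invariance score: the fraction of transformed prompts on which the model still produces the original correct answer. A minimal sketch of such a metric (the paper's exact formula may differ):

```python
def consistency_score(answers, correct):
    """Fraction of transformed prompts answered with the correct answer.

    `answers` maps each transformation name to the model's answer for that
    variant of the question. A score of 1.0 means the model was fully
    invariant to the transformations; lower scores mean "jitter".
    """
    hits = sum(1 for a in answers.values() if a == correct)
    return hits / len(answers)
```

For example, a model that survives four of five transformations scores 0.8, matching the kind of stability figure quoted above.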

2. The "Family Traits" (Different AI Personalities)

Just like human families, different AI models have different weaknesses:

  • The Hermes Family: Great at solving problems, but if you add a "what if" scenario (the Devil's Advocate), they get easily distracted and fail.
  • The DeepSeek Family: They are very sensitive to the order of facts. If you shuffle the sentences, they get lost.
  • The gpt-oss Family: These were the most unstable. They were like a house of cards; a tiny breeze (a small change in wording) made them collapse.
  • The Qwen3 Family: The most balanced and reliable team. They didn't care much how you asked the question; they just solved it.

3. The "Distraction Trap"

The most dangerous test was the Contrastive one. This is when you ask a question but also throw in a confusing, fake alternative (e.g., "Solve this math problem, but remember, some people think the answer is 5, even though it's not").

  • The Result: Every single AI model got worse at this. It seems that when an AI tries to compare two things at once, its attention span breaks. It's like trying to listen to a friend while someone else is shouting a lie next to you; the AI gets confused and stops listening to the truth.

Why Does This Matter?

If you are building a self-driving car, a medical diagnosis tool, or a financial advisor using AI, you can't just trust the "standard test scores."

  • The Problem: Standard tests ask the AI the same question in the same way every time. They don't check if the AI is fragile.
  • The Risk: In the real world, people don't speak like robots. They mix up words, tell stories in different orders, and add distractions. If your AI is fragile, it might give a wrong medical diagnosis just because the doctor phrased the symptoms slightly differently.

The Takeaway

This paper teaches us that reliability is not the same as raw intelligence.

A model might be a genius at solving a specific puzzle, but if it can't handle a slight change in how the puzzle is described, it's not safe to use in the real world. The authors suggest that when we pick AI agents for important jobs, we shouldn't just pick the biggest one. We should pick the one that is calmest under pressure—the one that doesn't get flustered when you change the wording.

In short: Don't just look at how smart the AI is; look at how steady it is when you shake the table.
