Semantic Invariance in Agentic AI

This paper introduces a metamorphic testing framework for evaluating the semantic invariance of LLM agents across a set of meaning-preserving transformations and a range of models. The results show that reasoning stability does not correlate with model scale: smaller models like Qwen3-30B-A3B can outperform larger counterparts in robustness.

I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate

Published 2026-03-16

Imagine you have hired a team of brilliant, hyper-intelligent consultants (these are the AI models) to solve complex problems for your company. You ask them a question, and they give you a perfect answer. You are thrilled.

But then, you ask the exact same question again, just phrasing it slightly differently. Maybe you swap a few words, tell the story in a different order, or add a little extra background context.

The Shocking Discovery:
In many cases, your brilliant consultant suddenly gives you a completely different, wrong, or confused answer. It's as if they forgot what you asked the first time, even though the meaning hasn't changed at all.

This paper is about testing exactly how "jittery" these AI consultants are when you change the wording of their instructions. The authors call this "Semantic Invariance"—a fancy way of saying: "If the meaning is the same, the answer should be the same."

The Experiment: The "Shape-Shifting" Test

The researchers didn't just ask the AI the same question twice. They used a clever testing method called Metamorphic Testing. Think of it like a "shape-shifting" challenge for the AI.

They took 19 difficult problems (like math puzzles, physics scenarios, and logic riddles) and gave them to 7 different AI models. But before the AI could answer, the researchers applied 8 different "filters" to the questions:

  1. The Mirror (Identity): Just asking the exact same question (to check if the AI is consistent with itself).
  2. The Translator (Paraphrase): Rewording the question using different synonyms.
  3. The Shuffler (Reorder): Mixing up the order of the facts. (e.g., "The ball is red" then "The ball is heavy" vs. "The ball is heavy" then "The ball is red").
  4. The Storyteller (Expand): Adding extra, unnecessary details to the story.
  5. The Editor (Contract): Cutting out all the fluff and getting straight to the point.
  6. The Professor (Academic Context): Framing the question like a textbook exam.
  7. The CEO (Business Context): Framing the question like a corporate meeting.
  8. The Devil's Advocate (Contrastive): Adding a confusing "what if" scenario or a common mistake to distract the AI.
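In code, the core metamorphic loop is simple: transform the prompt, ask the model, and check whether the answer survives. Here is a minimal Python sketch of that idea; the function names and the toy transformations are illustrative assumptions, not the paper's actual implementation:

```python
# Minimal sketch of a metamorphic test loop (illustrative only).
# `ask_model` is any callable that takes a prompt and returns an answer.

def paraphrase(q):
    # Stand-in for a real paraphraser (e.g., another LLM rewording the prompt).
    return q.replace("What is", "Tell me")

def reorder(q):
    # Stand-in for fact shuffling: reverse the order of the sentences.
    facts = q.split(". ")
    return ". ".join(reversed(facts))

TRANSFORMS = {
    "identity": lambda q: q,   # ask the exact same question
    "paraphrase": paraphrase,  # reword with synonyms
    "reorder": reorder,        # shuffle the stated facts
    # ...expand, contract, academic, business, and contrastive would go here
}

def metamorphic_check(ask_model, question, expected):
    """Apply each transformation and record whether the answer survives it."""
    results = {}
    for name, transform in TRANSFORMS.items():
        answer = ask_model(transform(question))
        results[name] = (answer == expected)
    return results
```

A semantically invariant model would return `True` for every transformation; a "jittery" one flips to `False` on some of them even though the question's meaning never changed.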

The Big Surprises

The results turned the common belief that "Bigger is Better" on its head.

1. The "Giant vs. The Sprinter" (Scale vs. Stability)

Usually, people assume a bigger AI (with more "brain power" or parameters) is smarter and more reliable.

  • The Reality: The researchers found that the smaller models were often more stable than the giants.
  • The Analogy: Imagine a massive cargo ship (the huge AI) and a nimble speedboat (the smaller AI). In calm water, the ship is impressive. But if the water gets choppy (the question gets reworded), the ship starts rolling wildly and loses its balance. The speedboat, however, cuts through the waves smoothly and stays on course.
  • The Winner: A smaller model called Qwen3-30B-A3B was the most reliable. It gave the same correct answer 80% of the time, even when the question was twisted. The massive models often got confused and changed their answers.
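A figure like "80% of the time" can be read as a simple invariance score: the fraction of transformed prompts on which the model still produces the original correct answer. A minimal sketch of such a metric (the paper's exact formula may differ):

```python
def consistency_score(answers, correct):
    """Fraction of transformed prompts answered with the correct answer.

    `answers` maps each transformation name to the model's answer for that
    variant of the question. A score of 1.0 means the model was fully
    invariant to the transformations; lower scores mean "jitter".
    """
    hits = sum(1 for a in answers.values() if a == correct)
    return hits / len(answers)
```

For example, a model that survives four of five transformations scores 0.8, matching the kind of stability figure quoted above.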

2. The "Family Traits" (Different AI Personalities)

Just like human families, different AI models have different weaknesses:

  • The Hermes Family: Great at solving problems, but if you add a "what if" scenario (the Devil's Advocate), they get easily distracted and fail.
  • The DeepSeek Family: They are very sensitive to the order of facts. If you shuffle the sentences, they get lost.
  • The gpt-oss Family: These were the most unstable. They were like a house of cards; a tiny breeze (a small change in wording) made them collapse.
  • The Qwen3 Family: The most balanced and reliable team. They didn't care much how you asked the question; they just solved it.

3. The "Distraction Trap"

The most dangerous test was the Contrastive one. This is when you ask a question but also throw in a confusing, fake alternative (e.g., "Solve this math problem, but remember, some people think the answer is 5, even though it's not").

  • The Result: Every single AI model got worse at this. It seems that when an AI tries to compare two things at once, its attention span breaks. It's like trying to listen to a friend while someone else is shouting a lie next to you; the AI gets confused and stops listening to the truth.

Why Does This Matter?

If you are building a self-driving car, a medical diagnosis tool, or a financial advisor using AI, you can't just trust the "standard test scores."

  • The Problem: Standard tests ask the AI the same question in the same way every time. They don't check if the AI is fragile.
  • The Risk: In the real world, people don't speak like robots. They mix up words, tell stories in different orders, and add distractions. If your AI is fragile, it might give a wrong medical diagnosis just because the doctor phrased the symptoms slightly differently.

The Takeaway

This paper teaches us that reliability is not the same as raw intelligence.

A model might be a genius at solving a specific puzzle, but if it can't handle a slight change in how the puzzle is described, it's not safe to use in the real world. The authors suggest that when we pick AI agents for important jobs, we shouldn't just pick the biggest one. We should pick the one that is calmest under pressure—the one that doesn't get flustered when you change the wording.

In short: Don't just look at how smart the AI is; look at how steady it is when you shake the table.
