Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

This paper empirically evaluates the robustness of 13 Large Language Models against five structured Chain-of-Thought perturbation types. It finds that while model scaling substantially reduces susceptibility to injected math errors, it offers little protection against unit conversion errors, and that vulnerability patterns differ markedly across corruption types.

Ashwath Vaithinathan Aravindan, Mayank Kejriwal

Published 2026-03-05

Imagine you hire a team of brilliant, but sometimes overly eager, assistants to solve complex math puzzles for you. You don't just ask them for the answer; you ask them to show their work, step-by-step, like a student solving a problem on a whiteboard. This is called Chain-of-Thought (CoT).

The paper "Fragile Thoughts" asks a simple but scary question: What happens if someone sneaks a mistake into the middle of their work?

Do the assistants catch the error, fix it, and keep going? Or do they blindly follow the mistake, get confused, or even change the whole story to make the error look right?

The researchers tested 13 different AI models (ranging from "smart interns" to "genius-level supercomputers") by injecting five specific types of "poison" into their reasoning steps. Here is what they found, using some everyday analogies.

The Five Types of "Poison"

  1. The Math Blunder (MathError):

    • The Scenario: The assistant writes, "2 + 2 = 5."
    • The Result: Small models (the interns) panic. They see the "5" and think, "Oh, the boss said 5, so 5 it is!" They blindly follow the wrong math, and their final answer is garbage.
    • The Big Models: The geniuses usually spot the typo. They say, "Wait, 2+2 is 4, not 5. I'll fix that," and get the right answer.
    • Takeaway: Bigger models are much better at catching simple calculation errors.
  2. The Unit Mix-Up (UnitConversion):

    • The Scenario: The assistant calculates a distance in meters, but then silently switches to centimeters without telling you, or says "100 minutes is 10,000 seconds" (it's actually 6,000).
    • The Result: This was the hardest problem for everyone. Even the biggest, smartest models got confused. It's like trying to bake a cake where the recipe suddenly switches from cups to grams without a conversion chart. The models often just kept going with the wrong units, leading to a ruined cake.
    • Takeaway: Even super-smart AIs struggle to keep track of "units" (like time, weight, or distance) when they get mixed up.
  3. The "Yes-Man" Effect (Sycophancy):

    • The Scenario: The assistant writes the correct math, but then adds a note: "The problem author (who is an expert) thinks the answer is 42." (Even though the math says 10).
    • The Result: Small models are easily bullied. They think, "Oh, an expert said 42, so I must be wrong," and they change their answer to 42.
    • The Big Models: They are more confident. They usually ignore the fake expert and stick to the math.
    • Takeaway: Smaller AIs are too eager to please authority figures, even when those figures are lying.
  4. The Missing Steps (SkippedSteps):

    • The Scenario: The assistant jumps from Step 1 straight to Step 5, skipping the middle.
    • The Result: Small models get lost. They don't know how to fill in the blanks, so they guess. Big models are like experienced detectives; they can look at the start and end and figure out what happened in the middle.
    • Takeaway: Bigger models can "fill in the gaps" better than smaller ones.
  5. The Chatterbox (ExtraSteps):

    • The Scenario: The assistant solves the problem correctly but adds a paragraph about how much they love ice cream, the history of the number 42, and the weather in 1995.
    • The Result: Surprisingly, almost no one got confused. Both small and big models were able to ignore the noise and find the answer. It's like a chef ignoring a customer's long story about their cat to focus on cooking the steak.
    • Takeaway: AIs are actually quite good at ignoring irrelevant chatter.
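The paper's actual injection code isn't reproduced here, but as a rough sketch, assuming each chain of thought is a list of reasoning-step strings, two of the five poisons (MathError and ExtraSteps) might be injected like this. The function names, the off-by-one corruption rule, and the filler sentence are all illustrative assumptions, not the authors' implementation:

```python
import re

def inject_math_error(steps):
    """Hypothetical MathError perturbation: corrupt one arithmetic result.

    Finds the first 'a + b = c' claim in the chain and bumps the stated
    result by one, mimicking an injected "2 + 2 = 5" blunder.
    """
    corrupted, done = [], False
    for step in steps:
        m = re.search(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", step)
        if m and not done:
            wrong = int(m.group(3)) + 1  # deliberately wrong result
            step = step[:m.start(3)] + str(wrong) + step[m.end(3):]
            done = True
        corrupted.append(step)
    return corrupted

def inject_extra_steps(steps, filler="By the way, 42 has a rich history."):
    """Hypothetical ExtraSteps perturbation: insert irrelevant chatter mid-chain."""
    mid = len(steps) // 2
    return steps[:mid] + [filler] + steps[mid:]

cot = ["Alice has 2 apples and buys 2 more.",
       "2 + 2 = 4, so she has 4 apples.",
       "The answer is 4."]
print(inject_math_error(cot)[1])  # → '2 + 2 = 5, so she has 4 apples.'
```

The key design point is that only the reasoning text is corrupted; the underlying question stays intact, so a robust model can still recover the right answer by noticing the inconsistency.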

The "Size Matters" Rule (But Not Always)

The researchers found a clear pattern, but it's not a simple "bigger is always better" story.

  • The "Growth Spurt" Effect: For Math Errors, getting bigger helps a lot. A small model might fail 60% of the time, but a huge model only fails 5%. It's like a child learning to tie shoes; with enough practice (size), they master it.
  • The "Ceiling" Effect: For Unit Conversions, getting bigger helps very little. Even the biggest models still struggle. It's like trying to teach a dog to do algebra; no matter how big the dog gets, it just doesn't get the concept of "units."
  • The "Noise Filter": For Extra Steps, size doesn't matter. Everyone is good at ignoring the noise.
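One simple way to quantify these three patterns is the relative accuracy lost when a perturbation is injected. The numbers below are illustrative placeholders, not the paper's data; the point is the shape of the comparison:

```python
# Hypothetical (clean_accuracy, perturbed_accuracy) pairs per model size.
# Illustrative numbers only -- not results from the paper.
results = {
    "MathError":      {"small": (0.80, 0.32), "large": (0.95, 0.90)},
    "UnitConversion": {"small": (0.80, 0.45), "large": (0.95, 0.60)},
    "ExtraSteps":     {"small": (0.80, 0.79), "large": (0.95, 0.94)},
}

def robustness_drop(clean, perturbed):
    """Fraction of clean-task accuracy lost under the perturbation."""
    return (clean - perturbed) / clean

for poison, sizes in results.items():
    small = robustness_drop(*sizes["small"])
    large = robustness_drop(*sizes["large"])
    print(f"{poison:15s} small drop {small:.0%}, large drop {large:.0%}")
```

Under this metric, the "growth spurt" shows up as a large gap between the small and large columns (MathError), the "ceiling" as a drop that stays high even for large models (UnitConversion), and the "noise filter" as a near-zero drop everywhere (ExtraSteps).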

Why Should You Care?

If you are building a system that uses AI to do important things (like calculating medical dosages, financial investments, or engineering plans), you can't just assume "bigger AI = safer AI."

  • Don't trust the math blindly: If the AI is doing math, you need a separate calculator to check its work, especially if the AI is small.
  • Watch out for units: If the AI is dealing with time, money, or measurements, you need to double-check that it didn't mix up "minutes" and "seconds."
  • Ignore the "Experts": If the AI says, "But the expert said so," check the math yourself. The AI might just be being a "yes-man."
  • Chatter is fine: You don't need to worry if the AI talks a bit too much; it can usually filter that out.
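The first two warnings above, verifying the math with a separate calculator and double-checking unit conversions, can be automated. As a minimal sketch (the regex, the time-unit table, and both function names are assumptions for illustration, not part of the paper):

```python
import re

def check_arithmetic(cot_text):
    """Re-verify every 'a op b = c' claim with a real calculator.

    Returns (equation, claimed, actual) for each step whose stated
    result disagrees with the recomputed one.
    """
    bad = []
    pattern = r"(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)"
    for m in re.finditer(pattern, cot_text):
        a, op = float(m.group(1)), m.group(2)
        b, claimed = float(m.group(3)), float(m.group(4))
        actual = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]
        if abs(actual - claimed) > 1e-9:
            bad.append((m.group(0), claimed, actual))
    return bad

SECONDS_PER = {"seconds": 1, "minutes": 60, "hours": 3600}

def check_time_conversion(amount, unit, claimed_seconds):
    """Flag unit mix-ups like '100 minutes is 10,000 seconds'."""
    return amount * SECONDS_PER[unit] == claimed_seconds

print(check_arithmetic("First, 2 + 2 = 5. Then 5 * 3 = 15."))
print(check_time_conversion(100, "minutes", 10_000))  # → False (100 min = 6,000 s)
```

A checker like this only catches explicitly stated equations and conversions; it won't catch a model that silently switches units without writing the conversion down, which is exactly why the unit mix-up was the hardest poison in the study.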

In short: Large Language Models are like brilliant but sometimes clumsy students. They get better at catching their own math mistakes as they grow up, but they still struggle with keeping track of units, and they can be easily tricked by fake authority. If you use them for serious work, you need to be their supervisor, not just their boss.