Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy

This paper demonstrates that while higher behavioral consistency in LLM-based agents generally correlates with improved accuracy on complex tasks like SWE-bench, consistency ultimately amplifies existing interpretations, whether correct or incorrect. Interpretation accuracy is therefore a more critical factor for production reliability than execution consistency alone.

Aman Mehta

Published 2026-03-30

Imagine you hire three different mechanics to fix a very specific, tricky problem in your car's engine. You give them the exact same broken car and the same instructions. You ask them to try fixing it five times each.

This paper is like a report card on how those mechanics behaved. The researchers wanted to know: If you ask an AI (a "smart computer program") to do a task five times, will it do it the same way every time? And more importantly, does doing it the same way mean it's doing a good job?

Here is the breakdown of what they found, using simple analogies:

1. The Three Mechanics (The Models)

The researchers tested three different AI "mechanics" on fixing software bugs (like fixing a broken app):

  • Claude (The Meticulous Master): This AI is slow, careful, and reads the manual thoroughly.
  • GPT-5 (The Speed Demon): This AI is incredibly fast, tries things quickly, and moves on.
  • Llama (The Wild Card): This AI is a bit chaotic, sometimes guessing wildly, and often gets lost.

2. The Big Discovery: "Consistency is a Double-Edged Sword"

The most important lesson from the paper is this: Being consistent doesn't mean you are right. It just means you are reliably doing whatever you decided to do.

Think of it like a GPS navigation app:

  • Scenario A (The Good Consistency): You tell the GPS to go to the beach. It calculates the route, and every time you ask, it gives you the exact same perfect route. Result: You get to the beach every time.
  • Scenario B (The Bad Consistency): You tell the GPS to go to the beach, but it mistakenly believes a swamp is the beach. Because it is so confident and consistent, every single time you ask, it drives you straight to the swamp. It never changes its mind.

The paper found that the "Meticulous Master" (Claude) was the most consistent. However, in 71% of its failures, it failed in exactly the same way on every run. It got stuck on a wrong idea and refused to let go. It was "consistently wrong."
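The distinction between "doing the same thing every time" and "doing the right thing" can be made concrete with a toy sketch. This is not the paper's methodology; the agent names and outcomes below are hypothetical, and the two metrics (modal agreement as consistency, match-rate against a known answer as accuracy) are just one simple way to score repeated runs:

```python
from collections import Counter

def consistency(outcomes):
    """Fraction of runs matching the most common outcome (modal agreement)."""
    most_common_count = Counter(outcomes).most_common(1)[0][1]
    return most_common_count / len(outcomes)

def accuracy(outcomes, correct):
    """Fraction of runs that produced the correct outcome."""
    return sum(o == correct for o in outcomes) / len(outcomes)

# Five repeated runs of the same task (hypothetical outcomes, not paper data):
meticulous = ["patch_A"] * 5  # the same answer every time... but is it right?
wild_card = ["patch_A", "patch_B", "patch_C", "patch_A", "patch_D"]

correct_answer = "patch_B"  # hypothetical ground truth

print(consistency(meticulous), accuracy(meticulous, correct_answer))  # 1.0 0.0
print(consistency(wild_card), accuracy(wild_card, correct_answer))    # 0.4 0.2
```

The "meticulous" agent scores perfect consistency and zero accuracy, while the "wild card" is far less consistent yet occasionally right, which is exactly the swamp-GPS scenario in numbers.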

3. The Trade-Off: Speed vs. Reliability

The researchers noticed a funny triangle of trade-offs:

  • The Speed Demon (GPT-5): It was 4.7 times faster than the Master. It finished the job in a flash. But because it rushed, it was less accurate and its behavior was all over the place (sometimes it fixed it, sometimes it broke it, sometimes it took a weird path).
  • The Meticulous Master (Claude): It took a long time (lots of steps), but it was very steady. When it understood the problem, it solved it perfectly every time. When it misunderstood, it failed perfectly every time.
  • The Wild Card (Llama): It was the least consistent and the least accurate. It was like a mechanic who tries a new tool every time you ask, often making things worse.

4. The "First Step" Trap

You might think, "If they all start by looking at the same file, they should be consistent."
The paper found that starting the same way doesn't guarantee staying the same way.

  • The Speed Demon and the Master both began diverging from run to run at almost the same point (around Step 3).
  • But after that, the Master stayed on a straight, narrow path. The Speed Demon started wandering off in different directions every time.
  • Lesson: It's not about when they start disagreeing; it's about how well they stick to a plan after they start.
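The "divergence point" idea can also be sketched in a few lines. Again, this is an illustrative reconstruction rather than the paper's code; the action names are hypothetical, and the function simply finds the first step at which repeated runs stop agreeing unanimously:

```python
def first_divergence(traces):
    """Return the index of the first step where the runs stop agreeing
    unanimously, or None if they never diverge within their shared length."""
    for step in range(min(len(t) for t in traces)):
        actions = {t[step] for t in traces}
        if len(actions) > 1:
            return step
    return None

# Hypothetical action traces from three repeated runs of one task:
runs = [
    ["read_file", "run_tests", "grep", "edit", "run_tests"],
    ["read_file", "run_tests", "grep", "edit", "submit"],
    ["read_file", "run_tests", "grep", "revert", "edit"],
]
print(first_divergence(runs))  # 3: identical opening moves, then paths split
```

Two models can share the same divergence point yet behave very differently afterward, which is why the step where disagreement begins matters less than how tightly the runs cluster after it.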

5. Why This Matters for the Real World

If you are building a robot or an AI assistant to work in a hospital or a bank, you need to know this:

  • Don't just trust consistency. If an AI is very consistent, it might just be confidently wrong.
  • The real problem is "Understanding." The biggest reason these AIs failed wasn't that they made a typo or forgot a step. It was that they misunderstood the job description in the first place.
  • The Fix: We need to teach AIs to double-check their understanding of the problem, not just to be faster or more consistent in their typing.

Summary in One Sentence

Being a reliable robot is great, but if that robot is confidently wrong, it's actually more dangerous than a robot that is a little chaotic but occasionally gets the right answer.

The paper tells us that for AI to be truly useful, we need to focus on making sure they understand the task correctly first, because once they understand it, their consistency will naturally make them excellent workers.