AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios

This paper introduces AgentCoMa, a new benchmark showing that while large language models can handle isolated commonsense and mathematical reasoning steps, their performance degrades sharply when the two must be combined in real-world scenarios. This reveals a brittleness not observed in human annotators or in prior benchmarks.

Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani, Marek Rei

Published Wed, 11 Ma

Imagine you are hiring a very smart, well-read robot assistant to help you plan your life. You ask it to do two things at once: figure out what makes sense (like "I shouldn't mop a carpet") and do the math (like "If I mop 10 square meters, how much time does that take?").

You'd expect this robot to be great at both. And it is! If you ask it just the common sense question, it gets it right. If you ask it just the math question, it gets that right too.

But here is the twist: When you ask it to do both together, the robot suddenly starts acting like it's lost its mind. It forgets the common sense rule, or it gets the math wrong, even though it knew both answers perfectly when asked separately.

This is exactly what a new research paper called AgentCoMa discovered.

The "Brain Glitch" Experiment

The researchers created a special test (a benchmark) called AgentCoMa. Think of it as a gym workout for AI brains, but instead of lifting weights, the AI has to combine two different types of thinking:

  1. Common Sense (The "Fast Brain"): Like knowing that you can't iron a wool sweater because it will shrink, or that you shouldn't vacuum a leather chair. This is intuitive, everyday knowledge.
  2. Math (The "Slow Brain"): Like calculating the total cost of groceries or figuring out how many days a project will take. This requires logic and numbers.

They tested 61 different AI models (from small ones to massive super-brains) on these mixed tasks.

The Big Surprise: The "Compositionality Gap"

Here is what happened:

  • The Isolated Test: When the AI was asked only the common sense part, it scored 85%. When asked only the math part, it also scored 85%.
  • The Combined Test: When asked to do both in one go, the score plummeted to 42%.

That is a drop of 43 percentage points, wiping out more than half of its accuracy!
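The arithmetic behind that headline number can be sketched in a few lines. This is purely illustrative: the function name and the rounded accuracy figures (85% on each isolated step, 42% on the composed task) are taken from this summary, not from the paper's code.

```python
# Illustrative sketch: the "compositionality gap" is the difference between
# accuracy on the isolated reasoning steps and accuracy on the composed task.

def compositionality_gap(step_accuracy: float, composed_accuracy: float) -> float:
    """Return the gap in percentage points (hypothetical helper name)."""
    return step_accuracy - composed_accuracy

# Rounded figures quoted above: 85% on each isolated step, 42% combined.
gap = compositionality_gap(85.0, 42.0)
print(f"Gap: {gap:.0f} percentage points")  # Gap: 43 percentage points
```

Note that 43 points off an 85-point baseline is a relative drop of roughly half, which is why the word "plummeted" is fair.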

To put this in perspective, the researchers also asked regular humans (non-experts) to take the test. The humans didn't have this problem. They could do the common sense part, the math part, and the combined part with almost the same high accuracy. The AI, however, was falling apart.

Why is the AI failing? (The "Neuron" Mystery)

The researchers didn't just stop at the score; they looked inside the AI's "brain" (its neural network) to see what was going wrong. They found three main reasons:

  1. The "Unfamiliar Recipe" Problem:
    Imagine a chef who is a master at making pizza and a master at making pasta. But if you ask them to make a "Pasta-Pizza" (a weird hybrid dish), they freeze: they have never seen a recipe that mixes those two things. Likewise, the AI's training data is full of math questions and common sense questions, but rarely both mixed together in one task, so it doesn't know how to "switch gears" between them.

  2. The "One-Track Mind" Problem:
    When the AI tries to solve the mixed problem, it gets confused and effectively shuts off one part of its brain.

    • The Metaphor: Imagine you are driving a car that has two engines: one for steering (common sense) and one for speed (math). When you try to drive a tricky road, the car decides to turn off the steering engine and just floor the gas pedal. It zooms forward with perfect math but crashes because it forgot to steer around the common-sense obstacles.
    • The study found that when solving these mixed tasks, the AI mostly activated the "math neurons" and completely ignored the "common sense neurons."
  3. The "Hallucination" Effect:
    Because the AI is ignoring the common sense part, it starts making things up. It might confidently calculate the cost of mopping a carpet, even though carpets can't be mopped. It sounds confident and logical, but it's completely wrong because it lost the context.

What Does This Mean for the Future?

This paper is a wake-up call. It shows that even the smartest AI models today are brittle. They are like a student who can ace a math test and a history test separately, but if you give them a word problem that requires history knowledge to solve the math, they panic.

The Takeaway:
To build truly helpful AI agents (robots that can plan your trip, manage your budget, and cook your dinner), we can't just make them smarter at math or better at facts. We need to teach them how to blend these skills together. We need to train them to keep both their "fast brain" and "slow brain" active at the same time.

Until then, if you ask an AI to do something complex that mixes logic and real-world rules, you might want to double-check its work!