Agentified Assessment of Logical Reasoning Agents

This paper introduces an agentified assessment framework that uses an assessor agent to make the evaluation of logical reasoning systems reproducible and robust. The authors demonstrate the framework by benchmarking an auto-formalization agent, which reaches 86.70% accuracy on a solver-verified version of the FOLIO dataset and significantly outperforms a chain-of-thought baseline.

Zhiyu Ni, Yifeng Xiao, Zheng Liang

Published 2026-03-10

Imagine you are trying to judge how good a new robot is at solving logic puzzles. Usually, when we test robots, we just look at the final answer: "Did they get it right or wrong?" But this paper argues that's like judging a chef only by whether the food tastes good, without checking if they burned the kitchen down, forgot the salt, or used the wrong oven.

Here is a simple breakdown of what the researchers did, using some everyday analogies.

1. The Problem: The "Black Box" Judge

In the past, testing AI agents (smart computer programs) was messy. If an AI failed, the test script would just say "Error." Was the AI stupid? Did it run out of time? Did it crash because of a typo? We didn't know. It was like a teacher grading a test but only giving a score of "0" without telling you if the student forgot to write their name, ran out of time, or actually got the math wrong.

2. The Solution: The "Referee Agent"

The authors created a new way to test AI called Agentified Assessment.

  • The Old Way: A rigid script that runs the test. If the script breaks, the whole test breaks.
  • The New Way: They created a Referee Agent. Think of this as a human-like referee in a sports game.
    • The Player (the AI being tested) just needs to know how to talk to the referee.
    • The Referee (the Assessor Agent) is in charge. It hands out the puzzle, sets a timer, watches the player work, and if the player crashes or talks nonsense, the Referee notes exactly why (e.g., "Timeout," "Syntax Error," or "Logic Flaw").
    • The Benefit: The Player doesn't need to change their internal brain to fit a specific test. They just need to know the "Referee Language." This makes testing fairer and easier to repeat.
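The referee loop described above can be sketched in a few lines of plain Python. Everything here is illustrative rather than the paper's actual interface: the function name `assess`, the outcome labels, and the timeout mechanism are assumptions; the Player is just any callable that maps a puzzle to an answer.

```python
import concurrent.futures

def assess(play, puzzle, expected, time_limit=5.0):
    """Run one Player attempt and record *why* it failed, not just that it did."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(play, puzzle)
    try:
        answer = future.result(timeout=time_limit)
    except concurrent.futures.TimeoutError:
        return {"outcome": "Timeout"}               # ran out of time
    except SyntaxError:
        return {"outcome": "Syntax Error"}          # produced broken code
    except Exception as err:
        return {"outcome": "Crash", "detail": repr(err)}  # any other failure
    finally:
        pool.shutdown(wait=False)
    if answer == expected:
        return {"outcome": "Correct"}
    return {"outcome": "Logic Flaw", "detail": answer}    # wrong but well-formed
```

The point of the design is visible in the return values: instead of a single pass/fail bit, every run yields a labeled failure mode, so results can be aggregated and reproduced.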

3. The Puzzle: Fixing the "Textbook" (Data Cleaning)

Before testing, the researchers looked at the puzzle book they were using (called FOLIO). They found that the book had some errors. Sometimes the "correct" answer in the back of the book didn't actually match the puzzle, or the translation from English to "Logic Language" was broken.

  • The Analogy: Imagine a math textbook where the answers in the back are wrong, or the numbers in the problems are typos. If you use that book to test students, the results are useless.
  • The Fix: They built a pipeline (a cleaning crew) that used a super-smart logic machine (a theorem prover) to check every single puzzle. If the answer didn't match the logic, they fixed the puzzle or flagged it for a human to review. This created a "Gold Standard" version of the test.
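The cleaning step boils down to two entailment checks per puzzle. A minimal sketch of the decision logic, assuming the theorem-prover calls have been abstracted into two booleans (the function names and label strings here are illustrative, not the paper's code):

```python
def derive_label(entails_conclusion: bool, entails_negation: bool) -> str:
    """Map two theorem-prover verdicts onto FOLIO's three labels."""
    if entails_conclusion:
        return "True"       # the premises prove the conclusion
    if entails_negation:
        return "False"      # the premises prove the opposite: a contradiction
    return "Uncertain"      # the premises leave the conclusion open

def label_matches(stored_label: str, entails_c: bool, entails_not_c: bool) -> bool:
    """Flag puzzles whose answer in 'the back of the book' disagrees with the solver."""
    return derive_label(entails_c, entails_not_c) == stored_label
```

Any puzzle where `label_matches` comes back false is a suspect "textbook error": either the stored answer or the logic translation is wrong, and it gets fixed or routed to a human reviewer.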

4. The Contest: Two Different Strategies

They tested two different AI agents on this cleaned-up test:

  • Contestant A (The "Talker"): This agent uses Chain-of-Thought. It's like a student who talks through the problem out loud: "Well, if A is true, then B must be true, so the answer is..." It relies on the AI's intuition and language skills.
  • Contestant B (The "Coder"): This agent uses Auto-Formalization. Instead of just talking, it translates the English puzzle into strict computer code (Z3Py, a Python interface to the Z3 theorem prover) that a logic machine can run. It's like a student who refuses to guess; instead, they write out a strict proof and run it through a calculator to get the answer. If the code crashes, it tries to fix the typo and run it again.
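The "fix the typo and run it again" behavior can be sketched as a small execute-and-repair loop. The names here are hypothetical: `run_with_repair`, the `answer` variable the generated code is assumed to set, and the `repair` callback (which in the real agent would re-prompt the model with the error message) are all illustrative, not the paper's implementation.

```python
def run_with_repair(code: str, repair, max_attempts: int = 3):
    """Execute generated solver code; on failure, ask a repair step to patch it."""
    for _ in range(max_attempts):
        scope = {}
        try:
            exec(code, scope)               # run the generated Z3Py-style script
            return scope.get("answer")      # convention: script stores its verdict here
        except Exception as err:
            code = repair(code, err)        # e.g. re-prompt the LLM with the traceback
    return None                             # give up after max_attempts
```

The loop separates two failure modes the assessor cares about: code that never runs (repaired and retried) versus code that runs and yields a wrong verdict (a genuine logic flaw).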

5. The Results: Why the "Coder" Won

The results were clear:

  • The Talker got about 74% of the answers right.
  • The Coder got about 87% of the answers right.

Why did the Coder win?
The biggest jump was in the "False" category (contradictions).

  • Analogy: Imagine a trick question: "All cats are dogs. My pet is a cat. Therefore, my pet is not a dog."
    • The Talker might get confused by the weird premises and guess.
    • The Coder translates everything into strict rules. The logic machine immediately sees that the claimed conclusion contradicts the premises and says, "This is impossible!", so the correct label is "False."

The "Coder" was much better at spotting when something was logically impossible or uncertain, because it didn't rely on "feeling" the answer; it relied on running the logic through a machine that can't be tricked.

The Big Takeaway

This paper shows two main things:

  1. We need better referees: We should stop using rigid scripts to test AI and start using "Referee Agents" that can tell us exactly how and why an AI failed, not just that it failed.
  2. Logic beats guessing: When it comes to hard logic puzzles, translating the problem into strict code and letting a computer solve it is much more reliable than just asking the AI to "think" about it in plain English.

In short: If you want an AI to be a logician, don't just ask it to chat; give it a calculator and a strict rulebook.