Agentified Assessment of Logical Reasoning Agents

This paper introduces an agentified assessment framework that uses an assessor agent to make the evaluation of logical reasoning systems reproducible and robust. The authors demonstrate the framework by benchmarking an auto-formalization agent, which reaches 86.70% accuracy on a solver-verified version of the FOLIO dataset and significantly outperforms a chain-of-thought baseline.

Zhiyu Ni, Yifeng Xiao, Zheng Liang

Published 2026-03-10

Imagine you are trying to judge how good a new robot is at solving logic puzzles. Usually, when we test robots, we just look at the final answer: "Did they get it right or wrong?" But this paper argues that's like judging a chef only by whether the food tastes good, without checking if they burned the kitchen down, forgot the salt, or used the wrong oven.

Here is a simple breakdown of what the researchers did, using some everyday analogies.

1. The Problem: The "Black Box" Judge

In the past, testing AI agents (smart computer programs) was messy. If an AI failed, the test script would just say "Error." Was the AI stupid? Did it run out of time? Did it crash because of a typo? We didn't know. It was like a teacher grading a test but only giving a score of "0" without telling you if the student forgot to write their name, ran out of time, or actually got the math wrong.

2. The Solution: The "Referee Agent"

The authors created a new way to test AI called Agentified Assessment.

  • The Old Way: A rigid script that runs the test. If the script breaks, the whole test breaks.
  • The New Way: They created a Referee Agent. Think of this as a human-like referee in a sports game.
    • The Player (the AI being tested) just needs to know how to talk to the referee.
    • The Referee (the Assessor Agent) is in charge. It hands out the puzzle, sets a timer, watches the player work, and if the player crashes or talks nonsense, the Referee notes exactly why (e.g., "Timeout," "Syntax Error," or "Logic Flaw").
    • The Benefit: The Player doesn't need to change their internal brain to fit a specific test. They just need to know the "Referee Language." This makes testing fairer and easier to repeat.
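The referee loop described above can be sketched in a few lines of plain Python. Everything here is illustrative rather than the paper's actual interface: the function name `assess`, the outcome labels, and the timeout mechanism are assumptions; the Player is just any callable that maps a puzzle to an answer.

```python
import concurrent.futures

def assess(play, puzzle, expected, time_limit=5.0):
    """Run one Player attempt and record *why* it failed, not just that it did."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(play, puzzle)
    try:
        answer = future.result(timeout=time_limit)
    except concurrent.futures.TimeoutError:
        return {"outcome": "Timeout"}               # ran out of time
    except SyntaxError:
        return {"outcome": "Syntax Error"}          # produced broken code
    except Exception as err:
        return {"outcome": "Crash", "detail": repr(err)}  # any other failure
    finally:
        pool.shutdown(wait=False)
    if answer == expected:
        return {"outcome": "Correct"}
    return {"outcome": "Logic Flaw", "detail": answer}    # wrong but well-formed
```

The point of the design is visible in the return values: instead of a single pass/fail bit, every run yields a labeled failure mode, so results can be aggregated and reproduced.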

3. The Puzzle: Fixing the "Textbook" (Data Cleaning)

Before testing, the researchers looked at the puzzle book they were using (called FOLIO). They found that the book had some errors. Sometimes the "correct" answer in the back of the book didn't actually match the puzzle, or the translation from English to "Logic Language" was broken.

  • The Analogy: Imagine a math textbook where the answers in the back are wrong, or the numbers in the problems are typos. If you use that book to test students, the results are useless.
  • The Fix: They built a pipeline (a cleaning crew) that used a super-smart logic machine (a theorem prover) to check every single puzzle. If the answer didn't match the logic, they fixed the puzzle or flagged it for a human to review. This created a "Gold Standard" version of the test.
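The cleaning step boils down to two entailment checks per puzzle. A minimal sketch of the decision logic, assuming the theorem-prover calls have been abstracted into two booleans (the function names and label strings here are illustrative, not the paper's code):

```python
def derive_label(entails_conclusion: bool, entails_negation: bool) -> str:
    """Map two theorem-prover verdicts onto FOLIO's three labels."""
    if entails_conclusion:
        return "True"       # the premises prove the conclusion
    if entails_negation:
        return "False"      # the premises prove the opposite: a contradiction
    return "Uncertain"      # the premises leave the conclusion open

def label_matches(stored_label: str, entails_c: bool, entails_not_c: bool) -> bool:
    """Flag puzzles whose answer in 'the back of the book' disagrees with the solver."""
    return derive_label(entails_c, entails_not_c) == stored_label
```

Any puzzle where `label_matches` comes back false is a suspect "textbook error": either the stored answer or the logic translation is wrong, and it gets fixed or routed to a human reviewer.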

4. The Contest: Two Different Strategies

They tested two different AI agents on this cleaned-up test:

  • Contestant A (The "Talker"): This agent uses Chain-of-Thought. It's like a student who talks through the problem out loud: "Well, if A is true, then B must be true, so the answer is..." It relies on the AI's intuition and language skills.
  • Contestant B (The "Coder"): This agent uses Auto-Formalization. Instead of just talking, it translates the English puzzle into strict computer code (Z3Py, a Python interface to the Z3 theorem prover) that a logic machine can run. It's like a student who refuses to guess; instead, they write out a strict proof and run it through a calculator to get the answer. If the code crashes, it tries to fix the typo and run it again.
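The "fix the typo and run it again" behavior can be sketched as a small execute-and-repair loop. The names here are hypothetical: `run_with_repair`, the `answer` variable the generated code is assumed to set, and the `repair` callback (which in the real agent would re-prompt the model with the error message) are all illustrative, not the paper's implementation.

```python
def run_with_repair(code: str, repair, max_attempts: int = 3):
    """Execute generated solver code; on failure, ask a repair step to patch it."""
    for _ in range(max_attempts):
        scope = {}
        try:
            exec(code, scope)               # run the generated Z3Py-style script
            return scope.get("answer")      # convention: script stores its verdict here
        except Exception as err:
            code = repair(code, err)        # e.g. re-prompt the LLM with the traceback
    return None                             # give up after max_attempts
```

The loop separates two failure modes the assessor cares about: code that never runs (repaired and retried) versus code that runs and yields a wrong verdict (a genuine logic flaw).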

5. The Results: Why the "Coder" Won

The results were clear:

  • The Talker got about 74% of the answers right.
  • The Coder got about 87% of the answers right.

Why did the Coder win?
The biggest jump was in the "False" category (contradictions).

  • Analogy: Imagine a trick question: "All cats are dogs. My pet is a cat. Therefore, my pet is not a dog."
    • The Talker might get confused by the weird premises and guess.
    • The Coder translates everything into strict rules. The logic machine immediately sees that the claimed conclusion contradicts the premises and says, "This is impossible!", so the correct label is "False."

The "Coder" was much better at spotting when something was logically impossible or uncertain, because it didn't rely on "feeling" the answer; it relied on running the logic through a machine that can't be tricked.

The Big Takeaway

This paper shows two main things:

  1. We need better referees: We should stop using rigid scripts to test AI and start using "Referee Agents" that can tell us exactly how and why an AI failed, not just that it failed.
  2. Logic beats guessing: When it comes to hard logic puzzles, translating the problem into strict code and letting a computer solve it is much more reliable than just asking the AI to "think" about it in plain English.

In short: If you want an AI to be a logician, don't just ask it to chat; give it a calculator and a strict rulebook.