Consistency-Guided Decoding with Proof-Driven Disambiguation for Three-Way Logical Question Answering

The paper introduces CGD-PD, a lightweight test-time framework that enhances three-way logical question answering by enforcing negation consistency and employing proof-driven disambiguation to resolve uncertainty, thereby significantly improving accuracy and reducing "Unknown" predictions across frontier large language models.

Tianyi Huang, Ming Hou, Jiaheng Su, Yutong Zhang, Ziling Zhang

Published 2026-04-09

Imagine you are hiring a very smart, but slightly nervous, detective to solve a mystery. You give them a set of clues (the Premises) and a specific theory about what happened (the Hypothesis).

Your detective has three possible answers:

  1. "Yes, it's true" (The clues prove the theory).
  2. "No, it's false" (The clues prove the theory is wrong).
  3. "I don't know" (The clues aren't enough to decide).

The problem is, modern AI detectives (Large Language Models) are great at solving easy cases, but they often get tripped up by two specific bugs:

The Two Bugs in the Detective's Brain

1. The "Flip-Flop" Bug (Negation Inconsistency)
Imagine you ask the detective: "Is it true that the butler did it?" They say, "Yes."
Then, you immediately ask: "Is it true that the butler did NOT do it?"
A logical human would say, "Well, if he did it, then he didn't not do it, so the answer is No."
But the AI detective might get confused and say, "Yes, he didn't do it" or "I don't know." It's like a person who can't keep their story straight when you ask the same question in a slightly different way.

2. The "Cowardly" Bug (Epistemic Unknown)
Sometimes, the clues are actually clear enough to solve the case. But the detective is a bit nervous or unsure, so they play it safe and say, "I don't know." They hide behind an "I don't know" shield even when they have enough evidence to make a call. This makes the detective look cautious, but useless.


The Solution: The "Double-Check" System (CGD-PD)

The authors of this paper invented a clever, lightweight system called CGD-PD. Think of it not as a new detective, but as a strict supervisor who stands next to the detective during the interrogation.

Here is how the supervisor works, step-by-step:

Step 1: The "Mirror Test" (Consistency Check)

Instead of asking the detective about the theory once, the supervisor asks twice:

  1. "Is the theory True?"
  2. "Is the opposite of the theory True?" (This is the mechanical negation of the hypothesis.)

The supervisor checks the answers. If the detective says "True" to the first and "False" to the second, the answers match the laws of logic, and the supervisor accepts them.
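The mirror test boils down to one tiny check. Here is a minimal sketch; the label names and the `NEGATED` mapping are our illustration of the idea, not code from the paper:

```python
# Sketch of the "mirror test": a verdict on the theory and a verdict on
# its negation are only accepted together if they are logical mirrors
# of each other (True <-> False, Unknown <-> Unknown).
NEGATED = {"True": "False", "False": "True", "Unknown": "Unknown"}

def is_negation_consistent(label_for_h: str, label_for_not_h: str) -> bool:
    """Return True if the two three-way answers obey negation consistency."""
    return NEGATED[label_for_h] == label_for_not_h

# "The butler did it" -> True paired with "the butler did NOT do it" -> False
# passes the test; answering True to both is exactly the flip-flop bug.
```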

Step 2: The "Pushy" Prompt (Fixing the Coward)

If the detective says "I don't know" to one of the questions, the supervisor doesn't just give up. They lean in and say:
"Okay, you say you don't know. But if you really don't know, tell me exactly which clue is missing. If you can't point to a missing clue, then you actually do know the answer. Pick one!"

This forces the detective to either admit they truly lack information or to finally commit to a "Yes" or "No."
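In code, this step is one extra round trip with a challenge appended. The wording below is our paraphrase of the blog's description (not the paper's actual prompt), and `ask_model` stands in for a hypothetical LLM call:

```python
# Illustrative follow-up for an "Unknown" answer (our paraphrase, not the
# paper's exact wording). `ask_model` is a hypothetical LLM call that
# returns one of the three-way labels as a string.
CHALLENGE_PROMPT = (
    "You answered 'Unknown'. If the premises truly leave this undecided, "
    "state exactly which fact is missing. If you cannot name a missing "
    "fact, commit to 'True' or 'False' instead."
)

def challenge_unknown(ask_model, question: str) -> str:
    """Re-ask once with the challenge appended; keep 'Unknown' only if
    the model still refuses to commit to a definite answer."""
    answer = ask_model(question + "\n\n" + CHALLENGE_PROMPT)
    return answer if answer in {"True", "False"} else "Unknown"
```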

Step 3: The "Sniper" Questions (Proof-Driven Disambiguation)

If the detective is still stuck saying "I don't know" to both sides, the supervisor switches tactics. Instead of asking for a full essay, they ask simple Yes/No questions:

  • "Do the clues prove the butler did it? Yes or No?"
  • "Do the clues prove the butler didn't do it? Yes or No?"

Because these questions are simpler (binary), the detective is less likely to get confused or play it safe. If the answer to the first is "Yes," the supervisor declares the case solved.
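The two sniper questions reduce the three-way problem to two booleans, and mapping those back to a verdict is mechanical. A sketch (the "Yes to both" case is handed off to the referee in the next step):

```python
def combine_binary_answers(proves_h: bool, proves_not_h: bool) -> str:
    """Map the answers to 'do the clues prove H?' and 'do the clues prove
    NOT-H?' onto a three-way verdict. Yes to both is contradictory and
    is left for the adjudication step."""
    if proves_h and proves_not_h:
        return "Contradiction"  # passed on to the referee (Step 4)
    if proves_h:
        return "True"
    if proves_not_h:
        return "False"
    return "Unknown"  # neither side is provable from the clues
```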

Step 4: The "Referee" (Adjudication)

If the detective gives two strong but contradictory answers (e.g., "Yes" to both), the supervisor acts as a referee. They look at the two answers and pick the one that makes the most logical sense, throwing out the contradictory one.
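One simple way a referee could work is to keep whichever side the model backed more strongly. The confidence-based tie-break below is our assumption about how such an adjudicator might be built, not the paper's exact rule:

```python
NEGATED = {"True": "False", "False": "True", "Unknown": "Unknown"}

def adjudicate(label_for_h: str, label_for_not_h: str,
               conf_h: float, conf_not_h: float) -> str:
    """Resolve the two verdicts on H and NOT-H into one final answer.
    Assumption: per-answer confidence scores are available (e.g. from
    token probabilities); the paper's actual rule may differ."""
    if NEGATED[label_for_h] == label_for_not_h:
        return label_for_h  # already consistent, nothing to referee
    if conf_h >= conf_not_h:
        return label_for_h  # trust the direct verdict on H
    return NEGATED[label_for_not_h]  # mirror the stronger NOT-H verdict
```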


Why This Matters

The paper tested this system on a famous logic puzzle dataset called FOLIO.

  • The Result: By adding this "supervisor" layer, the AI got significantly smarter.
    • On one model (GPT-5.2), accuracy went up by 4.4%.
    • On another model (Claude Sonnet), accuracy jumped by 6.9%.
  • The Trade-off: The system has to ask the AI a few more questions (about 4 or 5 instead of just 1). But the paper argues that for important tasks, spending a little extra "brain power" to stop the AI from being confused or cowardly is totally worth it.

The Big Takeaway

This paper shows that we don't always need to build a bigger, more expensive AI to make it smarter. Sometimes, we just need to ask the right questions in the right order and force the AI to be consistent with itself. It's like giving a nervous student a second chance to check their work before handing in the test, ensuring they didn't accidentally contradict themselves.
