Consistency-Guided Decoding with Proof-Driven Disambiguation for Three-Way Logical Question Answering

The paper introduces CGD-PD, a lightweight test-time framework that enhances three-way logical question answering by enforcing negation consistency and employing proof-driven disambiguation to resolve uncertainty, thereby significantly improving accuracy and reducing "Unknown" predictions across frontier large language models.

Tianyi Huang, Ming Hou, Jiaheng Su, Yutong Zhang, Ziling Zhang

Published 2026-04-09

Imagine you are hiring a very smart, but slightly nervous, detective to solve a mystery. You give them a set of clues (the Premises) and a specific theory about what happened (the Hypothesis).

Your detective has three possible answers:

  1. "Yes, it's true" (The clues prove the theory).
  2. "No, it's false" (The clues prove the theory is wrong).
  3. "I don't know" (The clues aren't enough to decide).

The problem is, modern AI detectives (Large Language Models) are great at solving easy cases, but they often get tripped up by two specific bugs:

The Two Bugs in the Detective's Brain

1. The "Flip-Flop" Bug (Negation Inconsistency)
Imagine you ask the detective: "Is it true that the butler did it?" They say, "Yes."
Then, you immediately ask: "Is it true that the butler did NOT do it?"
A logical human would say, "Well, if he did it, then he didn't not do it, so the answer is No."
But the AI detective might get confused and say, "Yes, he didn't do it" or "I don't know." It's like a person who can't keep their story straight when you ask the same question in a slightly different way.

2. The "Cowardly" Bug (Epistemic Unknown)
Sometimes, the clues are actually clear enough to solve the case. But the detective is a bit nervous or unsure, so they play it safe and say, "I don't know." They hide behind an "I don't know" shield even when they have enough evidence to make a call. This makes the detective look cautious, but useless.


The Solution: The "Double-Check" System (CGD-PD)

The authors of this paper invented a clever, lightweight system called CGD-PD. Think of it not as a new detective, but as a strict supervisor who stands next to the detective during the interrogation.

Here is how the supervisor works, step-by-step:

Step 1: The "Mirror Test" (Consistency Check)

Instead of asking the detective about the theory once, the supervisor asks twice:

  1. "Is the theory True?"
  2. "Is the opposite of the theory True?" (This is the mechanical negation of the hypothesis.)

The supervisor checks the answers. If the detective says "True" to the first and "False" to the second, the answers match the laws of logic, and the supervisor accepts them.
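The mirror test boils down to one tiny check. Here is a minimal sketch; the label names and the `NEGATED` mapping are our illustration of the idea, not code from the paper:

```python
# Sketch of the "mirror test": a verdict on the theory and a verdict on
# its negation are only accepted together if they are logical mirrors
# of each other (True <-> False, Unknown <-> Unknown).
NEGATED = {"True": "False", "False": "True", "Unknown": "Unknown"}

def is_negation_consistent(label_for_h: str, label_for_not_h: str) -> bool:
    """Return True if the two three-way answers obey negation consistency."""
    return NEGATED[label_for_h] == label_for_not_h

# "The butler did it" -> True paired with "the butler did NOT do it" -> False
# passes the test; answering True to both is exactly the flip-flop bug.
```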

Step 2: The "Pushy" Prompt (Fixing the Coward)

If the detective says "I don't know" to one of the questions, the supervisor doesn't just give up. They lean in and say:
"Okay, you say you don't know. But if you really don't know, tell me exactly which clue is missing. If you can't point to a missing clue, then you actually do know the answer. Pick one!"

This forces the detective to either admit they truly lack information or to finally commit to a "Yes" or "No."
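In code, this step is one extra round trip with a challenge appended. The wording below is our paraphrase of the blog's description (not the paper's actual prompt), and `ask_model` stands in for a hypothetical LLM call:

```python
# Illustrative follow-up for an "Unknown" answer (our paraphrase, not the
# paper's exact wording). `ask_model` is a hypothetical LLM call that
# returns one of the three-way labels as a string.
CHALLENGE_PROMPT = (
    "You answered 'Unknown'. If the premises truly leave this undecided, "
    "state exactly which fact is missing. If you cannot name a missing "
    "fact, commit to 'True' or 'False' instead."
)

def challenge_unknown(ask_model, question: str) -> str:
    """Re-ask once with the challenge appended; keep 'Unknown' only if
    the model still refuses to commit to a definite answer."""
    answer = ask_model(question + "\n\n" + CHALLENGE_PROMPT)
    return answer if answer in {"True", "False"} else "Unknown"
```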

Step 3: The "Sniper" Questions (Proof-Driven Disambiguation)

If the detective is still stuck saying "I don't know" to both sides, the supervisor switches tactics. Instead of asking for a full essay, they ask simple Yes/No questions:

  • "Do the clues prove the butler did it? Yes or No?"
  • "Do the clues prove the butler didn't do it? Yes or No?"

Because these questions are simpler (binary), the detective is less likely to get confused or play it safe. If the answer to the first is "Yes," the supervisor declares the case solved.
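The two sniper questions reduce the three-way problem to two booleans, and mapping those back to a verdict is mechanical. A sketch (the "Yes to both" case is handed off to the referee in the next step):

```python
def combine_binary_answers(proves_h: bool, proves_not_h: bool) -> str:
    """Map the answers to 'do the clues prove H?' and 'do the clues prove
    NOT-H?' onto a three-way verdict. Yes to both is contradictory and
    is left for the adjudication step."""
    if proves_h and proves_not_h:
        return "Contradiction"  # passed on to the referee (Step 4)
    if proves_h:
        return "True"
    if proves_not_h:
        return "False"
    return "Unknown"  # neither side is provable from the clues
```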

Step 4: The "Referee" (Adjudication)

If the detective gives two strong but contradictory answers (e.g., "Yes" to both), the supervisor acts as a referee. They look at the two answers and pick the one that makes the most logical sense, throwing out the contradictory one.
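One simple way a referee could work is to keep whichever side the model backed more strongly. The confidence-based tie-break below is our assumption about how such an adjudicator might be built, not the paper's exact rule:

```python
NEGATED = {"True": "False", "False": "True", "Unknown": "Unknown"}

def adjudicate(label_for_h: str, label_for_not_h: str,
               conf_h: float, conf_not_h: float) -> str:
    """Resolve the two verdicts on H and NOT-H into one final answer.
    Assumption: per-answer confidence scores are available (e.g. from
    token probabilities); the paper's actual rule may differ."""
    if NEGATED[label_for_h] == label_for_not_h:
        return label_for_h  # already consistent, nothing to referee
    if conf_h >= conf_not_h:
        return label_for_h  # trust the direct verdict on H
    return NEGATED[label_for_not_h]  # mirror the stronger NOT-H verdict
```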


Why This Matters

The paper tested this system on a famous logic puzzle dataset called FOLIO.

  • The Result: By adding this "supervisor" layer, the AI got significantly smarter.
    • On one model (GPT-5.2), accuracy went up by 4.4%.
    • On another model (Claude Sonnet), accuracy jumped by 6.9%.
  • The Trade-off: The system has to ask the AI a few more questions (about 4 or 5 instead of just 1). But the paper argues that for important tasks, spending a little extra "brain power" to stop the AI from being confused or cowardly is totally worth it.

The Big Takeaway

This paper shows that we don't always need to build a bigger, more expensive AI to make it smarter. Sometimes, we just need to ask the right questions in the right order and force the AI to be consistent with itself. It's like giving a nervous student a second chance to check their work before handing in the test, ensuring they didn't accidentally contradict themselves.
