Imagine you have a very talented, fast-talking assistant who knows a lot of facts but also has a bad habit: they love to fill in the blanks, even when they don't actually know the answer. If you ask them a question they can't answer from your notes, they might just make something up that sounds really confident and logical. In the world of AI, this is called a hallucination.
This paper proposes a new way to stop that from happening. Instead of just hoping the AI tells the truth, the authors built a "security system" that acts like a bouncer at a club, checking if the AI's answer is actually backed up by evidence before letting it speak.
Here is the breakdown of their idea, using simple analogies:
1. The Core Problem: The "Confident Liar"
The authors argue that hallucination isn't just about the AI getting the facts wrong. It's about a boundary error.
- The Analogy: Imagine a student taking a test. If they don't know the answer, they should raise their hand and say, "I don't know." But this AI student is trained to be fluent and confident. So, when they don't know the answer, they just write down a guess and hand it in as if it's a fact.
- The Mistake: The AI treats its own "guessing" (internal thoughts) as if it were "evidence" (external facts). It crosses the line from "I'm guessing" to "Here is the truth" without realizing it.
2. The Solution: A Two-Layer Security System
The authors tried two different ways to stop the AI from lying, and found that neither one worked perfectly on its own. They had to combine them into a "Composite Architecture."
Layer A: The "Ask Yourself" Check (Instruction-Based Refusal)
This is like telling the AI: "Hey, if you don't have the answer in your notes, please say 'I don't know' instead of making something up."
- How it works: It relies on the AI's own internal sense of what it knows.
- The Flaw: Sometimes the AI is too confident. It thinks it knows the answer even when it doesn't (like a student who is sure they are right but is actually wrong). Also, some models get too scared and say "I don't know" even when they do know the answer (over-cautiousness).
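To make Layer A concrete, here is a minimal sketch of what a refusal instruction might look like. The exact prompt wording is an assumption for illustration; the paper's actual instruction may differ.

```python
# Hypothetical sketch of an instruction-based refusal prompt (Layer A).
# The wording here is an assumption, not the paper's actual prompt.

def build_refusal_prompt(notes: str, question: str) -> str:
    """Wrap the question in instructions that tell the model to abstain
    when the supplied notes do not contain the answer."""
    return (
        "Answer the question using ONLY the notes below.\n"
        "If the notes do not contain the answer, reply exactly: I don't know.\n\n"
        f"Notes:\n{notes}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_refusal_prompt(
    notes="The Eiffel Tower is in Paris.",
    question="Where is the Eiffel Tower?",
)
```

The key point is that this layer is purely an instruction: whether the model actually obeys it depends entirely on the model's own self-knowledge, which is exactly the flaw described above.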
Layer B: The "Fact-Checker" Gate (Structural Abstention)
This is a separate, mechanical gate that doesn't ask the AI what it thinks. Instead, it runs a quick math check on the answer before it's released. It looks at three signals:
- Consistency: If you ask the same question three times, does the AI give the same answer?
- Stability: If you rephrase the question, does the answer stay the same?
- Citation: Can the AI point to the specific sentence in your notes that supports its answer?
If the answer fails these checks (the "Support Deficit Score" is too high), the gate slams shut and blocks the output.
- The Flaw: If the AI is a "confident liar" that picks one side of a conflict and sticks to it perfectly, this gate gets fooled. The AI looks consistent and stable, so the gate lets the lie through.
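The three signals above can be sketched as a small scoring function. Note that the real Support Deficit Score formula is not spelled out here; this simple average of three penalty terms, and the 0.5 threshold, are assumptions for illustration only.

```python
# Hypothetical sketch of the structural "fact-checker" gate (Layer B).
# The actual Support Deficit Score formula is an assumption: here we just
# average three penalty signals and compare against a made-up threshold.
from collections import Counter

def support_deficit(samples, paraphrase_answers, answer, notes, threshold=0.5):
    # Consistency: fraction of repeated samples that disagree with the
    # most common answer.
    mode, count = Counter(samples).most_common(1)[0]
    inconsistency = 1 - count / len(samples)
    # Stability: fraction of answers to rephrased questions that differ.
    instability = sum(a != mode for a in paraphrase_answers) / len(paraphrase_answers)
    # Citation: penalty of 1 if the answer cannot be located in the notes.
    no_citation = 0.0 if answer.lower() in notes.lower() else 1.0
    score = (inconsistency + instability + no_citation) / 3
    return score, score <= threshold  # (deficit score, allowed to speak?)

score, allowed = support_deficit(
    samples=["Paris", "Paris", "Paris"],
    paraphrase_answers=["Paris", "Paris"],
    answer="Paris",
    notes="The Eiffel Tower is in Paris.",
)
```

This also makes the flaw visible: a "confident liar" that repeats the same wrong answer across samples and paraphrases scores low on inconsistency and instability, so only the citation check stands between it and the exit.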
3. The Winning Combo: The Composite System
The authors realized that Layer A and Layer B catch different types of mistakes.
- Layer A catches the AI when it's too scared to answer or when it's confused by conflicting notes.
- Layer B catches the AI when it's being too confident and ignoring the instructions.
The Result: By using both at the same time (like having a human teacher and a strict proctor), they achieved near-perfect results:
- Accuracy: 96–98% of the time, the AI gave the right answer or correctly said "I don't know."
- Hallucinations: Dropped to almost 0% (0–4%).
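The composite decision boils down to releasing an answer only when both layers let it through. Here is a hypothetical sketch, assuming each layer reduces to a simple boolean verdict (the function and argument names are illustrative, not from the paper):

```python
# Hypothetical sketch of the composite system: the answer is released only
# if BOTH Layer A and Layer B pass it. Names are assumptions for illustration.

def composite_answer(model_abstained: bool, gate_blocked: bool, answer: str) -> str:
    # Layer A: the model itself chose to say "I don't know".
    if model_abstained:
        return "I don't know"
    # Layer B: the structural gate found too little supporting evidence.
    if gate_blocked:
        return "I don't know"
    # Both layers passed: release the answer.
    return answer

print(composite_answer(False, False, "Paris"))    # both layers pass
print(composite_answer(False, True, "Atlantis"))  # blocked by the gate
```

Because the two layers catch different failure modes, an error only slips through when it fools both checks at once, which is why the combined numbers are so much better than either layer alone.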
4. The Stress Test: The "TruthfulQA" Challenge
To make sure this system wasn't just lucky, they tested it on a "No-Context" challenge. They gave the AI 100 tricky questions with no notes at all and told it to answer.
- The Goal: The AI should say "I don't know" 100% of the time because there is no evidence.
- The Result:
  - The "Ask Yourself" method failed on the weaker AI model (it lied 38% of the time).
  - The "Fact-Checker" gate worked perfectly (it blocked 100% of the lies).
  - The Combined System worked perfectly across all models.
The Big Takeaway
You can't just tell an AI to "be honest" (it might ignore you or get overconfident), and you can't just rely on a math check (it might get fooled by a confident liar).
The solution is a team effort: Use the AI's own instructions to catch some errors, and use a mechanical "safety gate" to catch the rest. It's like having a driver who promises to drive safely, plus an automatic braking system that stops the car if it detects a hazard the driver missed.
The Catch: This system is expensive. To check the answer, the AI has to "think" about the question multiple times (to check consistency) and run extra calculations. It's like hiring three different experts to double-check one answer. It's great for high-stakes situations (like medical or legal advice) where being wrong is dangerous, but maybe too slow for casual chat.