Imagine you have hired a brilliant, incredibly fast, but slightly scatterbrained assistant to help you run a factory. This assistant (the Large Language Model or LLM) can write reports, plan schedules, and diagnose machine problems in seconds.
However, there's a catch: this assistant is prone to hallucinations. They don't just make up facts; they make up facts that sound perfect. They might tell you a machine part is broken when it's fine, or suggest a repair that would actually destroy the machine. In a factory, this isn't just annoying; it's dangerous and expensive.
The paper you shared is like a mechanic's guidebook for five different ways to train this assistant to stop guessing and start being reliable, without firing them or rebuilding their brain.
Here is a simple breakdown of the problem and the five solutions they tested, using everyday analogies.
The Problem: The "Confident Guess"
Imagine asking your assistant to plan a 3-day camping trip.
- The Issue: If you ask them once, they might say, "Bring a tent, a sleeping bag, and a canoe."
- The Hallucination: If you ask them again, they might say, "Bring a tent, a sleeping bag, and a submarine."
- The Risk: In a factory, if the assistant tells you to use a "submarine" (a wrong part) instead of a "canoe" (the right part), the whole operation fails. The paper calls this a lack of Epistemic Stability: the inability to get the same, reliable answer twice.
The researchers tested five "prompt engineering" tricks (ways of asking the question) to fix this.
The Five Strategies (The "Fixes")
1. M1: The "Echo Chamber" (Iterative Similarity)
- The Idea: Ask the assistant the same question five times in a row. If they give you five slightly different answers, keep asking until they finally agree with themselves.
- The Analogy: Imagine asking a friend, "What time is the movie?" five times. If they say "7 PM," then "7:05," then "7," and finally "7 PM" again, you know they are settled on the answer.
- The Result: It worked okay (75% success). But sometimes, the assistant could agree with themselves on the wrong answer. It's like five friends all agreeing that the sky is green because they are all looking at a green filter.
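The "ask until it agrees with itself" loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `ask_llm` is a stub standing in for a real model call, and its canned answers exist only to show the control flow.

```python
import difflib

def ask_llm(prompt, attempt):
    # Stub for a real model call. The canned answers "settle" after a
    # couple of tries, mimicking a model converging on one response.
    answers = ["Bring a tent, sleeping bag, canoe.",
               "Bring a tent, a sleeping bag and a canoe.",
               "Bring a tent, a sleeping bag and a canoe."]
    return answers[min(attempt, len(answers) - 1)]

def stable_answer(prompt, threshold=0.9, max_rounds=5):
    """Re-ask until two consecutive answers are nearly identical."""
    previous = ask_llm(prompt, 0)
    for attempt in range(1, max_rounds):
        current = ask_llm(prompt, attempt)
        similarity = difflib.SequenceMatcher(None, previous, current).ratio()
        if similarity >= threshold:
            return current  # the model has "agreed with itself"
        previous = current
    return previous  # never converged: fall back to the last answer
```

Note the weakness the paper found: the loop only checks that answers *agree*, not that they are *right*, so a consistently wrong answer passes.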
2. M2: The "Translator" (Decomposed Prompting)
- The Idea: Instead of asking for a whole complex plan at once, break it down. First, ask the assistant to just list the facts. Then, ask them to write the story based only on those facts.
- The Analogy: Imagine asking a chef to "Make a lasagna." They might forget the cheese. Instead, you say: "First, list the ingredients you need. Second, write the recipe based only on that list."
- The Result: Surprisingly, this failed at first (34% success). Why? Because the "translator" step accidentally threw away important details (like "don't forget the cheese") while listing the facts.
- The Fix (v2): They changed the rule: "List the facts, but keep the original instructions as a checklist so you don't forget anything." This turned the failure into a huge success (80% success).
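The two-step flow, including the v2 fix of carrying the original instructions along as a checklist, looks roughly like this. Again `ask_llm` is a placeholder for a real model call; the prompt wording is an illustrative assumption, not the paper's exact prompts.

```python
def ask_llm(prompt):
    # Stub for a real model call; it just echoes enough structure
    # to demonstrate the two-step flow.
    return f"[model response to: {prompt[:40]}...]"

def decomposed_answer(task, source_text):
    # Step 1: the "translator" extracts only the facts.
    facts = ask_llm(f"List only the verifiable facts in:\n{source_text}")
    # Step 2 (the v2 fix): generate from those facts, but keep the
    # original task attached as a checklist so no requirement is lost.
    prompt = (
        f"Using ONLY these facts:\n{facts}\n\n"
        f"Checklist of original requirements (cover every item):\n{task}\n\n"
        "Write the final answer."
    )
    return ask_llm(prompt)
```

The v1 failure corresponds to dropping the "Checklist" section: step 2 then sees only the extracted facts, and any instruction the translator missed is gone for good.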
3. M3: The "Specialized Team" (Single-Task Agents)
- The Idea: Instead of one person doing everything (diagnosing, ranking severity, fixing, and writing the report), use four different people, each doing just one job.
- The Analogy: Imagine a car repair shop where one mechanic tries to diagnose the engine, fix the brakes, paint the car, and write the invoice all at once. They will get tired and make mistakes. Instead, have a Diagnostician, a Mechanic, a Painter, and a Clerk.
- The Result: This worked very well (80% success).
- The Fix (v2): They added a fifth person, a "Reconciler" or "Manager," whose only job is to check if the Diagnostician and the Mechanic are telling the same story. If the Diagnostician says "broken wheel" but the Mechanic says "flat tire," the Manager catches the contradiction. This boosted success to 100% in their small test.
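A toy version of the specialized pipeline, with the v2 "Reconciler" as a consistency gate, might look like this. Every role here is a stub (a real system would send role-specific prompts to a model), and the string-matching reconciliation is a deliberately simple stand-in for whatever check the paper's Reconciler performs.

```python
def run_agent(role, payload):
    # Stub for one specialized LLM call per role.
    handlers = {
        "diagnose": lambda d: {"fault": "bearing wear", "evidence": d},
        "rank":     lambda d: {"severity": "high"},
        "repair":   lambda d: {"action": "replace bearing"},
        "report":   lambda d: f"Fault: {d['fault']}; action: {d['action']}",
    }
    return handlers[role](payload)

def reconcile(diagnosis, repair):
    # The v2 "Reconciler": reject runs where the proposed fix
    # doesn't mention the diagnosed fault.
    fault_word = diagnosis["fault"].split()[0]  # e.g. "bearing"
    return fault_word in repair["action"]

def pipeline(sensor_data):
    diagnosis = run_agent("diagnose", sensor_data)
    severity  = run_agent("rank", diagnosis)
    repair    = run_agent("repair", diagnosis)
    if not reconcile(diagnosis, repair):
        raise ValueError("Diagnosis and repair tell different stories")
    return run_agent("report", {**diagnosis, **repair, **severity})
```

The point of the structure is that each agent's prompt stays small and single-purpose, and contradictions are caught between stages instead of being buried inside one long answer.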
4. M4: The "Cheat Sheet" (Enhanced Data Registry)
- The Idea: Don't just give the assistant raw numbers (like "Sensor A = 100"). Give them a dictionary that explains what those numbers mean in plain English.
- The Analogy: Imagine giving a student a math test with just the numbers "5, 10, 15" and asking for the answer. They might guess. Now, give them a cheat sheet that says "5 = Apples, 10 = Oranges, 15 = Bananas." Suddenly, the answer is obvious.
- The Result: This was the biggest winner (100% success). By giving the assistant a clear map of what the machine parts actually are, they stopped guessing.
- The Caveat: The researchers noted that the answers were also longer and more detailed, which might have tricked the "Judge" (another AI) into thinking they were better just because they looked more professional.
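The "cheat sheet" idea is essentially a lookup table applied to raw sensor tags before they reach the model. A minimal sketch, where the registry contents and tag names are invented for illustration:

```python
# Hypothetical registry mapping raw sensor tags to plain-English meaning.
REGISTRY = {
    "TT-101": {"name": "bearing temperature", "unit": "°C",   "normal": "< 70"},
    "VS-204": {"name": "shaft vibration",     "unit": "mm/s", "normal": "< 4.5"},
}

def enrich(raw_readings):
    """Turn 'TT-101 = 92' into text the model can actually reason about."""
    lines = []
    for tag, value in raw_readings.items():
        meta = REGISTRY.get(tag, {"name": "unknown sensor", "unit": "", "normal": "?"})
        lines.append(f"{tag} ({meta['name']}): {value} {meta['unit']}"
                     f" [normal: {meta['normal']}]")
    return "\n".join(lines)
```

Feeding the model `enrich({"TT-101": 92})` instead of the bare pair `TT-101 = 92` is the whole trick: the model no longer has to guess what the tag means or whether the value is alarming.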
5. M5: The "Dictionary" (Domain Glossary)
- The Idea: Industrial machines use weird abbreviations (like "AHU" or "VFD"). The assistant might not know what these mean. So, give them a mini-dictionary at the start of the conversation.
- The Analogy: If you ask a general doctor about "VFD" (Variable Frequency Drive), they might be confused. If you hand them a card that says "VFD = A motor speed controller," they can do their job.
- The Result: This worked well (77% success).
- The Fix (v2): They tried to be smarter by only giving the dictionary entries relevant to the specific question, rather than the whole book. It worked okay, but the sample size was too small to be sure.
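Prepending a glossary, including the v2 idea of selecting only the entries the question actually mentions, can be sketched like this. The glossary contents and the simple substring match are illustrative assumptions; a real system might use smarter retrieval.

```python
GLOSSARY = {
    "AHU": "Air Handling Unit: circulates and conditions air in a building",
    "VFD": "Variable Frequency Drive: controls motor speed",
    "PLC": "Programmable Logic Controller: industrial control computer",
}

def with_glossary(question, glossary=GLOSSARY):
    # v2 fix: include only the entries actually mentioned in the
    # question, instead of the whole dictionary.
    relevant = {k: v for k, v in glossary.items() if k in question}
    header = "\n".join(f"{k} = {v}" for k, v in relevant.items())
    return f"Glossary:\n{header}\n\nQuestion: {question}"
```

So `with_glossary("Why is the VFD tripping?")` hands the model the VFD definition, and only that one, before it ever sees the question.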
The "Judge" and the Final Verdict
To see which method worked, the researchers used a Judge.
- The Problem: They used the same type of AI to act as the Judge. It's like asking a student to grade their own homework.
- The Bias: The Judge tended to like answers that were longer and more structured. This might have unfairly boosted the "Cheat Sheet" method (M4) because it naturally produced longer answers.
- The Human Check: They did a quick human review, and the humans agreed that the "Cheat Sheet" method actually did produce better, more useful answers, not just longer ones.
The Takeaway for Everyday Life
You don't need to be a computer scientist to use these ideas. If you are using AI tools for important work:
- Don't just ask once. If the stakes are high, ask the AI to check its own work or ask the same question twice to see if the answer changes.
- Give context, not just questions. Don't just say "Fix the machine." Say "Here is the machine manual, here is the error code, and here is what the error code means."
- Break big tasks into small ones. Don't ask an AI to "Write a business plan and fix the code." Ask it to "List the business risks," then "Write the plan," then "Fix the code."
- Define your jargon. If you use industry slang, give the AI a quick glossary so it doesn't guess.
In short: AI is a powerful tool, but it's like a very smart intern who needs clear instructions, a good dictionary, and a manager to check their work before they go out and do something critical. This paper shows us exactly how to be that good manager.