Imagine you are asking a very smart, well-read librarian for advice on a complex medical question, like "What is the best treatment for a specific type of headache?"
In the past, if you asked a standard AI (like a basic chatbot), it might give you a confident, eloquent answer that sounds perfect but is actually made up. This is called a "hallucination." It's like the librarian confidently telling you a story about a fictional doctor they invented, just because they want to sound helpful.
VerifAI is a new, open-source system designed to fix this problem. Think of it not as a single librarian, but as a three-person investigative team working together to give you the truth.
Here is how the VerifAI team works, using a simple analogy:
1. The Researcher (Information Retrieval)
- The Job: Before answering, this team member runs to the library's massive archive (PubMed, which has millions of medical papers) to find the most relevant books.
- The Trick: They don't just look for exact word matches (like a simple keyword search). They use a "smart search" (often called semantic or vector search) that understands the meaning of your question, not just its words.
- The Result: They bring back the top 10 most relevant scientific abstracts (summaries of research papers) to the table.
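The Researcher's "smart search" can be sketched in a few lines of Python. Everything here is a toy: a real system embeds the query and millions of PubMed abstracts with a learned model, while this sketch uses made-up 3-number vectors and invented IDs.

```python
import math

def cosine(a, b):
    """Similarity between two vectors: 1.0 means 'same meaning'."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, abstracts, k=10):
    """Rank abstracts by semantic similarity to the query vector."""
    scored = [(cosine(query_vec, vec), pmid) for pmid, vec in abstracts.items()]
    scored.sort(reverse=True)
    return [pmid for _, pmid in scored[:k]]

# Toy 3-dimensional "embeddings" keyed by invented PubMed-style IDs.
abstracts = {
    "pmid_001": [0.9, 0.1, 0.0],   # about migraine treatment
    "pmid_002": [0.1, 0.9, 0.0],   # about knee surgery
    "pmid_003": [0.8, 0.2, 0.1],   # about headache drugs
}
query = [1.0, 0.0, 0.0]            # "best treatment for headache?"
print(top_k(query, abstracts, k=2))  # → ['pmid_001', 'pmid_003']
```

Note that the knee-surgery abstract scores low even though nothing here compares words directly: the ranking depends only on how close the meaning-vectors are.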
2. The Writer (Generative Component)
- The Job: This is the person who actually writes the answer for you. They read the 10 papers the Researcher brought back.
- The Rule: They are strictly forbidden from using their own memory or making things up. They can only write what they find in those 10 papers.
- The Superpower: Every single sentence they write must come with a "receipt." If they say "Drug X works," they must immediately attach a citation like (See Paper #12345).
- The Upgrade: The team trained this writer specifically to be a "citation pro." Unlike other AIs that might forget to cite their sources, this one is fine-tuned to always point to the evidence.
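The "every sentence needs a receipt" rule can also be enforced mechanically after the Writer finishes. The sketch below is a hypothetical checker, not VerifAI's actual code, and the (PUBMED:12345) citation format is invented for illustration.

```python
import re

# Illustrative citation marker: any "(PUBMED:<digits>)" tag in a sentence.
CITATION = re.compile(r"\(PUBMED:\d+\)")

def uncited_sentences(answer: str) -> list[str]:
    """Return the sentences that lack a citation and should be rejected."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not CITATION.search(s)]

answer = (
    "Drug X reduces migraine frequency (PUBMED:12345). "
    "It is also completely free of side effects."
)
print(uncited_sentences(answer))
# → ['It is also completely free of side effects.']
```

The second sentence has no receipt attached, so a checker like this would flag it before it ever reaches the Fact-Checker.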
3. The Fact-Checker (Verification Component)
- The Job: This is the most important part. Before the answer is shown to you, a strict auditor (the Fact-Checker) reads every single sentence the Writer produced.
- The Process: The Fact-Checker takes the sentence and the specific paper it claims to come from. They ask: "Does this paper actually prove this sentence?"
- Green Light: The paper supports the claim. (The sentence turns Green).
- Yellow Light: The paper is related but doesn't fully prove it. (The sentence turns Yellow).
- Red Light: The paper actually says the opposite, or the sentence isn't in the paper at all. (The sentence turns Red).
- The Magic: This Fact-Checker is so good at spotting unsupported claims that it outperforms even the most powerful general-purpose AI models (like GPT-4) at finding errors in medical texts. It's like hiring a specialized detective who knows the law better than a generalist police officer.
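The Fact-Checker's three-way verdict maps naturally onto standard natural-language-inference labels (entailment, neutral, contradiction). The sketch below is illustrative only: the real auditor is a fine-tuned model, so here the labels are assumed to arrive pre-computed and we just turn them into traffic lights.

```python
from enum import Enum

class Verdict(Enum):
    SUPPORTED = "green"       # the paper backs the claim
    PARTIAL = "yellow"        # related, but doesn't fully prove it
    CONTRADICTED = "red"      # the paper disagrees (or says nothing)

def traffic_light(nli_label: str) -> Verdict:
    """Map a raw NLI label onto the colour shown to the reader."""
    return {
        "entailment": Verdict.SUPPORTED,
        "neutral": Verdict.PARTIAL,
        "contradiction": Verdict.CONTRADICTED,
    }[nli_label]

def audit(sentences_with_labels):
    """Attach a colour to every (sentence, nli_label) pair."""
    return [(s, traffic_light(lbl).value) for s, lbl in sentences_with_labels]

report = audit([
    ("Drug X reduces migraine frequency.", "entailment"),
    ("Drug X has no known side effects.", "contradiction"),
])
for sentence, colour in report:
    print(colour, "→", sentence)
# green → Drug X reduces migraine frequency.
# red → Drug X has no known side effects.
```

Splitting the decision into a model (which produces the label) and a tiny mapping (which produces the colour) keeps the risky part, the classification, separate from the part the user sees.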
Why is this a big deal?
1. No More "Fake News" in Medicine
In the real world, getting medical advice from a hallucinating AI could be dangerous. VerifAI ensures that if you get an answer, you can see exactly where it came from. If a sentence is red, you know to ignore it. If it's green, you know it's backed by science.
2. It's a Team Sport, Not a Solo Act
Most AI systems try to do everything in one giant brain. VerifAI splits the work: one part finds info, one part writes, and one part checks. This makes the whole system more reliable.
3. It's Open Source (Free for Everyone)
The creators didn't lock this technology in a vault. They released the code, the models, and the data for free. This means any hospital, researcher, or developer can use it, tweak it, or build upon it to make their own "truthful" search engines.
The User Experience
When you use VerifAI, you don't just get a block of text. You get a color-coded report:
- Green sentences are safe and verified.
- Yellow sentences are related to their source but not fully proven by it.
- Red sentences are flagged as potential errors or unsupported claims.
- If you hover your mouse over a sentence, it shows you the exact line in the original scientific paper that proves it.
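Here is a toy version of that colour-coded, hover-for-evidence report, rendered as a plain HTML fragment. The markup and the evidence snippets are invented for illustration; the real VerifAI interface is a full web application, not this string-builder.

```python
def render(report):
    """Build HTML spans: text is coloured, and the evidence appears on hover
    via the standard 'title' tooltip attribute."""
    spans = []
    for sentence, colour, evidence in report:
        spans.append(
            f'<span style="color:{colour}" title="{evidence}">{sentence}</span>'
        )
    return " ".join(spans)

html = render([
    ("Drug X reduces migraine frequency.", "green", "pmid_001, sentence 3"),
    ("Drug X has no known side effects.", "red", "no supporting passage found"),
])
print(html)
```

Each sentence carries its verdict as a colour and its evidence pointer as a tooltip, which is exactly the reading experience described above: glance at the colour, hover for the proof.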
The Bottom Line
VerifAI is like giving your AI a magnifying glass and a red pen. It forces the AI to stop guessing and start proving. It turns a "black box" that might lie into a transparent, trustworthy tool that helps us navigate the complex world of medical science without getting misled.