Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information

This paper introduces PubHealthBench, a new benchmark of over 8,000 questions derived from UK government guidance, to evaluate LLMs' knowledge of public health advice. It finds that while state-of-the-art models excel on multiple-choice questions, their performance on free-form responses remains weaker, highlighting the need for additional safeguards in real-world applications.

Joshua Harris, Fan Grayson, Felix Feldman, Timothy Laurence, Toby Nonnenmacher, Oliver Higgins, Leo Loman, Selina Patel, Thomas Finnie, Samuel Collins, Michael Borowitz

Published 2026-03-05

Imagine you've just hired a new, incredibly smart assistant named "The AI." You want to know if this assistant is ready to help you make important health decisions for your family. You ask, "Do you know the latest rules about flu shots?" or "What should I do if I get a rash?"

This paper is essentially a report card for 24 of these AI assistants, testing their knowledge of the UK Government's public health advice. The researchers from the UK Health Security Agency (UKHSA) built a giant, custom-made exam called PubHealthBench to see how well these AIs perform.

Here is the breakdown of their findings, using some simple analogies:

1. The Exam: A Massive Library Quiz

The researchers didn't just make up random questions. They took 687 official government health guidance documents (like a massive library of rulebooks) and used an automated AI pipeline to turn them into more than 8,000 questions.

  • The Multiple Choice Test (MCQA): This is like a standard school exam. The AI is given a question and a list of answers (A, B, C, D, etc.). It just has to pick the right one (see the sketch just after this list for what one of these questions might look like under the hood).
  • The Free-Form Test: This is like a conversation. The AI has to explain the answer in its own words, just like a real chatbot would.
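If you're curious what the multiple-choice setup looks like in practice, here is a minimal Python sketch of one benchmark item and how accuracy might be tallied. The field names, the example values, and the mcqa_accuracy helper are illustrative assumptions for this post, not the paper's actual data format or evaluation code.

```python
# Illustrative sketch only: field names and scoring logic are assumptions,
# not taken from the PubHealthBench codebase.
from dataclasses import dataclass

@dataclass
class MCQAItem:
    question: str       # the question generated from a guidance document
    options: list[str]  # the answer choices shown to the model
    answer_index: int   # index of the correct option, per the source guidance
    source_url: str     # the government guidance page the item was derived from

def mcqa_accuracy(items: list[MCQAItem], predictions: list[int]) -> float:
    """Fraction of items where the model picked the correct option."""
    correct = sum(1 for item, pred in zip(items, predictions)
                  if pred == item.answer_index)
    return correct / len(items)

# Example: one hypothetical item and one (correct) model prediction.
item = MCQAItem(
    question="Example question derived from a guidance document",
    options=["Option A", "Option B", "Option C", "Option D"],
    answer_index=1,
    source_url="https://www.gov.uk/",  # placeholder, not a real item source
)
print(mcqa_accuracy([item], [1]))  # -> 1.0
```

Note that the free-form test can't be marked this mechanically: an open-ended answer has to be judged against the guidance itself, which is part of why that setup is both harder to pass and harder to grade.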

2. The Results: The "Smart" vs. The "Chatty"

The Multiple Choice Test: The A-Students

When the AIs took the multiple-choice test, the top models (like GPT-4.5 and o1) were brilliant.

  • The Score: They got over 90% correct.
  • The Comparison: They actually did better than a regular human who was allowed to use Google to look up the answers (but wasn't a medical expert).
  • The Analogy: Imagine a student who has memorized the entire encyclopedia. If you ask, "What is the capital of France?" or "What is the rule for X?", they can instantly point to the right page. They are incredibly good at retrieving facts when given a hint (the multiple-choice options).

The Free-Form Test: The "Hallucinating" Storyteller

When the researchers asked the same AIs to explain the answers in their own words (without the multiple-choice hints), the scores dropped significantly.

  • The Score: No model scored above 75%.
  • The Problem: The AIs started to "hallucinate." This is like a student who knows the facts but gets nervous and starts making up details to sound smart. They might add extra advice that isn't in the official rules, or they might get the timing of a medical intervention slightly wrong.
  • The Analogy: Imagine asking that same encyclopedia-student to write an essay. Instead of just stating the facts, they might start rambling, inventing a story about a "new" rule that doesn't exist, or forgetting a crucial detail because they are trying to be too conversational.

3. The "Public" vs. The "Pro"

The researchers noticed something interesting about what the AIs knew best:

  • Public Advice: The AIs were great at answering questions meant for regular people (e.g., "When should I wash my hands?").
  • Clinical Advice: They struggled more with complex advice meant for doctors and specialists.
  • The Analogy: Think of the AI as a tour guide. They are excellent at telling tourists (the public) where the nearest bathroom is or what the opening hours are. But if you ask them for a detailed architectural blueprint of the building (clinical advice), they might get confused or give you a vague answer.

4. The Small vs. The Big Models

There was a huge gap between the "Super AIs" (expensive, proprietary models) and the "Mini AIs" (smaller, open-source models).

  • The Big AIs: Like a senior professor with a PhD. They are reliable but still make mistakes when talking freely.
  • The Small AIs: Like a bright high schooler. They can do okay on a multiple-choice test, but when asked to write an essay, they are much more likely to make up facts. The performance gap between the big and small models grew much wider in the free-form test.

The Bottom Line: A Warning Label

The paper concludes with a very important message for anyone thinking about using AI for health advice:

"The AI is a great librarian, but a risky doctor."

  • Good News: If you use AI to find information (like a search engine), the latest models are very accurate and can help you find the right government guidance quickly.
  • Bad News: If you let the AI explain the advice or give you a plan in a chat, it might invent rules or miss crucial details.

The Takeaway: We shouldn't just trust the AI to give us medical advice directly. We need to treat it like a helpful assistant that points you to the official rulebook, but we must always double-check its "free-form" stories with a human expert or the official government website. The AI is getting smarter, but it still needs a safety net.