Large Language Models Readability Classification: A Variability Analysis of Sources and Metrics

This study reveals that while Large Language Models produce homogeneous readability at baseline, their output complexity becomes significantly variable when grounded in external sources like Wikipedia, and that readability metrics are not interchangeable, necessitating transparent, metric-specific, and language-aware evaluation protocols for accessible health communication.

Corrale de Matos, H. G., Wasmann, J.-W. A., Catalani Morata, T., de Freitas Alvarenga, K., Bornia Jacob, L. C.

Published 2026-03-02

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to explain a complex medical diagnosis to a friend who has no medical training. You want the information to be true (so they don't get sick) and easy to understand (so they actually listen).

This paper is like a report card for seven different "AI Chatbots" (like advanced versions of Siri or ChatGPT) trying to do exactly that: explain hearing health issues to patients in English and Portuguese.

Here is the story of what they found, using some simple analogies:

1. The Setup: The "Cooking Class"

The researchers gave seven different AI chefs the same recipe card (based on World Health Organization guidelines about hearing aids). They asked them to cook up an explanation for patients.

They tested the chefs in two ways:

  • Round 1 (The Baseline): "Just use what you already know." (The chefs cooked from their own memory).
  • Round 2 (The Wikipedia Grounding): "Go look up the facts on Wikipedia first, then cook." (The chefs were forced to use a specific, verified source; see the sketch below.)
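
The paper's exact prompts and grounding mechanism aren't spelled out in this summary, but the two rounds correspond to a common pattern: ask the model directly, then ask again with the verified source text pasted into the prompt. The sketch below is purely illustrative; `ask_model`, the question wording, and the Wikipedia excerpt are all hypothetical placeholders.

```python
# Minimal sketch of the two test conditions (hypothetical prompts; the paper's
# exact wording, models, and grounding mechanism are not specified here).

QUESTION = (
    "Explain to a patient, in plain language, how hearing aids work "
    "and when they are recommended."
)

def baseline_prompt(question: str) -> str:
    # Round 1: the model answers from its own training data ("memory").
    return question

def grounded_prompt(question: str, source_text: str) -> str:
    # Round 2: the same question, but the answer must be based on a
    # pasted excerpt from a verified source (e.g., Wikipedia).
    return (
        "Using only the source text below, answer the question for a patient.\n\n"
        f"SOURCE:\n{source_text}\n\n"
        f"QUESTION:\n{question}"
    )

wikipedia_excerpt = "(excerpt from the relevant Wikipedia article goes here)"

# ask_model() stands in for whichever of the seven chatbots is being tested:
# baseline_answer = ask_model(baseline_prompt(QUESTION))
# grounded_answer = ask_model(grounded_prompt(QUESTION, wikipedia_excerpt))
```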

They then hired five different "Food Critics" (Readability Metrics) to grade the dishes. These critics didn't taste the food; they just measured the ingredients (sentence length, word difficulty, syllable counts) to see how hard the dish would be to chew.

2. The Big Surprise: The "Wikipedia Twist"

What happened in Round 1?
When the chefs cooked from their own memory, they were surprisingly similar. They all produced dishes that were roughly the same "difficulty level." It was like a group of students taking a test without a textbook; they all gave answers that were about equally hard to read.

What happened in Round 2?
When the chefs were told to use Wikipedia, chaos broke out.
Suddenly, the same recipe produced wildly different results depending on which chef you asked.

  • Chef A (e.g., ChatGPT) took the Wikipedia facts and made a simple, easy-to-eat salad.
  • Chef B (e.g., Copilot) took the exact same Wikipedia facts and served a complex, heavy steak that was very hard to chew.

The Lesson: Even when you give AI the same verified facts, different AI models process them differently. Some summarize simply; others copy the complex language of the source. This means that if you switch AI models, the "difficulty" of your health advice might change overnight, even if the facts stay the same.

3. The Second Surprise: The "Confused Critics"

The researchers also looked at the five "Food Critics" (the readability metrics). They expected the critics to agree on how hard the food was.

They didn't.

  • Critic #1 (who counts syllables) said, "This is a gourmet, difficult meal!"
  • Critic #2 (who counts characters) said, "No, this is a simple snack!"

The Lesson: There is no single "Readability Score." Depending on which tool you use to measure the text, you might get a completely different grade. If you rely on just one tool, you might think a text is easy when it's actually confusing, or vice versa.
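
To see why the critics disagree, here is a minimal sketch (not from the paper) of two widely used formulas, one driven by syllable counts and one by character counts, applied to two invented sentences with hand-counted statistics. The example texts, the counts, and the choice of these two particular formulas are assumptions for illustration only.

```python
# Two standard readability formulas applied to hand-counted text statistics,
# showing that they can rank the same pair of texts in opposite orders.

def flesch_reading_ease(words, sentences, syllables):
    # Syllable-based "critic": higher score = easier to read.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def coleman_liau_index(words, sentences, letters):
    # Character-based "critic": result approximates a US grade level (higher = harder).
    L = letters / words * 100    # letters per 100 words
    S = sentences / words * 100  # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

# Text A: "The strengths through thought brought growth."  (long words, one syllable each)
text_a = dict(words=6, sentences=1, syllables=6, letters=39)
# Text B: "A tiny baby ate a banana."                       (short words, many syllables)
text_b = dict(words=6, sentences=1, syllables=10, letters=19)

for name, t in [("A", text_a), ("B", text_b)]:
    fre = flesch_reading_ease(t["words"], t["sentences"], t["syllables"])
    cli = coleman_liau_index(t["words"], t["sentences"], t["letters"])
    print(f"Text {name}: Flesch Reading Ease = {fre:.1f}, Coleman-Liau grade = {cli:.1f}")

# Text A: Flesch ~116 ("very easy"),   Coleman-Liau ~17.5 (college level).
# Text B: Flesch ~60  (harder than A), Coleman-Liau ~-2.1 (trivially easy).
# The two "critics" do not just differ in degree; they reverse the ranking.
```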

4. The "Techno-Solution" Trap

The paper warns against a common trap called "Techno-solutionism." This is the belief that "If we just give the AI better facts (like Wikipedia), everything will be perfect."

The study shows this isn't true.

  • The Trade-off: Giving the AI verified facts (Wikipedia) makes the information more accurate, but it often makes it harder to understand, because the AI tends to mirror the complex language of the source instead of simplifying it.
  • The Risk: You might end up with a health guide that is 100% true but written in a language so complex that the patient gives up and doesn't read it.

5. The Bottom Line: What Should We Do?

The authors suggest three rules for anyone using AI for health advice:

  1. Don't trust just one AI: Different models act differently. You can't assume they are interchangeable.
  2. Don't trust just one score: Don't rely on a single "readability number." Use a battery of different tests to get the full picture (see the sketch after this list).
  3. Keep a human in the loop: Just because an AI says something is "true" doesn't mean it's "accessible." We need to check if the patient can actually understand it.
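
In practice, "use a battery of different tests" can be as simple as running several formulas over the same passage and reporting all of them. Below is a minimal sketch using the open-source `textstat` package; the sample passage and the particular set of metrics are assumptions, not the paper's actual pipeline.

```python
# Score one passage with several standard readability formulas via `textstat`
# (pip install textstat). Illustrative only; not the authors' evaluation code.
import textstat

passage = (
    "A hearing aid is a small device worn in or behind the ear. "
    "It makes sounds louder so that a person with hearing loss can listen, "
    "communicate, and take part in daily activities. "
    "Different styles exist, and an audiologist can help choose the right one."
)

scores = {
    "Flesch Reading Ease": textstat.flesch_reading_ease(passage),
    "Flesch-Kincaid Grade": textstat.flesch_kincaid_grade(passage),
    "Gunning Fog": textstat.gunning_fog(passage),
    "SMOG": textstat.smog_index(passage),
    "Coleman-Liau": textstat.coleman_liau_index(passage),
}

for name, value in scores.items():
    print(f"{name:22s} {value:6.1f}")

# Caveat: these formulas were calibrated on English text; scoring Portuguese
# output fairly requires language-adapted versions of the formulas, which is
# part of the paper's point about "language-aware" evaluation.
```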

In a nutshell:
Imagine you are building a bridge for people to cross a river (getting health information).

  • Accuracy is making sure the bridge doesn't collapse (using Wikipedia facts).
  • Readability is making sure the bridge has a ramp so everyone can walk across, not just professional climbers.

This paper found that while we are getting better at building strong bridges (accuracy), some of them end up without ramps, steep and hard to climb (complexity), depending on which construction crew (AI model) we hire. We need to check the ramps carefully, no matter how strong the bridge is.
