Large Language Models Readability Classification: A Variability Analysis of Sources and Metrics

This study reveals that while Large Language Models produce homogeneous readability at baseline, their output complexity becomes significantly variable when grounded in external sources like Wikipedia, and that readability metrics are not interchangeable, necessitating transparent, metric-specific, and language-aware evaluation protocols for accessible health communication.

Corrale de Matos, H. G., Wasmann, J.-W. A., Catalani Morata, T., de Freitas Alvarenga, K., Bornia Jacob, L. C.

Published 2026-03-02

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to explain a complex medical diagnosis to a friend who has no medical training. You want the information to be true (so they don't get sick) and easy to understand (so they actually listen).

This paper is like a report card for seven different "AI Chatbots" (like advanced versions of Siri or ChatGPT) trying to do exactly that: explain hearing health issues to patients in English and Portuguese.

Here is the story of what they found, using some simple analogies:

1. The Setup: The "Cooking Class"

The researchers gave seven different AI chefs the same recipe card (based on World Health Organization guidelines about hearing aids). They asked them to cook up an explanation for patients.

They tested the chefs in two ways:

  • Round 1 (The Baseline): "Just use what you already know." (The chefs cooked from their own memory).
  • Round 2 (The Wikipedia Grounding): "Go look up the facts on Wikipedia first, then cook." (The chefs were forced to use a specific, verified source; see the sketch below.)
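
The paper's exact prompts and grounding mechanism aren't spelled out in this summary, but the two rounds correspond to a common pattern: ask the model directly, then ask again with the verified source text pasted into the prompt. The sketch below is purely illustrative; `ask_model`, the question wording, and the Wikipedia excerpt are all hypothetical placeholders.

```python
# Minimal sketch of the two test conditions (hypothetical prompts; the paper's
# exact wording, models, and grounding mechanism are not specified here).

QUESTION = (
    "Explain to a patient, in plain language, how hearing aids work "
    "and when they are recommended."
)

def baseline_prompt(question: str) -> str:
    # Round 1: the model answers from its own training data ("memory").
    return question

def grounded_prompt(question: str, source_text: str) -> str:
    # Round 2: the same question, but the answer must be based on a
    # pasted excerpt from a verified source (e.g., Wikipedia).
    return (
        "Using only the source text below, answer the question for a patient.\n\n"
        f"SOURCE:\n{source_text}\n\n"
        f"QUESTION:\n{question}"
    )

wikipedia_excerpt = "(excerpt from the relevant Wikipedia article goes here)"

# ask_model() stands in for whichever of the seven chatbots is being tested:
# baseline_answer = ask_model(baseline_prompt(QUESTION))
# grounded_answer = ask_model(grounded_prompt(QUESTION, wikipedia_excerpt))
```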

They then hired five different "Food Critics" (Readability Metrics) to grade the dishes. These critics didn't taste the food; they just measured the ingredients (sentence length, word difficulty, syllable counts) to see how hard the dish would be to chew.

2. The Big Surprise: The "Wikipedia Twist"

What happened in Round 1?
When the chefs cooked from their own memory, they were surprisingly similar. They all produced dishes that were roughly the same "difficulty level." It was like a group of students taking a test without a textbook; they all gave answers that were about equally hard to read.

What happened in Round 2?
When the chefs were told to use Wikipedia, chaos broke out.
Suddenly, the same recipe produced wildly different results depending on which chef you asked.

  • Chef A (e.g., ChatGPT) took the Wikipedia facts and made a simple, easy-to-eat salad.
  • Chef B (e.g., Copilot) took the exact same Wikipedia facts and served a complex, heavy steak that was very hard to chew.

The Lesson: Even when you give AI the same verified facts, different AI models process them differently. Some summarize simply; others copy the complex language of the source. This means that if you switch AI models, the "difficulty" of your health advice might change overnight, even if the facts stay the same.

3. The Second Surprise: The "Confused Critics"

The researchers also looked at the five "Food Critics" (the readability metrics). They expected the critics to agree on how hard the food was.

They didn't.

  • Critic #1 (who counts syllables) said, "This is a gourmet, difficult meal!"
  • Critic #2 (who counts characters) said, "No, this is a simple snack!"

The Lesson: There is no single "Readability Score." Depending on which tool you use to measure the text, you might get a completely different grade. If you rely on just one tool, you might think a text is easy when it's actually confusing, or vice versa.
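
To see why the critics disagree, here is a minimal sketch (not from the paper) of two widely used formulas, one driven by syllable counts and one by character counts, applied to two invented sentences with hand-counted statistics. The example texts, the counts, and the choice of these two particular formulas are assumptions for illustration only.

```python
# Two standard readability formulas applied to hand-counted text statistics,
# showing that they can rank the same pair of texts in opposite orders.

def flesch_reading_ease(words, sentences, syllables):
    # Syllable-based "critic": higher score = easier to read.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def coleman_liau_index(words, sentences, letters):
    # Character-based "critic": result approximates a US grade level (higher = harder).
    L = letters / words * 100    # letters per 100 words
    S = sentences / words * 100  # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

# Text A: "The strengths through thought brought growth."  (long words, one syllable each)
text_a = dict(words=6, sentences=1, syllables=6, letters=39)
# Text B: "A tiny baby ate a banana."                       (short words, many syllables)
text_b = dict(words=6, sentences=1, syllables=10, letters=19)

for name, t in [("A", text_a), ("B", text_b)]:
    fre = flesch_reading_ease(t["words"], t["sentences"], t["syllables"])
    cli = coleman_liau_index(t["words"], t["sentences"], t["letters"])
    print(f"Text {name}: Flesch Reading Ease = {fre:.1f}, Coleman-Liau grade = {cli:.1f}")

# Text A: Flesch ~116 ("very easy"),   Coleman-Liau ~17.5 (college level).
# Text B: Flesch ~60  (harder than A), Coleman-Liau ~-2.1 (trivially easy).
# The two "critics" do not just differ in degree; they reverse the ranking.
```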

4. The "Techno-Solution" Trap

The paper warns against a common trap called "Techno-solutionism." This is the belief that "If we just give the AI better facts (like Wikipedia), everything will be perfect."

The study shows this isn't true.

  • The Trade-off: Giving the AI verified facts (Wikipedia) makes the information more accurate, but it often makes it harder to understand, because the AI tends to mirror the complex language of the source instead of simplifying it.
  • The Risk: You might end up with a health guide that is 100% true but written in a language so complex that the patient gives up and doesn't read it.

5. The Bottom Line: What Should We Do?

The authors suggest three rules for anyone using AI for health advice:

  1. Don't trust just one AI: Different models act differently. You can't assume they are interchangeable.
  2. Don't trust just one score: Don't rely on a single "readability number." Use a battery of different tests to get the full picture (see the sketch after this list).
  3. Keep a human in the loop: Just because an AI says something is "true" doesn't mean it's "accessible." We need to check if the patient can actually understand it.
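
In practice, "use a battery of different tests" can be as simple as running several formulas over the same passage and reporting all of them. Below is a minimal sketch using the open-source `textstat` package; the sample passage and the particular set of metrics are assumptions, not the paper's actual pipeline.

```python
# Score one passage with several standard readability formulas via `textstat`
# (pip install textstat). Illustrative only; not the authors' evaluation code.
import textstat

passage = (
    "A hearing aid is a small device worn in or behind the ear. "
    "It makes sounds louder so that a person with hearing loss can listen, "
    "communicate, and take part in daily activities. "
    "Different styles exist, and an audiologist can help choose the right one."
)

scores = {
    "Flesch Reading Ease": textstat.flesch_reading_ease(passage),
    "Flesch-Kincaid Grade": textstat.flesch_kincaid_grade(passage),
    "Gunning Fog": textstat.gunning_fog(passage),
    "SMOG": textstat.smog_index(passage),
    "Coleman-Liau": textstat.coleman_liau_index(passage),
}

for name, value in scores.items():
    print(f"{name:22s} {value:6.1f}")

# Caveat: these formulas were calibrated on English text; scoring Portuguese
# output fairly requires language-adapted versions of the formulas, which is
# part of the paper's point about "language-aware" evaluation.
```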

In a nutshell:
Imagine you are building a bridge for people to cross a river (getting health information).

  • Accuracy is making sure the bridge doesn't collapse (using Wikipedia facts).
  • Readability is making sure the bridge has a ramp so everyone can walk across, not just professional climbers.

This paper found that while we are getting better at building strong bridges (accuracy), some of them end up without ramps, steep and hard to climb (complexity), depending on which construction crew (AI model) we hire. We need to check the ramps carefully, no matter how strong the bridge is.
