Original authors: P. Bilha Githinji, Aikaterini Melliou, Zeming Liang, Lian Zhang, Peiwu Qin

Published 2026-05-07

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have a library of medical textbooks written in a secret, highly complex code. These books contain life-saving information, but they are so difficult to read that the average person can't understand a single sentence. The goal of this study was to see if two different "AI translators" could decode these books into plain English without losing the important facts.

The researchers tested two specific AI models:

Mistral: A model tuned to follow instructions very carefully.
Qwen: A model designed to "think harder" and reason through complex problems.

They asked these AIs to rewrite 750 difficult medical summaries into simple language, then compared the results against what human experts did. Here is what they found, using some everyday analogies:

The "Translator" Showdown

Think of the task like translating a dense, technical legal contract into a friendly letter. You need to keep the meaning exactly the same, but make it easy to read.

1. Mistral: The Careful Editor
Mistral acted like a conservative editor. It took the complex medical text and swapped out big, scary words for simpler ones, but it was very careful not to change the story.

The Result: It produced text that was easy to read and, crucially, stayed true to the original meaning. Its "fidelity" (how well it kept the facts) was almost identical to what a human expert would produce.
The Strategy: It mainly swapped jargon for plain words and left the sentence structure largely intact. It didn't try to add new ideas or explain things too much; it just made the existing text clearer.

2. Qwen: The Over-Explainer
Qwen acted like an enthusiastic teacher who wants to make sure you understand everything. It didn't just swap words; it tried to expand on concepts, add explanations, and break things down further.

The Result: While the text it produced was very easy to read (sometimes even easier than Mistral's), it occasionally lost the thread of the original meaning. It was like a teacher who explains a concept so eagerly that they accidentally add a bit of their own interpretation or miss a small detail from the original text.
The Strategy: It took more risks. It tried to "reason" through the text, which led to some creative simplifications but also some factual drift.

The "Scorecard"

The researchers used a scoreboard to grade the AIs:

Readability: Both AIs did a great job making the text easier to read. In fact, they were often better at making the text "short and sweet" than the humans were.
Accuracy: This is where they differed. On BERTScore, a meaning-similarity score where higher is better, Mistral scored 0.91 (statistically indistinguishable from human experts) and Qwen scored 0.89 (statistically lower than the human benchmark). These are similarity scores, not the share of facts kept. That 0.02 gap might sound small, but in the world of medical information it signals that Qwen was slightly more likely to accidentally change a fact or drop a crucial detail.

The "Toolbox" Problem

The study also looked at how we measure success. The researchers found that many of the tools used to grade readability (like formulas that count syllables or sentence length) are actually measuring the same thing in slightly different ways. It's like having five different rulers that all measure inches but have slightly different markings.

They discovered that the hardest part of simplifying medical text isn't breaking up long sentences (syntax); it's handling the specialized vocabulary (lexicon).

Mistral handled the vocabulary by being conservative: "If I'm not sure, I'll keep the original word or swap it very carefully."
Qwen handled the vocabulary by being adventurous: "I'll try to explain this word or find a totally different way to say it," which sometimes led to confusion.

The Bottom Line

The paper concludes that if you want an AI to simplify medical text without changing the facts, Mistral is currently the safer bet. It acts like a reliable translator who knows exactly when to stop and not over-explain.

Qwen is also very capable and produces very readable text, but its "reasoning" style makes it a bit more prone to drifting away from the original facts. The study suggests that for medical information, where accuracy is life-or-death, the "conservative editor" approach is currently superior to the "creative explainer" approach.

Important Note: The study only looked at how well these models simplified text using a single standardized prompt, with no extra training. It did not test how these models would perform in a real hospital, nor did it suggest they should replace doctors or human reviewers. It simply compared their ability to do one specific job: turning hard medical words into easy ones.

Technical Summary: Divergent Readability-Accuracy Strategies of Mistral and QWen in Biomedical Text Simplification

Problem Statement

Access to understandable health information is critical for public health and informed decision-making, yet patient-facing biomedical materials frequently exceed recommended reading levels. While Large Language Models (LLMs) offer a scalable solution for text simplification, they face a persistent trade-off: improving readability often comes at the cost of factual accuracy, introducing semantic drift and undesirable omissions. Existing research suggests that domain adaptation is necessary for biomedical text, yet results are conflicting, with some studies showing general-purpose models outperforming specialized ones. Furthermore, it is not well understood how different LLM architectures navigate the tension between maximizing readability and preserving discourse fidelity without fine-tuning.

Methodology

This study empirically compares two medium-sized, general-purpose LLMs, Mistral-Small 3 24B (instruction-tuned) and Qwen 2.5 32B (reasoning-augmented), on the task of biomedical text simplification.

Data: The primary benchmark consists of 750 biomedical abstracts paired with human-simplified texts. A secondary uncurated dataset covering Traditional Chinese Medicine (TCM) and Oncology was used to test robustness.
Systems: The study evaluates four LLM configurations (two models × two temperature settings: strict $T=0.2$ and flexible $T=0.4$) against a human expert benchmark.
Prompting: A standardized zero-shot prompt was employed, instructing models to perform sentence-by-sentence adaptation without summarization. The prompt explicitly prohibited content distillation and required the models to self-report the specific transformation applied (e.g., jargon swapping, omitting details) and the rationale for each change.
Evaluation: A comprehensive suite of 21 metrics was utilized, categorized into:
- Readability: Dale-Chall, Gunning Fog, FKGL, SMOG, ARI, Flesch Reading Ease, and SARI.
- Accuracy/Discourse Fidelity: BERTScore, Semantic Similarity (LLM embeddings), ROUGE-L, SacreBLEU, LDA-topics, vocabulary matching, and difficult word proportion.
- Safety: Toxicity classification.
Analysis: Statistical comparisons (Welch's t-test) were conducted, alongside correlation analyses and Principal Component Analysis (PCA) regression to examine the relationships between readability and accuracy metrics.
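To make the readability side of the metric suite concrete, here is a minimal sketch of two of the listed formulas, Flesch Reading Ease and FKGL, using their standard published coefficients. The syllable counter is a crude vowel-group heuristic and the sentence splitter is a simple regex; both are assumptions for illustration, not the tooling used in the paper, and the two sample sentences are invented.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption): one syllable per run of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_stats(text: str) -> tuple[int, int, int]:
    """Return (sentence count, word count, syllable count) for a text."""
    # Naive splitting: decimals such as "0.91" would be miscounted.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text: str) -> float:
    """Higher score means easier text."""
    n_sent, n_words, n_syll = text_stats(text)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level: lower score means easier text."""
    n_sent, n_words, n_syll = text_stats(text)
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59


# Invented example pair, not taken from the study's dataset.
technical = "Myocardial infarction precipitates irreversible cardiomyocyte necrosis."
plain = "A heart attack kills heart cells. The cells do not grow back."

for label, text in [("technical", technical), ("plain", plain)]:
    print(label, round(flesch_reading_ease(text), 1), round(fkgl(text), 1))
```

Both formulas depend only on words per sentence and syllables per word, which is one reason such metrics tend to move together.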

Key Results

1. System Performance and SARI Scores

Both models outperformed previous encoder-decoder baselines (T5, BART). Mistral demonstrated superior performance with SARI scores of 42.46 (flexible) and 42.37 (strict), approaching the performance of GPT-4.1-mini. Qwen scored lower at 38.38 (strict) and 37.84 (flexible).

2. Readability vs. Accuracy Trade-off

Mistral: Exhibited a "tempered" lexical simplification strategy. It achieved readability improvements across multiple metrics while maintaining a BERTScore of 0.91, which was statistically indistinguishable from human performance. It showed high vocabulary retention and conservative handling of specialized terms.
Qwen: Achieved enhanced readability (ranking best on Flesch-Kincaid and Flesch Reading Ease) but displayed a disconnect between readability and accuracy. Its BERTScore was 0.89, statistically lower than the human benchmark. Qwen's approach involved more aggressive lexical substitution and conceptual expansion, leading to greater semantic displacement.
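The "statistically indistinguishable" and "statistically lower" statements rest on Welch's t-test, which compares two group means without assuming equal variances. The sketch below computes the Welch statistic and the Welch-Satterthwaite degrees of freedom with the standard library only. The function name `welch_t` and the two score lists are hypothetical and invented for illustration; they are not values from the paper.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each group mean (sample variance / n).
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


# Made-up per-document scores for two systems (illustration only).
system_a = [0.90, 0.92, 0.91, 0.93, 0.89]
system_b = [0.88, 0.90, 0.89, 0.87, 0.91]

t_stat, dof = welch_t(system_a, system_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Turning the statistic into a p-value needs a t-distribution CDF, which the standard library does not provide; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch test and returns the p-value directly.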

3. Metric Correlations and Redundancy

Redundancy: Strong functional redundancies were found among readability metrics (correlations $\ge 0.7$ for SMOG, FKGL, ARI, and Flesch), suggesting that a reduced set of metrics could suffice for evaluation.
Divergent Strategies: Correlation analysis revealed that Mistral's readability and accuracy metrics were more tightly coupled (coefficients in $[0.2, 0.4]$) than Qwen's (coefficients in $[-0.2, 0.1]$). This indicates Mistral optimizes for both objectives simultaneously, whereas Qwen's strategies appear more disconnected.
Lexical Control: The study found that lexical control, rather than syntactic restructuring, is the primary hurdle. Mistral's conservative retention of specialized vocabulary correlated strongly with accuracy, while Qwen's aggressive substitution correlated negatively with semantic integrity.
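The redundancy finding can be illustrated with a small correlation check: compute the Pearson coefficient for every pair of metrics and flag pairs at or above the 0.7 level mentioned above. This is a sketch under stated assumptions: the helper names (`pearson`, `redundant_pairs`) and the toy per-document scores are hypothetical, and the paper's own analysis (including its PCA regression) is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def redundant_pairs(metrics: dict[str, list[float]], threshold: float = 0.7):
    """List metric pairs whose absolute correlation meets the threshold."""
    flagged = []
    for name_a, name_b in combinations(metrics, 2):
        r = pearson(metrics[name_a], metrics[name_b])
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Invented per-document scores (illustration only, not the paper's data).
toy_scores = {
    "FKGL": [12.0, 14.0, 10.0, 16.0],
    "SMOG": [13.0, 15.0, 11.0, 17.0],
    "BERTScore": [0.90, 0.91, 0.91, 0.90],
}

for name_a, name_b, r in redundant_pairs(toy_scores):
    print(f"{name_a} and {name_b} look redundant (r = {r:.2f})")
```

In this toy data the two readability formulas move in lockstep while the accuracy score does not, which mirrors the kind of pattern that would justify reporting a reduced set of readability metrics.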

4. Self-Reported Rationales

Analysis of the models' self-reported changes was consistent with these divergent strategies:

Mistral primarily relied on "jargon/parlance swapping" and "omitting unnecessary details," operating conservatively within the bounds of the input.
Qwen frequently engaged in "adding explanation" and "abstracting/generalizing," reflecting a more exploratory approach that risks semantic degradation.

Significance and Claims

The paper claims that instruction-tuned models (Mistral) may offer a more robust "sweet spot" for biomedical text simplification compared to reasoning-augmented models (Qwen) when operating in a zero-shot setting. The study highlights that:

Architectural Advantage: Mistral's instruction tuning appears to favor a conservative strategy that balances lexical simplification with semantic fidelity, achieving human-level discourse fidelity without fine-tuning.
Metric Insights: The research provides evidence of strong redundancies in readability metrics and clarifies the tension between readability and accuracy, suggesting that current metric suites may not fully capture the nuances of reasoning-augmented models' simplification processes.
Practical Baseline: The findings update practical baselines for biomedical text simplification, indicating that for general-purpose LLMs, the primary challenge lies in lexical control rather than syntactic restructuring.

The authors conclude that while Qwen is capable and achieves high readability scores, its aggressive exploration of the lexical search space risks semantic integrity. In contrast, Mistral's tempered approach offers a more reliable balance for scalable, accessible biomedical information. The study acknowledges limitations, noting that further evaluation across a wider range of LLMs and domains is required to definitively characterize architectural differences.

Making Knowledge Accessible: Divergent Readability-Accuracy Strategies of Mistral and QWen in Biomedical Text Simplification