Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese

This study evaluates seven open-source large language models for assisting Japanese pathology report writing, finding that while thinking and medical-specialized models excel in structured reporting and typo correction, their utility varies by task and rater preference, suggesting they are beneficial in specific, clinically relevant scenarios.

Masataka Kawai, Singo Sakashita, Shumpei Ishikawa, Shogo Watanabe, Anna Matsuoka, Mikio Sakurai, Yasuto Fujimoto, Yoshiyuki Takahara, Atsushi Ohara, Hirohiko Miyake, Genichiro Ishii

Published 2026-03-13

Imagine a busy hospital where pathologists are like detectives. Their job is to examine tiny tissue samples under a microscope and write detailed reports explaining what's wrong with a patient. These reports are crucial, but they are also tedious, full of strict formatting rules, and prone to tiny typos that can cause big confusion.

This paper is like a taste test for seven different "AI assistants" (Open-Source Large Language Models) to see which ones are good enough to help these detectives write their reports in Japanese.

Here is the breakdown of the experiment using simple analogies:

1. The Contestants (The AI Models)

The researchers didn't just pick random robots. They picked seven specific "brains" available in early 2026.

  • The Generalists: Some are like smart, all-purpose students (e.g., Gemma, Qwen).
  • The Specialists: Some are like students who went to medical school specifically (e.g., MedGemma, SIP-jmed).
  • The Thinkers: Some are designed to "think before they speak," taking extra time to reason through a problem (e.g., Qwen-Thinking, gpt-oss).

They ran all these models on a powerful local computer (a Mac Studio), simulating a hospital setting where patient data can't leave the building for privacy reasons.

2. The Three Challenges

The AI models were put through three distinct tests, like a triathlon:

Challenge A: The "Fill-in-the-Blank" Test (Formatting & Extraction)

  • The Task: Imagine a doctor gives the AI a messy list of facts (JSON data) and says, "Turn this into a perfect, official hospital report." Or, "Read this official report and pull the facts back out into a list."
  • The Result:
    • The Generalists were great at copying the format perfectly. They were like fast typists who never miss a comma.
    • However, when the task required math or logic (like calculating cancer stages based on tumor size), the regular models got confused and simply guessed.
    • The "Thinker" models shone here. Because they pause to reason, they got the math right almost 100% of the time. They were like the students who actually understood the lesson, not just memorized the answers.
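The staging logic that tripped up the regular models is, at its core, a deterministic rule applied to extracted facts. Here is a minimal Python sketch of the idea, assuming simplified size thresholds in the style of TNM breast-cancer pT categories (the actual staging rules and report templates used in the paper are not specified, so the function names, thresholds, and field names below are illustrative only):

```python
def t_stage_from_size(size_mm: float) -> str:
    """Map tumor size (mm) to a simplified T category.

    Thresholds follow the common breast-cancer pattern
    (<=20 mm -> pT1, <=50 mm -> pT2, >50 mm -> pT3).
    This is an illustrative simplification, NOT the
    paper's actual staging logic.
    """
    if size_mm <= 20:
        return "pT1"
    elif size_mm <= 50:
        return "pT2"
    return "pT3"


def fill_report(facts: dict) -> str:
    """Render extracted facts (a JSON-like dict) into a fixed
    report template, mimicking the formatting/extraction task."""
    stage = t_stage_from_size(facts["tumor_size_mm"])
    return (
        f"Diagnosis: {facts['diagnosis']}\n"
        f"Tumor size: {facts['tumor_size_mm']} mm\n"
        f"T category: {stage}"
    )


facts = {"diagnosis": "Invasive carcinoma", "tumor_size_mm": 35}
print(fill_report(facts))
```

The point of the sketch: the rule itself is trivial for code, but a language model only gets it right if it actually applies the threshold logic rather than pattern-matching, which is why the reasoning-oriented models pulled ahead on this sub-task.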

Challenge B: The "Proofreader" Test (Typo Correction)

  • The Task: The researchers took real reports and intentionally added typos (swapping letters, deleting words, using the wrong Japanese characters). The AI had to find and fix them without changing the meaning.
  • The Result:
    • This was tricky. Some AIs were too aggressive and deleted whole sentences (like a proofreader who throws away the page because they found one typo).
    • The Medical Specialists and the Thinkers did the best job. They understood the context of medical terms better than the general models. One model (Qwen) was the most balanced "proofreader," fixing errors without breaking the text.
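One way to operationalize the "too aggressive proofreader" problem is to compare the model's corrected text against the original and flag corrections that diverge too much. Here is a minimal sketch using Python's standard `difflib`; this is an assumed illustration of the idea, not a reproduction of the paper's actual evaluation metrics:

```python
import difflib


def correction_score(original: str, corrected: str) -> float:
    """Similarity ratio (0..1) between the original report and the
    model's correction. A score near 1.0 means the text was mostly
    preserved; a much lower score flags that the model rewrote or
    deleted content instead of just fixing typos."""
    return difflib.SequenceMatcher(None, original, corrected).ratio()


report = "The tumor measuers 35 mm in greatest dimension."          # contains a typo
good_fix = "The tumor measures 35 mm in greatest dimension."         # typo fixed, text intact
aggressive = "Tumor: 35 mm."                                         # whole sentence rewritten

print(correction_score(report, good_fix))    # close to 1.0
print(correction_score(report, aggressive))  # noticeably lower
```

A balanced proofreader, like the Qwen model the paper highlights, would consistently land near 1.0 on such a measure while still eliminating the injected errors.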

Challenge C: The "Teacher" Test (Explaining to Humans)

  • The Task: The AI had to write a simple explanation of a complex cancer case, as if teaching a new medical resident.
  • The Result: This was the most surprising part.
    • Human taste is subjective. Just like how one person loves spicy food and another hates it, the doctors and clinicians who graded the AI had very different opinions.
    • One doctor might say, "This explanation is perfect!" while another says, "This is confusing."
    • There was no single "best" AI. The "Thinker" models were liked more by pathologists, while others were preferred by clinicians. It showed that what sounds good to a doctor depends entirely on who you ask.

3. The Big Takeaway

The paper concludes that there is no "one-size-fits-all" robot for this job yet.

  • If you need speed and perfect formatting: Use a standard, fast model.
  • If you need complex logic or math: Use a "Thinking" model, even if it's slower.
  • If you need to fix typos or explain things: A medical-specialized model is your best bet.

The "Local" Advantage:
The authors emphasize that using these open-source models is like having a private library in your hospital. You don't have to send sensitive patient data to a big tech company (like Google or OpenAI) over the internet. You can run the AI right there in the hospital, keeping secrets safe.

The Bottom Line

Open-source AI is ready to be a valuable intern for Japanese pathologists, but it's not ready to be the boss yet. It needs to be assigned the right tasks (logic vs. typing vs. explaining) and needs a human to double-check its work, especially because different doctors have different styles.

In short: The tools are here, but we still need to learn how to use them wisely.