Evaluation and LLM-Guided Learning of ICD Coding Rationales

This paper addresses the lack of systematic evaluation and dedicated training methods for ICD coding rationales. It introduces a novel multi-granular dataset to assess the faithfulness and plausibility of different rationale types, then leverages high-quality LLM-generated rationales as distant supervision to significantly improve the plausibility of rationale generation in both large language models and specialized student models.

Mingyang Li, Viktor Schlegel, Tingting Mu, Wuraola Oyewusi, Kai Kang, Goran Nenadic

Published 2026-03-13

Imagine you are a doctor trying to understand why a computer program (an AI) decided a patient has a specific disease. The AI says, "This patient has Type 2 Diabetes." But it doesn't just give you the answer; it needs to show you why. It needs to point to the specific sentences in the patient's medical notes that led to that conclusion.

In the world of medical coding, these "why" explanations are called rationales.

This paper is like a detective story investigating three different ways an AI can give these explanations, and then teaching the AI how to give better ones using a super-smart robot assistant (a Large Language Model, or LLM).

Here is the breakdown in simple terms:

1. The Problem: The AI is a "Black Box"

Currently, hospitals use AI to automatically turn messy doctor's notes into standard codes (like ICD-10 codes) for billing and records. The AI is great at getting the code right, but it's terrible at explaining how it got there.

  • The Old Way: The AI used to just highlight random words based on "attention" (like a spotlight that gets a bit fuzzy). It's like a student guessing the answer on a test and then pointing to random words in the textbook to justify it, even if those words aren't actually the reason.
  • The Issue: Doctors don't trust the AI if they can't see the real evidence. Plus, the old datasets used to test these explanations were outdated (like using a map from 1990 to navigate a city in 2026).

2. The New Tool: A Fresh, High-Quality Map

The researchers built a brand-new dataset called RD-IV-10.

  • The Analogy: Imagine you are training a new chef. Instead of giving them a recipe book with torn pages and missing ingredients (the old data), you give them a pristine, modern cookbook with high-quality photos and step-by-step instructions.
  • What they did: They took 150 real patient records and had human medical experts carefully highlight exactly which sentences proved a diagnosis. This created a "Gold Standard" to test if the AI's explanations are actually good.

3. The Three Types of Explanations (The Contest)

The researchers tested three different "students" to see who could give the best explanation:

  • Student A (The Entity Linker): This is a robot that just looks for specific medical terms (like "diabetes" or "aspirin") and highlights them.
    • Verdict: It's okay, but it's like a search engine. It finds the words, but it doesn't understand the story or the context.
  • Student B (The Attention Model): This is the old-school AI that highlights words based on mathematical weights.
    • Verdict: It's the worst. It often highlights random words that have nothing to do with the diagnosis. It's like a student highlighting the word "the" because it appears a lot, not because it explains the disease.
  • Student C (The LLM - Large Language Model): This is a super-smart AI (like a very advanced chatbot) that reads the whole note and writes a summary of why the diagnosis fits.
    • Verdict: The Winner! It generated explanations that humans found the most convincing and logical. It understood the context, not just the keywords.
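To make Student B's failure mode concrete, here is a minimal sketch of how an attention-based rationale is typically extracted: the tokens with the highest attention weights are presented as the "explanation." The function, token list, and weights below are toy illustrations, not from the paper; the weights are chosen to show how attention mass can land on a filler word.

```python
# Hypothetical sketch of attention-based rationale extraction (Student B).
# Tokens with the highest attention weight are returned as the "rationale".
# All names and numbers here are illustrative toy values.

def attention_rationale(tokens, weights, top_k=2):
    """Return the top_k tokens with the highest attention weight."""
    ranked = sorted(zip(tokens, weights), key=lambda tw: tw[1], reverse=True)
    return [tok for tok, _ in ranked[:top_k]]

tokens = ["patient", "denies", "chest", "pain", "history", "of", "diabetes"]
# Toy weights: the filler word "of" gets the most mass, illustrating why
# attention can highlight tokens unrelated to the diagnosis.
weights = [0.05, 0.05, 0.10, 0.10, 0.05, 0.35, 0.30]

print(attention_rationale(tokens, weights))  # → ['of', 'diabetes']
```

Because the extraction step just ranks weights, nothing guarantees the highlighted tokens carry clinical meaning, which is exactly the "highlighting the word 'the'" problem described above.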

4. The Breakthrough: Teaching the AI with a "Tutor"

Since the super-smart LLM (Student C) was so good at explaining things, the researchers asked: "Can we use this smart robot to teach the other, simpler AI models how to explain themselves?"

They used a technique called Distant Supervision.

  • The Analogy: Imagine a master chef (the LLM) writing down the perfect reasons for a dish. Then, they give those notes to a junior chef (the smaller AI model) and say, "Study these notes and try to learn how to explain your cooking."
  • The Result: The junior models got much better at explaining their decisions. They didn't just get the right answer; they learned to point to the right evidence.
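The master-chef analogy can be sketched as a data-labeling step: sentences the teacher LLM cites as evidence become positive training labels for the student model. The function, note, and labeling rule below are simplified assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch of the distant-supervision idea: sentences the teacher
# LLM cited in its rationale become positive evidence labels that a
# smaller student model is then trained on. Illustrative assumption only.

def build_supervision(note_sentences, llm_rationale_sentences):
    """Label each sentence 1 if the teacher LLM cited it as evidence, else 0."""
    cited = set(llm_rationale_sentences)
    return [(sent, 1 if sent in cited else 0) for sent in note_sentences]

note = [
    "Patient admitted with polyuria and fatigue.",
    "HbA1c measured at 9.2%.",
    "Family vacationed in Spain last year.",
]
teacher_rationale = ["HbA1c measured at 9.2%."]

for sentence, label in build_supervision(note, teacher_rationale):
    print(label, sentence)
```

The supervision is "distant" because the labels come from the teacher model rather than from human annotators, so they are plentiful but imperfect.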

5. The "Cheat Sheet" Trick (Few-Shot Prompting)

The researchers also tried a clever trick. When asking the super-smart LLM to write explanations, they gave it a few examples of perfect human-written explanations first.

  • The Analogy: It's like showing a student a sample essay before asking them to write their own.
  • The Result: The LLM's explanations became even more accurate and human-like. It learned the "style" of a good medical explanation.
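The "cheat sheet" trick amounts to prompt construction: a few human-written example rationales are prepended before the new note so the LLM imitates their style. The template, example note, and ICD code below are made up for illustration; the paper's actual prompt format may differ.

```python
# Sketch of few-shot prompt construction: human-written example rationales
# are shown before the new note. Template and examples are illustrative.

def build_few_shot_prompt(examples, new_note, code):
    """Assemble a prompt from (note, code, rationale) demonstrations."""
    parts = []
    for ex_note, ex_code, ex_rationale in examples:
        parts.append(f"Note: {ex_note}\nCode: {ex_code}\nRationale: {ex_rationale}\n")
    # The final block leaves "Rationale:" open for the LLM to complete.
    parts.append(f"Note: {new_note}\nCode: {code}\nRationale:")
    return "\n".join(parts)

examples = [
    ("Glucose 280 mg/dL; on metformin.", "E11.9",
     "Elevated glucose and metformin use support Type 2 Diabetes."),
]
prompt = build_few_shot_prompt(examples, "HbA1c 9.2%; polyuria noted.", "E11.9")
print(prompt)
```

Leaving the final "Rationale:" field empty is what invites the model to continue in the demonstrated style rather than inventing its own format.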

The Big Takeaway

This paper proves two main things:

  1. Old AI explanations are often nonsense. We can't trust them yet.
  2. New AI (LLMs) can act as excellent teachers. By using a super-smart AI to generate "model answers," we can train smaller, faster AI models to not only diagnose patients correctly but also explain their reasoning in a way that human doctors can trust.

In short: They built a better test, found that smart chatbots are the best teachers, and used those chatbots to train medical AIs to finally "show their work" in a way that makes sense to humans.
