Using Relative Risk Rankings to Understand Information Differences in Multimodal Prediction Models

This study demonstrates that replacing raw chest radiographs with expert-written reports in multimodal mortality prediction models leads to significant information loss and altered risk prioritization, suggesting that text summaries are imperfect proxies for visual prognostic cues.

Kim, C., Yoon, W., Lee, H., Lee, J.-O., Afshar, M., Kang, J., Miller, T. A.

Published 2026-04-07

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to predict who might get sick again soon after leaving the hospital. You have two main ways to gather information about a patient:

  1. The Raw Evidence: A high-definition photo of their lungs (a chest X-ray).
  2. The Summary: A doctor's written note describing what they see in that photo.

For the sake of convenience, hospitals and computer programs often swap the photo for the written note. It's easier to read text than to analyze thousands of pixels. But this paper asks a crucial question: Does swapping the photo for the note throw away important clues?

The Experiment: The Detective vs. The Report

The researchers acted like detectives trying to solve a mystery: "Who is at high risk of dying within 30 days of leaving the hospital?"

They built a smart computer system (an AI) and gave it three different sets of clues to solve the mystery:

  • Clue Set A: Just the patient's general discharge summary (the "big picture").
  • Clue Set B: The general summary + the written report from the radiologist.
  • Clue Set C: The general summary + the actual X-ray image.

The Result:
The computer was best at solving the mystery when it could see the actual X-ray image (Clue Set C). It was slightly less accurate when the radiologist's written report stood in for the image (Clue Set B), and least accurate with the general summary alone (Clue Set A).
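To make the setup concrete, here is a minimal sketch, in Python, of the kind of comparison described above. It is not the authors' pipeline: the feature matrices, labels, and model choice are hypothetical stand-ins filled with random synthetic data, so the printed AUROC numbers mean nothing. The point is the structure: one classifier per clue set, all scored on the same question of who dies within 30 days.

```python
# Sketch of the three-way comparison (hypothetical data, not the paper's models).
# X_summary, X_report, X_image stand in for pre-extracted features, e.g. text
# embeddings of the notes and image embeddings of the chest X-rays.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
X_summary = rng.normal(size=(n, 32))                          # discharge summary only
X_report = np.hstack([X_summary, rng.normal(size=(n, 16))])   # + radiology report text
X_image = np.hstack([X_summary, rng.normal(size=(n, 64))])    # + X-ray image features
y = rng.integers(0, 2, size=n)                                # 1 = died within 30 days (synthetic)

for name, X in [("A: summary only", X_summary),
                ("B: summary + report", X_report),
                ("C: summary + image", X_image)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"Clue set {name}: AUROC = {auc:.3f}")
```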

The "Missing Clues" Analogy

Why did the image win? Think of the X-ray as a crime scene photo and the radiologist's report as a police officer's written summary of that photo.

Even a great police officer might miss a tiny, subtle detail in the photo when writing their report. Maybe there's a faint shadow or a slight texture change that screams "danger" to a computer looking at the pixels, but the human doctor didn't think it was important enough to write down.

The study found that when the computer relied on the written report, it wasn't just "less smart" overall; it actually prioritized the wrong patients, ranking low-risk patients as high-risk and vice versa. It's like a detective who, instead of looking at the photo, only reads the summary and ends up chasing the wrong suspect because a tiny, crucial detail was left out of the notes.
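This is what the title's "relative risk rankings" gets at: two models can look similarly accurate overall yet order patients very differently. Below is a minimal sketch of how one might compare those orderings, using Spearman rank correlation on hypothetical predicted risk scores; it is an illustration of the idea, not the authors' actual analysis.

```python
# Compare how two models *rank* patients, regardless of overall accuracy.
# risk_image and risk_report are hypothetical 30-day risk scores from an
# image-based and a report-based model (synthetic stand-ins here).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
risk_image = rng.uniform(size=200)                                          # image-based model
risk_report = np.clip(risk_image + rng.normal(scale=0.2, size=200), 0, 1)   # noisy text proxy

rho, _ = spearmanr(risk_report, risk_image)
print(f"Spearman rank correlation between the two risk rankings: {rho:.2f}")

# Patients the report-based model under-ranks most, relative to the image-based model.
rank_gap = risk_image.argsort().argsort() - risk_report.argsort().argsort()
print("Largest ranking disagreements (patient indices):", np.argsort(rank_gap)[-5:])
```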

The Big Takeaway

The paper teaches us that text summaries are not perfect substitutes for raw images.

  • The Problem: We often replace complex data (images) with simple summaries (text) because it's easier.
  • The Risk: In doing so, we might lose subtle, life-saving information that only the raw data contains.
  • The Lesson: When building AI to predict health outcomes, we can't just check if the AI is "right" or "wrong." We also have to check if it is ranking patients correctly. If the AI looks at a photo, it might spot a hidden danger that the written report missed, leading to a better prediction of who needs the most help.

In short: Don't just read the summary; look at the picture. Sometimes, the most important clues are the ones nobody thought to write down.
