Imagine you are trying to teach a robot doctor how to look at X-rays and CT scans. You want the robot to learn by looking at thousands of pictures and reading the reports written by human radiologists that go with them. This is called Vision-Language Pretraining.
However, there's a big problem: Human doctors write reports in very different ways. One doctor might write a long, rambling story with personal notes and history. Another might use short, bullet-point lists. Some might use fancy medical jargon, while others use plain English. It's like trying to teach a child to recognize apples by showing them pictures of apples, but describing them with sentences like "The red round thing," "A fruit that grows on trees," "A crunchy snack," and "A symbol of health." The robot gets confused by the messy descriptions, even though the pictures are the same.
This is where MedTri comes in. Think of MedTri as a super-organized translator or a strict editor that cleans up these messy reports before the robot ever sees them.
The Problem: The "Messy Room"
Imagine the raw medical reports are like a messy bedroom.
- The Clothes: The important medical facts (like "there is a shadow on the lung").
- The Junk: Irrelevant stuff (like "the patient has a history of smoking" or "we should schedule a follow-up next week").
- The Style: Some rooms are chaotic, some are tidy, but they all look different.
If you try to teach the robot using these messy rooms, it spends too much time looking at the junk and not enough time learning what the clothes (the actual medical findings) look like.
The Solution: MedTri's "Uniform Box"
MedTri takes that messy room and forces everything into a standardized, three-part box for every single finding. It ignores the junk and the style differences.
The box looks like this:
[Body Part] : [What it looks like] + [What the doctor thinks it is]
- Raw Report: "The patient's left lung shows some patchy white areas which could be pneumonia, but we need to rule out other causes."
- MedTri's Box:
Left Lung: Patchy white areas + Possible Pneumonia
It does this for every single body part mentioned. Now, instead of reading a confusing paragraph, the robot sees a clean, consistent list:
Left Lung: Patchy white areas + Possible Pneumonia
Heart: Normal size + No issues
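For readers who like to see the idea in code, the "box" above can be sketched as a tiny data structure. This is only an illustration; the field names and `serialize` format here are made up to mirror the explanation, not taken from the paper's actual schema:

```python
from dataclasses import dataclass

# A minimal sketch of the three-part "box": one record per anatomical region.
# Field names are illustrative, not the paper's actual schema.
@dataclass
class Finding:
    anatomy: str         # [Body Part]
    appearance: str      # [What it looks like]
    interpretation: str  # [What the doctor thinks it is]

    def serialize(self) -> str:
        return f"{self.anatomy}: {self.appearance} + {self.interpretation}"

# The messy free-text report reduced to clean, consistent entries:
findings = [
    Finding("Left Lung", "Patchy white areas", "Possible Pneumonia"),
    Finding("Heart", "Normal size", "No issues"),
]

for f in findings:
    print(f.serialize())
```

Every report, no matter how rambling, collapses into the same predictable shape, which is exactly what makes the training signal so much cleaner for the model.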
Why This is a Game Changer
The paper shows that when you use this "clean box" method, the robot learns much faster and better.
- It's Private and Fast: Usually, to clean up text this well, you need to send it to a giant, expensive cloud computer (like a super-smart AI service). MedTri is like a small, local robot that lives on your own computer. It does the cleaning quickly without sending your patient's private data to the internet.
- It's a "Smart" Cleaner: MedTri doesn't just delete words; it understands anatomy. It knows that "heart" and "lungs" are different, so it keeps them separate. It strips away the "fluff" but keeps the "meat" (the actual visual details).
- The "Training Drills" (Augmentation): The authors added two extra tricks to make the robot even smarter:
- MedTri-K (The Dictionary): If the robot sees "Pneumonia," this module adds a little note saying, "Oh, pneumonia usually looks like white clouds in the lung." It teaches the robot to connect the word to the picture.
- MedTri-C (The "What If" Game): This module creates tricky examples. It takes a report and swaps a detail, like saying "The right lung has pneumonia" when the picture shows the left. This forces the robot to pay close attention to exactly where things are, rather than just guessing based on general patterns.
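Both "training drills" can be sketched on top of that structured format. The lookup table and the left/right swap rule below are hypothetical stand-ins written for this summary, not the paper's actual modules:

```python
# MedTri-K style: attach a short visual note to a known medical term.
# KNOWLEDGE is a made-up stand-in for the paper's knowledge source.
KNOWLEDGE = {"Pneumonia": "often appears as cloudy white patches in the lung"}

def add_knowledge(entry: str) -> str:
    for term, note in KNOWLEDGE.items():
        if term.lower() in entry.lower():
            return f"{entry} ({note})"
    return entry

# MedTri-C style: build a tricky "hard negative" by swapping laterality,
# so the model must match findings to the correct side of the image.
def counterfactual(entry: str) -> str:
    if "Left" in entry:
        return entry.replace("Left", "Right")
    if "Right" in entry:
        return entry.replace("Right", "Left")
    return entry

entry = "Left Lung: Patchy white areas + Possible Pneumonia"
print(add_knowledge(entry))   # term is enriched with a visual description
print(counterfactual(entry))  # now claims the Right lung, which no longer
                              # matches an image showing left-sided disease
```

The counterfactual version is deliberately almost identical to the original, so the model cannot tell them apart without actually looking at the correct region of the image.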
The Results
The researchers tested this on thousands of X-rays and CT scans. They found that:
- Robots trained with MedTri's clean boxes were significantly better at diagnosing diseases than robots trained on the messy original reports.
- They were even better than robots trained on other "cleaned" versions of reports.
- The improvement was largest when the robot had only a few examples to learn from (like in a small hospital with fewer patients).
The Bottom Line
MedTri is like a universal translator that turns the chaotic, human way of writing medical reports into a clean, structured language that computers can actually understand. By organizing the information into neat, anatomy-based boxes, it helps AI doctors learn to see the world through the eyes of a radiologist, and because it runs locally, it does so faster, cheaper, and more privately than cloud-based cleanup.