Automated Histopathology Report Generation via Pyramidal Feature Extraction and the UNI Foundation Model

Imagine a pathologist as a detective trying to solve a crime, but instead of a crime scene, they are looking at a gigapixel image of a tissue sample. This image is so huge (billions of pixels) that a print at ordinary photo resolution would be several meters wide. Reading every single pixel to write a medical report is impossible for a human to do quickly, and a standard AI model cannot take in the whole image at once either.

This paper describes a new "AI detective" system that automates this process. Here is how it works, broken down into simple steps with some creative analogies:

1. The Problem: The "Needle in a Haystack"

A whole-slide image (WSI) is like a massive library containing billions of books, but the important information is hidden in just a few specific pages.

The Challenge: Standard AI models are like students who try to read every single book in the library at once. They get overwhelmed, run out of energy (computer memory), and often miss the important details.
The Solution: The authors built a system that acts like a smart librarian. Instead of reading the whole library, the librarian quickly scans the shelves, ignores the empty spaces (background glass), and only pulls out the specific books (tissue patches) that actually contain the story.

2. Step One: The "Smart Scan" (Pyramidal Scanning)

The system doesn't look at the image all at once. It uses a pyramid strategy:

The Wide View: First, it looks at a tiny, blurry thumbnail of the whole slide (like looking at a map from an airplane). It spots where the "interesting" tissue is and ignores the blank glass.
The Zoom In: Once it finds the tissue, it zooms in closer, like a detective using a magnifying glass. It breaks the image into small, manageable squares (patches).
The Quality Check: Before analyzing a square, it checks if the image is blurry (out of focus), too dark, or has dust on it. If a patch is garbage, it's thrown in the trash. Only the crisp, clear, high-quality patches move to the next step.

3. Step Two: The "Frozen Expert" (The UNI Model)

Now the system has a pile of high-quality tissue squares. It needs to understand what they are.

The Analogy: Imagine you have a world-renowned art critic who has studied millions of paintings. This critic is an expert, but they are very expensive to hire for every single job.
The Trick: Instead of hiring a new critic for every slide, the authors "freeze" this expert (called the UNI Foundation Model). They let the expert look at the tissue squares and write a detailed summary of what they see (e.g., "I see abnormal cells here," "This looks like lung tissue").
Why Freeze? By keeping the expert's brain frozen (not changing their knowledge), the computer saves massive amounts of energy and time. It's like using a pre-written encyclopedia entry instead of writing a new book from scratch.

4. Step Three: The "Translator" (The Decoder)

The expert (UNI) gives a technical summary, but it's not a full medical report yet. We need a translator to turn those technical notes into a readable story for a doctor.

The Translator: This is a lightweight AI (a Transformer decoder) trained specifically to speak "Medical English."
The Dictionary: Most AI models use a generic dictionary (like "car," "run," "blue"). This system uses a specialized biomedical one (the BioGPT tokenizer). As a result, when the AI writes "invasive ductal carcinoma," it is much less likely to chop the words into odd pieces like "in-vas-ive," so medical terms stay intact and precise.
The Result: The translator takes the expert's notes and writes a structured report: "Organ: Breast. Procedure: Biopsy. Diagnosis: Invasive Carcinoma."

5. Step Four: The "Fact Checker" (Retrieval Verification)

Even smart AI can sometimes "hallucinate"—make up facts that sound real but are wrong. In medicine, saying a tumor is "malignant" when it's "benign" is a disaster.

The Safety Net: Before the report is finalized, the system runs a fact-check. It compares the new report against a massive database of thousands of real, human-written reports.
The Swap: If the AI's report is a close enough match to a real, trusted report in the database (a similarity score above 0.85), the system swaps the AI's version with the real, proven version. It's like a student copying the answer from a trusted textbook because they know it's correct. If the report is unique (a rare disease), it keeps the AI's version but flags it for review.

The Bottom Line

This paper presents a system that is fast, efficient, and reliable.

Instead of building one giant, expensive AI model that tries to do everything at once, they built a team:
1. A Scout (Pyramid Scanner) to find the good spots.
2. A Frozen Expert (UNI) to identify the tissue.
3. A Specialized Translator (Decoder) to write the report.
4. A Fact Checker (Retrieval) to ensure accuracy.

In a recent competition among 24 teams, this approach came in 8th place, suggesting that you don't need the biggest, most expensive AI to get competitive results; you just need the right workflow. It's a smarter way to help pathologists diagnose cancer faster and more accurately.

1. Problem Statement

The paper addresses the challenge of Automated Histopathology Report Generation (AHRG), which involves generating diagnostic text from Whole Slide Images (WSIs). The core difficulties include:

Scale Disparity: WSIs are gigapixel-scale images (often $>10^{10}$ pixels), making them computationally intractable for standard vision-language models designed for small images (e.g., $224 \times 224$ ).
Semantic Density: The output requires precise, domain-specific medical terminology. Standard models often suffer from "hallucinations" (generating plausible but factually incorrect features) or lack the fine-grained spatial grounding needed for accurate diagnosis.
Computational Cost: End-to-end training of Multimodal Large Language Models (MLLMs) on pathology data is prohibitively expensive and prone to discarding rare diagnostic features through token pruning.
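
To make the scale gap concrete, here is a rough count. It assumes a slide of exactly $10^{10}$ pixels tiled without overlap, which is an illustrative simplification (real slides vary in size and are mostly background glass):

$$
\frac{10^{10}}{224 \times 224} = \frac{10^{10}}{50{,}176} \approx 1.99 \times 10^{5} \ \text{tiles}
$$

Under that assumption, a model with a $224 \times 224$ input would need on the order of two hundred thousand forward passes to cover one slide. This is why selecting a small subset of informative patches matters.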

2. Methodology

The authors propose a modular, hierarchical vision-language framework consisting of four sequential stages:

A. Hierarchical Pyramidal Patch Selection & Filtering

To handle gigapixel inputs, the system employs a coarse-to-fine scanning strategy rather than processing the full image at once.

Pyramidal Scanning: The WSI is processed at multiple resolution levels (downsampling factors $2^3$ to $2^6$ ).
Tissue Segmentation: A binary tissue mask is generated using HSV color space thresholding (filtering out background glass) and refined via morphological operations.
Quality-Aware Filtering: Candidate patches ($256 \times 256$) are retained only if they meet specific criteria:
- Focus Quality: Measured by Laplacian variance (rejecting out-of-focus patches).
- Exposure/Artifacts: Rejection based on HSV Value/Saturation ranges and dark pixel fractions (detecting dust or pen marks).
Sampling: A maximum budget of 2,500 patches per WSI is enforced via stratified random sampling across pyramid levels to ensure multi-scale representation.
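
The filtering and sampling steps above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code. Every threshold (`min_tissue`, `min_sat`, `max_dark`, `dark_v`, `min_focus`) is a hypothetical placeholder, the tissue test uses saturation only, and the even split across levels is one possible reading of "stratified random sampling".

```python
import numpy as np

def saturation_value(rgb):
    """Return HSV saturation and value channels in [0, 1] for a uint8 RGB patch."""
    x = rgb.astype(np.float64) / 255.0
    v = x.max(axis=-1)
    chroma = v - x.min(axis=-1)
    s = np.where(v > 0, chroma / np.maximum(v, 1e-12), 0.0)
    return s, v

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over interior pixels; low values suggest blur."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def keep_patch(rgb, min_tissue=0.5, min_sat=0.05, max_dark=0.2,
               dark_v=0.15, min_focus=50.0):
    """Apply tissue, exposure and focus checks. All thresholds are illustrative guesses."""
    s, v = saturation_value(rgb)
    tissue_fraction = float((s > min_sat).mean())  # glass is nearly colourless
    dark_fraction = float((v < dark_v).mean())     # pen marks or dust show up as dark pixels
    gray = rgb.astype(np.float64).mean(axis=-1)
    return bool(tissue_fraction >= min_tissue
                and dark_fraction <= max_dark
                and laplacian_variance(gray) >= min_focus)

def stratified_sample(candidates_by_level, budget=2500, seed=0):
    """Pick at most `budget` patches, split as evenly as possible across pyramid levels."""
    rng = np.random.default_rng(seed)
    # Visit the smallest pools first so their unused quota rolls over to larger ones.
    order = sorted(candidates_by_level, key=lambda lvl: len(candidates_by_level[lvl]))
    chosen, remaining = {}, budget
    for i, level in enumerate(order):
        pool = candidates_by_level[level]
        k = min(remaining // (len(order) - i), len(pool))
        idx = rng.choice(len(pool), size=k, replace=False) if k else []
        chosen[level] = [pool[j] for j in sorted(idx)]
        remaining -= k
    return chosen
```

In this sketch, a level with few surviving patches contributes everything it has, and the remaining budget is shared among the other levels, so the total never exceeds 2,500.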

B. Feature Extraction (Frozen Encoder)

Model: The UNI (Universal Pathology) Vision Transformer (ViT-Large/16), pre-trained on over 100 million histopathology patches, is used as a frozen feature extractor.
Strategy: The encoder parameters (307M) are fixed. This reduces GPU memory requirements significantly (from ~16GB to ~4GB) and allows for pre-computing and caching features.
Output: Each selected patch is converted into a 1024-dimensional feature vector.
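
Because the encoder is frozen, its outputs for a given slide never change, so they can be computed once and reused across training runs. The following is a minimal sketch of such a cache. The function name `cached_features`, the `encode` callable (a stand-in for a forward pass through the frozen encoder) and the one-file-per-slide layout are all assumptions made for illustration.

```python
from pathlib import Path
import numpy as np

FEATURE_DIM = 1024  # per-patch embedding size stated in the text

def cached_features(slide_id, patches, encode, cache_dir):
    """Return an (n_patches, 1024) float32 array, running `encode` at most once per slide.

    `encode` is a hypothetical callable wrapping the frozen encoder; it is only
    invoked when no cached file exists for `slide_id`.
    """
    path = Path(cache_dir) / f"{slide_id}.npy"
    if path.exists():
        return np.load(path)
    feats = np.asarray(encode(patches), dtype=np.float32)
    if feats.ndim != 2 or feats.shape[1] != FEATURE_DIM:
        raise ValueError(f"expected (n, {FEATURE_DIM}) features, got {feats.shape}")
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, feats)
    return feats
```

At the stated budget of 2,500 patches, one slide's cache holds 2,500 × 1,024 float32 values, or about 10 MB, which is small next to the gigapixel image it summarizes.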

C. Text Generation (Trainable Decoder)

Architecture: A lightweight, custom 6-layer Transformer decoder is trained on top of the frozen visual features.
Tokenization: The BioGPT tokenizer is used instead of generic tokenizers to better handle biomedical terminology (e.g., histological grades, cellular descriptions), reducing token fragmentation.
Mechanism: The decoder uses cross-attention to dynamically focus on relevant visual patch features while generating text autoregressively.
Training: Optimized using AdamW with a two-phase learning rate schedule (warmup + decay) over 350 epochs.
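
The cross-attention mechanism can be illustrated with a single-head, NumPy-only sketch. The text tokens being generated act as queries, and the cached patch features supply keys and values. The projection matrices `w_q`, `w_k`, `w_v` and the dimensions used here are hypothetical, and the real decoder presumably adds multiple heads, masking, residual connections and normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_states, patch_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    text_states: (T, d_text) decoder states, one row per generated token
    patch_feats: (N, d_vis) frozen visual features, one row per patch
    Returns the attended context (T, d) and the attention weights (T, N).
    """
    q = text_states @ w_q                     # queries come from the text side
    k = patch_feats @ w_k                     # keys come from the patches
    v = patch_feats @ w_v                     # values come from the patches
    scores = q @ k.T / np.sqrt(q.shape[-1])   # similarity of each token to each patch
    weights = softmax(scores, axis=-1)        # each token's distribution over patches
    return weights @ v, weights
```

Each row of `weights` sums to one and shows how strongly a given output token draws on each patch, which is what lets the decoder weight different regions of the slide for different parts of the report.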

D. Retrieval-Based Verification (Post-Processing)

To mitigate hallucinations, a verification step is added:

Generated reports are encoded using Sentence-BERT.
They are compared against a corpus of ground-truth reports from the training set using cosine similarity.
Replacement Logic: If the similarity exceeds a threshold ( $\tau = 0.85$ ), the generated report is replaced with the retrieved ground-truth reference. Reports below the threshold are kept, assuming they represent valid but rare patterns.
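
The replacement logic above can be sketched as follows. The sketch assumes the report embeddings have already been computed by a sentence encoder such as Sentence-BERT and are passed in as plain arrays. The function name `verify_report` and its return format are illustrative, not taken from the paper.

```python
import numpy as np

def verify_report(generated_text, generated_vec, corpus_texts, corpus_vecs, tau=0.85):
    """Swap in the nearest reference report if its cosine similarity exceeds tau.

    generated_vec: (d,) embedding of the generated report
    corpus_vecs:   (n, d) embeddings of the reference reports
    Returns (final_text, best_similarity, was_replaced).
    """
    g = generated_vec / np.linalg.norm(generated_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ g                      # cosine similarity to every reference
    best = int(np.argmax(sims))
    if sims[best] > tau:
        return corpus_texts[best], float(sims[best]), True
    return generated_text, float(sims[best]), False
```

Returning the similarity and a replaced flag alongside the text makes it straightforward to audit how often the swap fires and to inspect reports that fall just below the threshold.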

3. Key Contributions

Hierarchical Pyramidal Scanning: A scalable strategy ( $2^3$ to $2^6$ downsampling) that prioritizes tissue regions while suppressing background and artifacts, making gigapixel processing tractable.
Modular Frozen-Encoder Design: Integrating the UNI foundation model as a frozen extractor with a lightweight decoder, avoiding the computational cost of end-to-end MLLM training while retaining robust morphological representations.
Domain-Specific Tokenization: Utilizing the BioGPT tokenizer to align the vocabulary with biomedical terminology, improving decoding efficiency and accuracy.
Retrieval-Augmented Verification: A novel post-processing step that uses Sentence-BERT to swap high-similarity generated outputs with ground-truth references, improving reliability without complex reinforcement learning from human feedback (RLHF) training.

4. Results

The framework was evaluated on the REG 2025 Grand Challenge dataset (10,494 WSI-report pairs across 7 organ systems).

Performance: The team (MedInsight-ViseurAI) achieved a composite ranking score of 0.8093, placing 8th out of 24 teams. This was within 4.7% of the top-performing method.
Qualitative Analysis:
- Strengths: The model demonstrated high accuracy in identifying organ sites, biopsy types, and primary diagnoses for common pathologies (e.g., invasive carcinoma, squamous cell carcinoma). It consistently adhered to standardized reporting templates.
- Weaknesses: Errors were observed in complex, multi-attribute grading schemas (e.g., distinguishing in situ vs. invasive carcinoma, or precise Gleason scoring), likely due to the combinatorial complexity of rare diagnostic combinations in the training data.
Efficiency: The modular approach allowed iterative experimentation in resource-constrained environments, which training billion-parameter end-to-end models would not.

5. Significance

This work demonstrates that competitive automated report generation does not necessarily require massive, end-to-end multimodal training. By combining:

Efficient, interpretable patch selection,
Pre-trained foundation models (frozen encoders),
Domain-adapted tokenization, and
Retrieval-based safety checks,

the authors achieved a robust system that balances computational efficiency with diagnostic reliability. The approach offers a practical pathway for clinical deployment, prioritizing structural consistency and minimizing hallucinations, both of which are critical for medical applications. Future work aims to address complex grading schemas through structured prediction and to validate the system across diverse institutional datasets.