Imagine you walk into a massive, chaotic library where millions of books (medical scans) are stacked in piles. Each pile represents a specific "series" of images taken of a patient. The problem? The labels on the spines are often torn off, written in different languages, or completely missing. Sometimes, the librarian (the hospital system) wrote "MRI of the liver" on one book and "Abdomen scan" on another, even though they are the same thing.
Doctors and AI researchers need to sort these piles quickly and accurately to diagnose patients. If they sort them wrong, the wrong tests get run, or the diagnosis gets missed.
This paper presents a new, super-smart librarian assistant designed to solve this mess. Here is how it works, broken down into simple concepts:
1. The Problem: The "Broken Label" Dilemma
Traditionally, computers tried to sort these medical image piles in two ways:
- The "Look at the Picture" approach: The computer looks at the actual images (the MRI slices) to guess what they are. This is good, but it's like trying to guess the plot of a movie just by looking at one random frame. It misses the big picture.
- The "Read the Label" approach: The computer reads the digital tags (metadata) attached to the files. This is fast, but often the tags are missing, contradictory, or written in a confusing shorthand. It's like trying to sort books when half the spines have no writing at all.
2. The Solution: A "Super-Team" Approach
The authors built a system that acts like a detective team with two specialists who talk to each other constantly.
Specialist A: The Visual Detective (The Image Encoder)
This specialist looks at the actual pictures. But instead of just looking at one picture, they use a clever trick called 2.5D.
- The Analogy: Imagine you have a loaf of bread (the 3D organ). Instead of eating the whole loaf at once (which is hard for a computer) or just looking at one crumb (one slice), this specialist takes 10 evenly spaced slices from the loaf.
- The Magic: These slices talk to each other. If Slice 3 looks like a liver, it asks Slice 7, "Hey, does that look like a liver too?" This helps the system understand the 3D shape without getting overwhelmed by data.
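The slice-sampling half of this idea is easy to sketch in code. Below is a minimal illustration (not the authors' implementation; the function name and NumPy usage are my own) of picking evenly spaced slices from a 3D volume:

```python
import numpy as np

def sample_slices(volume, n_slices=10):
    """Pick n evenly spaced slices along the depth axis of a 3D volume.

    A minimal sketch of the 2.5D sampling idea: keep a handful of
    representative slices instead of the whole volume or a single slice.
    """
    depth = volume.shape[0]
    # np.linspace gives evenly spaced positions from the first to the
    # last slice; round them to valid integer indices.
    idx = np.linspace(0, depth - 1, n_slices).round().astype(int)
    return volume[idx]

# A toy 40-slice "loaf": sampling keeps 10 evenly spaced slices.
vol = np.zeros((40, 64, 64))
slices = sample_slices(vol)
print(slices.shape)  # (10, 64, 64)
```

The "slices talk to each other" part would then be handled by an attention mechanism over these 10 slices, which is far cheaper than processing the full 3D volume.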
Specialist B: The Label Detective (The Sparse Metadata Encoder)
This specialist reads the digital tags. But here is the genius part: They don't panic when tags are missing.
- The Analogy: Imagine you are trying to identify a person based on a description card. If the card says "Height: 6ft" but leaves "Hair Color" blank, a normal computer might get confused or try to guess (impute) the hair color, which often leads to errors.
- The Innovation: This specialist uses a "Dictionary" approach. It only looks at the information that is there. If "Hair Color" is missing, it simply ignores that slot and focuses on "Height" and "Age." It doesn't try to fill in the blanks; it just uses what it has. This makes it incredibly robust against messy, incomplete data.
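The "use only what's there" idea can be sketched as a lookup table of learned vectors, one per (field, value) pair, where missing fields are simply skipped rather than imputed. The field names and the averaging step below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

# One "dictionary" vector per (field, value) pair. In a real model these
# would be learned embeddings; here they are random, for illustration.
EMBED_DIM = 8
dictionary = {
    ("plane", "axial"): rng.normal(size=EMBED_DIM),
    ("plane", "coronal"): rng.normal(size=EMBED_DIM),
    ("contrast_phase", "late"): rng.normal(size=EMBED_DIM),
    ("contrast_phase", "arterial"): rng.normal(size=EMBED_DIM),
}

def encode_metadata(tags):
    """Encode only the tags that are present; missing fields are
    skipped, never guessed."""
    vectors = [dictionary[(field, value)]
               for field, value in tags.items()
               if (field, value) in dictionary]
    if not vectors:
        # No usable tags at all: fall back to a neutral (zero) embedding.
        return np.zeros(EMBED_DIM)
    return np.mean(vectors, axis=0)

# A complete record and a sparse one both yield valid embeddings.
full = encode_metadata({"plane": "axial", "contrast_phase": "late"})
sparse = encode_metadata({"plane": "axial"})  # contrast phase missing
print(full.shape, sparse.shape)  # (8,) (8,)
```

The key property is that a record with one tag and a record with ten tags both map into the same embedding space, so downstream layers never need to know which fields were absent.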
3. The Secret Sauce: The "Two-Way Conversation" (Cross-Attention)
In older systems, the Visual Detective and the Label Detective would work alone and then just slap their notes together at the end. It was like two people shouting across a room without listening.
In this new system, they use Bi-Directional Cross-Attention.
- The Analogy: Imagine the Visual Detective is looking at a blurry picture of a liver. They turn to the Label Detective and ask, "Does the tag say 'Contrast Phase: Late'?"
- The Label Detective replies, "Yes, it does! That means the dark spots you see are likely blood vessels, not tumors."
- The Visual Detective then says, "Ah, got it. And the tag also says 'Axial Plane,' so the slice I'm looking at is a cross-section?"

- The Label Detective nods, "Yes, that explains the shape."
They constantly refine each other's understanding. If the image is ambiguous, the metadata helps. If the metadata is missing, the image fills the gap. They create a single, unified "series-level" decision.
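This two-way conversation can be sketched with plain cross-attention run in both directions: image tokens query the metadata tokens, metadata tokens query the image tokens, and both refined streams are pooled into one series-level vector. This is a bare NumPy illustration of the mechanism, not the paper's architecture (dimensions, pooling, and the residual connections are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Each query token attends over all tokens of the other modality."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ keys_values

rng = np.random.default_rng(0)
img_tokens = rng.normal(size=(10, 16))   # e.g. one token per 2.5D slice
meta_tokens = rng.normal(size=(4, 16))   # one token per present tag

# The two-way conversation: images query the metadata, and vice versa,
# each stream refined by a residual update from the other.
img_refined = img_tokens + cross_attention(img_tokens, meta_tokens)
meta_refined = meta_tokens + cross_attention(meta_tokens, img_tokens)

# Pool both refined streams into a single series-level vector.
series_vec = np.concatenate([img_refined.mean(0), meta_refined.mean(0)])
print(series_vec.shape)  # (32,)
```

Because each direction produces a residual update, a modality that carries no useful signal (say, an empty tag set) degrades gracefully instead of corrupting the other stream.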
4. The Results: Why It Matters
The team tested this system on thousands of liver MRI scans from different hospitals.
- The Score: It got a 96.6% accuracy rate, beating every other method they tried.
- The Robustness: Even when they tested it on data from a completely different hospital (where the labels were written differently), it still performed incredibly well.
- The Lesson: They proved that you don't need to "fix" missing data (imputation). In fact, trying to guess missing data often makes things worse. It's better to have a system that knows how to work with what it has.
Summary
Think of this paper as introducing a smart sorting machine for medical scans. Instead of relying on broken labels or just guessing from pictures, it uses a team of experts that constantly chat with each other. One looks at the pictures, the other reads the tags, and they only use the information that is actually there. This makes them incredibly good at organizing medical data, even when the data is messy, incomplete, or comes from different places.