Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations

This paper proposes a transformer-based framework for skin cancer case retrieval that effectively combines reference images and textual descriptors by learning hierarchical representations and performing joint global-local alignment, thereby achieving state-of-the-art performance on the Derm7pt dataset to support clinical decision-making.

Yuheng Wang, Yuji Lin, Dongrun Zhu, Jiayue Cai, Sunil Kalia, Harvey Lui, Chunqi Chang, Z. Jane Wang, Tim K. Lee

Published Wed, 11 Ma

Imagine you are a detective trying to solve a mystery, but instead of a crime scene, you are looking at a skin lesion (a spot on the skin) to figure out if it's dangerous (skin cancer) or harmless.

In the past, doctors had two main ways to get help from a computer:

  1. Show a picture: "Find me other spots that look exactly like this one." (Like using Google Images).
  2. Read a description: "Find me spots that are black, have jagged edges, and are growing fast." (Like a text search).

But in real life, doctors rarely use just one or the other. They usually say, "Look at this specific spot, and tell me if it matches any cases that have these specific features."

This paper introduces a new AI system designed to do exactly that. Here is how it works, broken down into simple concepts:

1. The "Hybrid Detective" (Composed Retrieval)

Think of the old systems as detectives who are either blind (only looking at photos) or illiterate (only reading text). This new system is a hybrid detective that can do both at the same time.

When a doctor asks a question, they give the AI a "package":

  • The Photo: A picture of the patient's skin spot.
  • The Clue Card: A short text note describing what the doctor sees (e.g., "irregular border," "blue-white veil").

The AI combines these two into a single, powerful search query. It's like telling a librarian, "Find me a book that looks like this cover, but has a plot twist described in this sentence."
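The idea of combining a photo and a note into one search query can be sketched in a few lines. This is a minimal illustration, not the paper's actual architecture: the embedding dimensions, the random vectors standing in for encoder outputs, and the simple "average the two modalities" fusion are all assumptions for demonstration (real composed-retrieval models typically learn the fusion).

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Stand-ins for encoder outputs (hypothetical 128-dim embeddings)
image_emb = l2_normalize(rng.normal(size=128))   # the reference photo
text_emb  = l2_normalize(rng.normal(size=128))   # "irregular border, blue-white veil"

# Simplest possible fusion: average the two modalities into one query vector
query = l2_normalize(image_emb + text_emb)

# A database of 100 past-case embeddings (random stand-ins here)
database = l2_normalize(rng.normal(size=(100, 128)), axis=1)

# Cosine similarity against every stored case, highest first
scores = database @ query
ranking = np.argsort(-scores)
print(ranking[:5])  # indices of the five most similar past cases
```

The doctor's photo and note end up as a single vector, so retrieval reduces to one similarity search over the case database.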

2. The "Zoom Lens" vs. The "Wide Angle" (Global & Local Alignment)

The biggest challenge in skin cancer is that the most important clues are often tiny details hidden inside a larger picture.

  • The Wide Angle (Global): The AI looks at the whole picture to understand the general vibe. Is the spot big? Is it red overall? This ensures the AI doesn't get confused by completely different types of skin.
  • The Zoom Lens (Local): The AI uses a special "magic magnifying glass" to zoom in on tiny, specific spots. It looks for the "smoking guns"—like a tiny streak of black pigment or a weird texture—that actually determine if it's cancer.

The Analogy: Imagine trying to identify a suspect in a crowd.

  • Global tells you, "He's a tall man wearing a red hat."
  • Local tells you, "He has a specific scar on his left cheek and is holding a blue umbrella."
  • The Problem: If you only look at the hat, you might pick the wrong person. If you only look at the scar, you might miss the person if they are wearing a different hat.
  • The Solution: This AI does both. It checks the hat (global) to make sure it's the right crowd, but it prioritizes the scar (local) to make the final decision.
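The "wide angle" and "zoom lens" can be expressed as two different similarity computations. The sketch below is illustrative, with made-up dimensions and random stand-ins for encoder outputs: the global score compares one summary vector per image, while the local score matches each query patch against its best-matching candidate patch (one common way to do fine-grained alignment; the paper's exact formulation may differ).

```python
import numpy as np

rng = np.random.default_rng(1)

def l2n(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical encoder outputs for a query and one candidate case
q_global = l2n(rng.normal(size=64))           # whole-image summary ("wide angle")
c_global = l2n(rng.normal(size=64))

q_local = l2n(rng.normal(size=(16, 64)), 1)   # 16 patch vectors ("zoom lens")
c_local = l2n(rng.normal(size=(16, 64)), 1)

# Global score: a single cosine similarity between summary vectors
global_score = float(q_global @ c_global)

# Local score: match each query patch to its most similar candidate patch,
# then average -- tiny details like a pigment streak drive this term
sim = q_local @ c_local.T                     # 16 x 16 patch-to-patch similarities
local_score = float(sim.max(axis=1).mean())

print(global_score, local_score)
```

The global term keeps the search in "the right crowd"; the local term rewards candidates that share the specific, fine-grained clues.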

3. The "Weighted Score" (The Final Decision)

Once the AI finds potential matches, it has to decide which one is the best. It uses a special scoring system.

Think of it like a jury.

  • The "Global" evidence is the general character of the suspect.
  • The "Local" evidence is the specific crime details.

In skin cancer, the specific details (the local evidence) are usually the most important for a diagnosis. So, the AI's "Judge" gives a heavier weight to the local clues. However, it doesn't ignore the global clues completely, because you still need to make sure the whole picture makes sense.
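The "judge" weighing local evidence more heavily can be written as a simple weighted sum. The weight of 0.7 below is an illustrative choice, not a value from the paper:

```python
def final_score(local_score, global_score, alpha=0.7):
    """Blend the two kinds of evidence; alpha > 0.5 favours local clues."""
    return alpha * local_score + (1 - alpha) * global_score

# Strong local match, weaker global match
print(final_score(0.9, 0.4))  # 0.7*0.9 + 0.3*0.4 = 0.75
```

With `alpha = 0.7`, a candidate with a strong local match but only a passable global match still scores well, mirroring how the specific details usually decide the diagnosis.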

4. Why This Matters (The Result)

The researchers tested this system on a public database of skin images (Derm7pt).

  • Old methods were like guessing based on a blurry photo or a vague description. They got it right about 77-78% of the time for the very first guess.
  • This new method got it right about 79.3% of the time for the very first guess.

While a one-to-two-point gain might look small, in the medical world, getting the very first answer right is huge. It means a doctor can immediately see a similar, confirmed case from the past, helping them make a faster, more confident decision for their patient.
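"Getting it right on the very first guess" is usually measured as Recall@1: the fraction of queries whose top-ranked result is a correct match. A minimal sketch (the toy rankings and ground truth below are made up for illustration):

```python
def recall_at_1(rankings, ground_truth):
    """Fraction of queries whose top-ranked case is a correct match.

    rankings:     one ranked list of case IDs per query
    ground_truth: one set of correct case IDs per query
    """
    hits = sum(1 for ranked, truth in zip(rankings, ground_truth)
               if ranked[0] in truth)
    return hits / len(rankings)

rankings = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
truth    = [{3}, {1}, {2}]
print(recall_at_1(rankings, truth))  # 2 of 3 queries hit on the first guess
```

On this scale, the paper's reported improvement corresponds to the top result being correct for roughly one to two extra queries out of every hundred.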

Summary

This paper is about teaching an AI to be a better medical assistant. Instead of just matching pictures or just matching words, it learns to look at a picture while reading a description, zooming in on the tiny, dangerous details while keeping an eye on the big picture. This helps doctors find the most relevant past cases faster, leading to better care for patients with skin cancer.