Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations

This paper proposes a transformer-based framework for skin cancer case retrieval that effectively combines reference images and textual descriptors by learning hierarchical representations and performing joint global-local alignment, thereby achieving state-of-the-art performance on the Derm7pt dataset to support clinical decision-making.

Yuheng Wang, Yuji Lin, Dongrun Zhu, Jiayue Cai, Sunil Kalia, Harvey Lui, Chunqi Chang, Z. Jane Wang, Tim K. Lee

Published Wed, 11 Ma

Imagine you are a detective trying to solve a mystery, but instead of a crime scene, you are looking at a skin lesion (a spot on the skin) to figure out if it's dangerous (skin cancer) or harmless.

In the past, doctors had two main ways to get help from a computer:

  1. Show a picture: "Find me other spots that look exactly like this one." (Like using Google Images).
  2. Read a description: "Find me spots that are black, have jagged edges, and are growing fast." (Like a text search).

But in real life, doctors rarely use just one or the other. They usually say, "Look at this specific spot, and tell me if it matches any cases that have these specific features."

This paper introduces a new AI system designed to do exactly that. Here is how it works, broken down into simple concepts:

1. The "Hybrid Detective" (Composed Retrieval)

Think of the old systems as detectives who are either blind (only looking at photos) or illiterate (only reading text). This new system is a hybrid detective that can do both at the same time.

When a doctor asks a question, they give the AI a "package":

  • The Photo: A picture of the patient's skin spot.
  • The Clue Card: A short text note describing what the doctor sees (e.g., "irregular border," "blue-white veil").

The AI combines these two into a single, powerful search query. It's like telling a librarian, "Find me a book that looks like this cover, but has a plot twist described in this sentence."
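The idea of combining a photo and a note into one search query can be sketched in a few lines. This is a minimal illustration, not the paper's actual architecture: the embedding dimensions, the random vectors standing in for encoder outputs, and the simple "average the two modalities" fusion are all assumptions for demonstration (real composed-retrieval models typically learn the fusion).

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Stand-ins for encoder outputs (hypothetical 128-dim embeddings)
image_emb = l2_normalize(rng.normal(size=128))   # the reference photo
text_emb  = l2_normalize(rng.normal(size=128))   # "irregular border, blue-white veil"

# Simplest possible fusion: average the two modalities into one query vector
query = l2_normalize(image_emb + text_emb)

# A database of 100 past-case embeddings (random stand-ins here)
database = l2_normalize(rng.normal(size=(100, 128)), axis=1)

# Cosine similarity against every stored case, highest first
scores = database @ query
ranking = np.argsort(-scores)
print(ranking[:5])  # indices of the five most similar past cases
```

The doctor's photo and note end up as a single vector, so retrieval reduces to one similarity search over the case database.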

2. The "Zoom Lens" vs. The "Wide Angle" (Global & Local Alignment)

The biggest challenge in skin cancer is that the most important clues are often tiny details hidden inside a larger picture.

  • The Wide Angle (Global): The AI looks at the whole picture to understand the general vibe. Is the spot big? Is it red overall? This ensures the AI doesn't get confused by completely different types of skin.
  • The Zoom Lens (Local): The AI uses a special "magic magnifying glass" to zoom in on tiny, specific spots. It looks for the "smoking guns"—like a tiny streak of black pigment or a weird texture—that actually determine if it's cancer.

The Analogy: Imagine trying to identify a suspect in a crowd.

  • Global tells you, "He's a tall man wearing a red hat."
  • Local tells you, "He has a specific scar on his left cheek and is holding a blue umbrella."
  • The Problem: If you only look at the hat, you might pick the wrong person. If you only look at the scar, you might miss the person if they are wearing a different hat.
  • The Solution: This AI does both. It checks the hat (global) to make sure it's the right crowd, but it prioritizes the scar (local) to make the final decision.
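The "wide angle" and "zoom lens" can be expressed as two different similarity computations. The sketch below is illustrative, with made-up dimensions and random stand-ins for encoder outputs: the global score compares one summary vector per image, while the local score matches each query patch against its best-matching candidate patch (one common way to do fine-grained alignment; the paper's exact formulation may differ).

```python
import numpy as np

rng = np.random.default_rng(1)

def l2n(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical encoder outputs for a query and one candidate case
q_global = l2n(rng.normal(size=64))           # whole-image summary ("wide angle")
c_global = l2n(rng.normal(size=64))

q_local = l2n(rng.normal(size=(16, 64)), 1)   # 16 patch vectors ("zoom lens")
c_local = l2n(rng.normal(size=(16, 64)), 1)

# Global score: a single cosine similarity between summary vectors
global_score = float(q_global @ c_global)

# Local score: match each query patch to its most similar candidate patch,
# then average -- tiny details like a pigment streak drive this term
sim = q_local @ c_local.T                     # 16 x 16 patch-to-patch similarities
local_score = float(sim.max(axis=1).mean())

print(global_score, local_score)
```

The global term keeps the search in "the right crowd"; the local term rewards candidates that share the specific, fine-grained clues.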

3. The "Weighted Score" (The Final Decision)

Once the AI finds potential matches, it has to decide which one is the best. It uses a special scoring system.

Think of it like a jury.

  • The "Global" evidence is the general character of the suspect.
  • The "Local" evidence is the specific crime details.

In skin cancer, the specific details (the local evidence) are usually the most important for a diagnosis. So, the AI's "Judge" gives a heavier weight to the local clues. However, it doesn't ignore the global clues completely, because you still need to make sure the whole picture makes sense.
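The "judge" weighing local evidence more heavily can be written as a simple weighted sum. The weight of 0.7 below is an illustrative choice, not a value from the paper:

```python
def final_score(local_score, global_score, alpha=0.7):
    """Blend the two kinds of evidence; alpha > 0.5 favours local clues."""
    return alpha * local_score + (1 - alpha) * global_score

# Strong local match, weaker global match
print(final_score(0.9, 0.4))  # 0.7*0.9 + 0.3*0.4 = 0.75
```

With `alpha = 0.7`, a candidate with a strong local match but only a passable global match still scores well, mirroring how the specific details usually decide the diagnosis.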

4. Why This Matters (The Result)

The researchers tested this system on a public database of skin images (Derm7pt).

  • Old methods were like guessing based on a blurry photo or a vague description. They got it right about 77-78% of the time for the very first guess.
  • This new method got it right about 79.3% of the time for the very first guess.

While a one-to-two-point gain might look small, in the medical world, getting the very first answer right is huge. It means a doctor can immediately see a similar, confirmed case from the past, helping them make a faster, more confident decision for their patient.
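"Getting it right on the very first guess" is usually measured as Recall@1: the fraction of queries whose top-ranked result is a correct match. A minimal sketch (the toy rankings and ground truth below are made up for illustration):

```python
def recall_at_1(rankings, ground_truth):
    """Fraction of queries whose top-ranked case is a correct match.

    rankings:     one ranked list of case IDs per query
    ground_truth: one set of correct case IDs per query
    """
    hits = sum(1 for ranked, truth in zip(rankings, ground_truth)
               if ranked[0] in truth)
    return hits / len(rankings)

rankings = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
truth    = [{3}, {1}, {2}]
print(recall_at_1(rankings, truth))  # 2 of 3 queries hit on the first guess
```

On this scale, the paper's reported improvement corresponds to the top result being correct for roughly one to two extra queries out of every hundred.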

Summary

This paper is about teaching an AI to be a better medical assistant. Instead of just matching pictures or just matching words, it learns to look at a picture while reading a description, zooming in on the tiny, dangerous details while keeping an eye on the big picture. This helps doctors find the most relevant past cases faster, leading to better care for patients with skin cancer.