CFCML: A Coarse-to-Fine Crossmodal Learning Framework For Disease Diagnosis Using Multimodal Images and Tabular Data

This paper proposes CFCML, a novel coarse-to-fine crossmodal learning framework that bridges the modality gap between medical images and tabular data through multi-granularity feature exploration and a hierarchical anchor-based relationship mining strategy, achieving state-of-the-art disease diagnosis performance on MEN and Derm7pt datasets.

Tianling Liu, Hongying Liu, Fanhua Shang, Lequan Yu, Tong Han, Liang Wan

Published 2026-03-23

Imagine you are a detective trying to solve a complex medical mystery. To catch the culprit (the disease), you have two types of clues:

  1. The Visual Clues: High-tech medical images (like MRI scans) that show where the problem is and what it looks like.
  2. The Written Clues: A patient's chart (tabular data) with facts like age, symptoms, and medical history.

The problem is, these two types of clues speak completely different languages. The images are huge, detailed, and visual, while the charts are short, text-based, and factual. Trying to combine them is like trying to mix oil and water; they don't naturally blend, and often, the computer gets confused, missing important details.

This paper introduces a new detective tool called CFCML (Coarse-to-Fine Crossmodal Learning). Think of it as a super-smart translator and organizer that helps the computer understand both clues perfectly. Here is how it works, broken down into simple steps:

1. The "Coarse" Stage: The Rough Draft

The Problem: If you try to compare a whole MRI scan (which has thousands of tiny pixels) with a short sentence about a patient's age, it's an unfair fight. The image is too loud, and the text is too quiet. Also, early parts of an image show general shapes, while deep parts show specific details.

The Solution (MG-CIE Module):
Imagine you are organizing a library.

  • The "Coarse" step: Instead of trying to read every single book (pixel) at once, the system creates a "summary" of the image at different levels of detail. It looks at the big picture first, then zooms in.
  • The Translation: It takes the short text clues and expands them slightly, and it condenses the massive image data. It forces them to be the same "size" so they can talk to each other.
  • The Result: It creates a rough draft where the image and the text start to understand each other. It's like having a translator who says, "Okay, this picture of a tumor matches with this sentence about 'pain in the head'."
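The coarse step above can be sketched in a few lines. This is a minimal NumPy illustration with made-up sizes and random projections standing in for learned layers; the names (`shared_dim`, `project`) and all dimensions are invented for the example, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes (illustrative, not the paper's): an image
# backbone emits features at several granularities; the tabular side is
# a short vector of patient attributes.
image_feats = {                       # granularity -> (tokens, channels)
    "coarse": rng.normal(size=(16, 256)),
    "fine":   rng.normal(size=(64, 512)),
}
tab_feat = rng.normal(size=(8,))      # e.g. age, sex, symptom flags
shared_dim = 128                      # the common "language" both map into

def project(x, out_dim, rng):
    """Linear map into the shared space (stand-in for a learned layer)."""
    w = rng.normal(size=(x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return x @ w

# Condense each image scale to one summary token, then map it to shared_dim.
image_tokens = {
    scale: project(feat.mean(axis=0), shared_dim, rng)
    for scale, feat in image_feats.items()
}

# Expand the short tabular vector into the same shared_dim.
tab_token = project(tab_feat, shared_dim, rng)
```

After this step every modality, at every granularity, lives in the same 128-dimensional space, so the image summaries and the tabular token can be compared directly.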

2. The "Fine" Stage: The Expert Review

The Problem: Even after the rough draft, the computer might still be distracted by irrelevant details. It might focus on the background of the photo instead of the tumor, or get confused by a patient's name instead of their age. It needs to focus only on what matters for the specific disease.

The Solution (CCRM Strategy):
This is where the system gets really smart. It uses a technique called "Class-Aware Relationship Mining."

  • The "Class" Concept: Think of "classes" as different types of suspects (e.g., "Benign Tumor" vs. "Cancer").
  • The "Anchors": The system creates three types of "anchors" (like magnets) to pull the right clues together:
    1. Sample Anchors: It looks at individual patients and asks, "Does this patient's image and chart look like other patients with the same disease?"
    2. Unimodal Anchors: It creates a "perfect average" image for a specific disease and a "perfect average" chart for that same disease.
    3. Crossmodal Anchors: It creates a "perfect average" combination of both image and chart for that disease.

The Magic: The system uses these anchors to pull all the "Cancer" clues together into one tight group and push all the "Benign" clues far away. It's like a bouncer at a club who only lets people with the same VIP pass (disease type) into the same room, regardless of whether they arrived by car (image) or bus (text). This forces the computer to ignore the noise and focus only on the features that actually define the disease.
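The pull/push idea can be sketched as a small anchor-based contrastive loss. This is a toy NumPy version under invented assumptions (6 patients, 2 classes, 4-dimensional embeddings, simple class means as anchors, a softmax-style loss); the paper's actual CCRM formulation and anchor construction may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy embeddings already in the shared space: 6 patients, 2 classes.
labels = np.array([0, 0, 0, 1, 1, 1])          # 0 = benign, 1 = cancer
img_emb = rng.normal(size=(6, 4))              # image-side embeddings
tab_emb = rng.normal(size=(6, 4))              # tabular-side embeddings

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img_emb, tab_emb = l2norm(img_emb), l2norm(tab_emb)

# Unimodal anchors: the "perfect average" per class, per modality.
img_anchor = np.stack([img_emb[labels == c].mean(0) for c in (0, 1)])
tab_anchor = np.stack([tab_emb[labels == c].mean(0) for c in (0, 1)])

# Crossmodal anchor: one "perfect average" combining both modalities.
cross_anchor = l2norm((img_anchor + tab_anchor) / 2)

def anchor_loss(emb, anchors, labels, temp=0.1):
    """Pull each sample toward its class anchor, push it from the rest."""
    sims = emb @ anchors.T / temp              # (samples, classes)
    sims -= sims.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

loss = anchor_loss(img_emb, cross_anchor, labels)
```

Minimizing a loss like this tightens each class into its own cluster around its anchor, regardless of which modality a sample came from, which is the "bouncer with a VIP pass" effect described above.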

Why is this better than before?

Previous methods were like trying to glue the image and text together with duct tape. They often missed the small, important details in the images or ignored the specific context of the text.

  • Old Way: "Here is a picture, here is a list of numbers. Let's just smash them together."
  • New Way (CFCML): "Let's first translate them so they speak the same language (Coarse), and then let's organize them by who they belong to, grouping all the 'Cancer' cases together and separating them from the 'Healthy' cases (Fine)."

The Results

The researchers tested this on two real-world medical datasets:

  1. Brain Tumors (MEN): Diagnosing different grades of meningiomas.
  2. Skin Lesions (Derm7pt): Distinguishing between moles and melanoma.

The new method outperformed the previous "State-of-the-Art" (SOTA) methods on both datasets. It was more accurate, better at spotting the tricky cases, and its attention heatmaps showed the model focusing on the actual disease regions in the images rather than on irrelevant background.

In a Nutshell

CFCML is a two-step process that teaches a computer to stop treating medical images and patient charts as strangers. First, it translates them into a common language. Second, it organizes them into tight-knit groups based on the specific disease, ensuring the computer learns exactly what to look for to make a life-saving diagnosis.
