Imagine trying to understand a complex city. You have two very different ways to study it:
- The Aerial View: You fly a drone high above the city. You see the big picture: the layout of neighborhoods, the flow of traffic, and the overall shape of the skyline. This gives you global context, but you might miss the specific details of how two specific houses are connected.
- The Street Map: You get down on the ground and draw a map of specific neighborhoods (Regions of Interest or ROIs). You draw lines connecting them to show how they interact (like a bus route or a power line). This gives you detailed local connections, but you lose the sense of the city's overall shape.
For a long time, doctors and AI researchers studying brain disorders (like ADHD or Autism) have been stuck choosing between these two views. Some AI models only looked at the "Aerial View" (the whole brain scan), while others only looked at the "Street Map" (connections between specific brain parts). Both approaches worked reasonably well on their own, but nobody knew whether combining them would help or whether they simply captured the same information twice.
The Problem: The "Silos"
The authors of this paper noticed that previous attempts to combine these two views were messy. It was like dumping a smoothie and a salad into the same blender: the result was a mush, and it was hard to tell whether any improvement came from the ingredients themselves or just from the blending. They needed a way to combine the two views cleanly and measure exactly what each one contributed.
The Solution: The "Translation Bridge"
The team at Lehigh University built a new system they call Joint Imaging–ROI Representation Learning. Here is how it works, using a simple analogy:
Imagine you have two experts trying to describe the same person to a judge:
- Expert A describes the person's entire body (height, build, posture).
- Expert B describes the person's specific features (a scar on the chin, a unique tattoo, a limp).
In the past, these experts spoke different languages, so the judge couldn't easily compare their notes. The new system acts as a universal translator.
- The Two Encoders: The system uses two specialized "translators." One translates the whole brain scan into a summary code. The other translates the brain's connection map into a summary code.
- The Bridge (Contrastive Alignment): This is the magic part. The system forces these two different codes to agree with each other. It says, "If Expert A and Expert B are talking about the same person, their codes must look very similar. If they are talking about different people, the codes must look very different."
- The Result: Because the system forces them to agree, it creates a shared "language" (a common embedding space) where the global view and the local view can be compared and combined directly.
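The "bridge" step above can be sketched in code. This is a minimal, hypothetical illustration of a symmetric contrastive alignment loss (in the style of CLIP-like models), not the paper's actual implementation: embeddings of the same subject from the two encoders are pulled together, and embeddings of different subjects are pushed apart. All names and the temperature value are illustrative assumptions.

```python
import numpy as np

def contrastive_alignment_loss(img_emb, roi_emb, temperature=0.1):
    """Symmetric contrastive loss between two views of the same subjects.

    img_emb, roi_emb: (batch, dim) arrays; row i of each describes subject i.
    Low loss means matched rows agree and mismatched rows disagree.
    """
    # L2-normalize so the dot product becomes cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    roi = roi_emb / np.linalg.norm(roi_emb, axis=1, keepdims=True)
    logits = img @ roi.T / temperature          # (batch, batch) similarities
    labels = np.arange(len(img))                # subject i matches subject i

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()  # diagonal = correct pairs

    # Average both directions: imaging -> ROI and ROI -> imaging
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

As a sanity check, embeddings that agree subject-by-subject should score a much lower loss than the same embeddings matched to the wrong subjects.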
What They Discovered
When they tested this on real patient data (from the ADHD-200 and ABIDE datasets), they found three amazing things:
- The Whole is Greater than the Sum of Parts: Just like having both an aerial view and a street map helps you navigate a city better than either one alone, combining the brain scan and the connection map made the AI significantly better at diagnosing disorders.
- They See Different Things: The system didn't just double the same information. The "Aerial View" (Imaging) spotted broad patterns, while the "Street Map" (ROI) spotted specific connection issues. They were complementary, like a wide-angle lens and a zoom lens working together.
- It's Robust: In the real world, sometimes a patient's scan is blurry, or a specific brain map is missing. The system is so well-trained that if one view is missing (like a foggy day for the drone), the other view can still carry the weight, keeping the diagnosis accurate.
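Because the two views live in one shared space, the fallback behavior described above can be sketched very simply: a single downstream classifier can score whichever embedding happens to be available. This is an illustrative assumption about how such a system could degrade gracefully, not the paper's actual code.

```python
import numpy as np

def fuse(img_emb=None, roi_emb=None):
    """Average the available view embeddings; fall back to either one alone.

    Works because both encoders were trained to map into the same space,
    so a classifier trained on the fused vector can still run when one
    view (a blurry scan, a missing connection map) is absent.
    """
    present = [e for e in (img_emb, roi_emb) if e is not None]
    if not present:
        raise ValueError("need at least one view")
    return np.mean(present, axis=0)
```

On a "foggy day" for the drone, `fuse(roi_emb=...)` simply passes the street-map embedding through unchanged.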
Why This Matters
This isn't just about getting a slightly higher score on a test. By understanding how the AI makes decisions, the researchers found that the system was looking at the exact same brain areas that human doctors know are involved in ADHD and Autism (like the frontal lobe and the limbic system).
In short: This paper built a smart "bridge" that lets two different ways of looking at the brain talk to each other. By forcing them to agree, the AI learned a much richer, more complete picture of brain disorders, leading to better, more reliable diagnoses. It's the difference between trying to solve a puzzle with half the pieces versus having the whole picture clearly visible.