RTGMFF: Enhanced fMRI-based Brain Disorder Diagnosis via ROI-driven Text Generation and Multimodal Feature Fusion

The paper introduces RTGMFF, a multimodal framework for fMRI-based brain disorder diagnosis. It combines deterministic ROI-driven text generation with a hybrid frequency-spatial encoder and adaptive semantic alignment to overcome signal noise and inter-subject variability, and it achieves superior performance on the ADHD-200 and ABIDE benchmarks.

Junhao Jia, Yifei Sun, Yunyou Liu, Cheng Yang, Changmiao Wang, Feiwei Qin, Yong Peng, Wenwen Min

Published 2026-03-03

Imagine your brain is a massive, bustling city with 116 distinct neighborhoods (the Regions of Interest, or ROIs). In a healthy city, traffic flows smoothly, and the neighborhoods talk to each other in a coordinated rhythm. In a city with a disorder like ADHD or Autism, the traffic lights might be broken, some neighborhoods are screaming while others are silent, and the connections between them are chaotic.

For a long time, doctors and AI have tried to diagnose these "brain cities" by looking at the raw traffic data (fMRI scans). But this data is messy, noisy, and hard to read. It's like trying to understand a city's problems by staring at a million raw numbers on a spreadsheet without any context.

The paper introduces RTGMFF, a new "Brain Detective" system that tackles this problem in three clever ways. Think of it as a team of three specialists working together:

1. The Translator: Turning Data into a Story (ROI-driven Text Generation)

The Problem: Computers are great at math but bad at "feeling" the story behind the numbers. Most AI models just look at the raw brain scan numbers and try to guess the disease. They miss the context. Also, doctors love reading reports, not spreadsheets.

The Solution: The first part of RTGMFF is a Translator.

  • It looks at the activity in each of the 116 brain neighborhoods.
  • Instead of just keeping the numbers, it converts them into simple, readable English sentences.
  • Analogy: Imagine a translator who looks at a chaotic traffic report and writes a clear sentence: "The downtown district (Frontal Lobe) is in a panic (high activity), while the library (Temporal Lobe) is asleep (low activity)."
  • It also adds the patient's age and gender to the story, because a 10-year-old's brain behaves differently than a 40-year-old's.
  • Why it helps: By turning complex data into a "story," the AI can understand the meaning of the brain activity, not just the math.
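The translation step can be pictured as deterministic templating: bin each ROI's activation into a coarse level and emit a fixed sentence. The sketch below is illustrative only, not the authors' implementation; the thresholds, ROI names, and sentence template are all assumptions.

```python
# Hypothetical sketch of deterministic ROI-to-text generation.
# Thresholds (+/- 1.0 on z-scored activity), ROI names, and the sentence
# template are illustrative assumptions, not taken from the paper.

def activity_level(z):
    """Map a z-scored mean activation to a coarse label."""
    if z > 1.0:
        return "high"
    if z < -1.0:
        return "low"
    return "normal"

def roi_report(roi_values, age, sex):
    """roi_values: dict mapping ROI name -> z-scored mean activation."""
    lines = [f"Subject: age {age}, sex {sex}."]
    for name, z in roi_values.items():
        lines.append(f"The {name} shows {activity_level(z)} activity.")
    return " ".join(lines)

report = roi_report(
    {"left frontal lobe": 1.7, "right temporal lobe": -1.3},
    age=10, sex="male",
)
print(report)
```

Because the mapping is deterministic, the same scan always yields the same sentences, which is what makes the generated "story" reproducible and auditable.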

2. The Super-Scout: Seeing the City in 4D (Hybrid Frequency-Spatial Encoder)

The Problem: Previous AI models were like cameras that only took photos. They saw where things were happening (spatial) but missed how the activity was vibrating or changing over time (frequency). It's like trying to understand a song by looking at a photo of the sheet music; you miss the rhythm and the melody.

The Solution: The second part is a Super-Scout with special glasses.

  • The Wavelet-Mamba Branch: This part acts like a high-speed drone that zooms in on the "rhythm" of the brain. It uses a technique called "Wavelets" to break the brain signals down into different frequencies (like separating the bass from the treble in music). It uses a new, efficient AI structure called "Mamba" to scan these rhythms quickly without getting overwhelmed.
  • The Transformer Branch: This part acts like a city planner looking at the big picture. It connects the dots between distant neighborhoods to see the long-range relationships.
  • The Fusion: The Scout combines the "rhythm" (frequency) and the "map" (spatial) into one perfect view. It's like listening to the city's music while looking at its map simultaneously.
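The "separating bass from treble" idea can be shown with one level of a Haar wavelet transform, which splits a time series into a slow (low-frequency) trend and a fast (high-frequency) detail signal. This is only the wavelet half of the branch, as a minimal sketch; the paper's actual decomposition and the Mamba scanning on top of it are not reproduced here.

```python
import numpy as np

def haar_dwt(x):
    """One level of a Haar discrete wavelet transform.
    Returns (approximation, detail): the low- and high-frequency parts,
    each half the length of the input."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # smooth trend (low frequency)
    detail = (even - odd) / np.sqrt(2)   # rapid changes (high frequency)
    return approx, detail

# Toy "ROI time series": a slow drift plus a fast oscillation.
t = np.arange(64)
signal = np.sin(2 * np.pi * t / 64) + 0.3 * np.sin(2 * np.pi * t / 4)
approx, detail = haar_dwt(signal)
print(approx.shape, detail.shape)  # (32,) (32,)
```

Because the Haar transform is orthonormal, no signal energy is lost in the split; the two halves together carry exactly the information of the original series, just sorted by rhythm.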

3. The Bridge Builder: Making the Story and the Map Agree (Adaptive Semantic Alignment)

The Problem: Now the AI has two different views of the patient: a Story (the text generated by the Translator) and a Map (the visual features from the Super-Scout). The challenge is making sure these two views agree with each other. If the story says "panic" but the map shows "calm," the AI gets confused.

The Solution: The third part is a Bridge Builder.

  • It forces the "Story" and the "Map" to speak the same language. It uses a special mathematical trick (Cosine Similarity) to nudge them closer together until they tell the exact same truth.
  • Analogy: Imagine two witnesses describing a crime. One is a poet (the text), and the other is a security camera (the image). The Bridge Builder makes sure the poet's description of the "red car" matches the camera's pixel data of the "red car." If they don't match, the system learns to adjust until they do.
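The paper's alignment relies on cosine similarity; one common way to turn that into a training signal is a loss of the form 1 minus the cosine similarity between the text and image embeddings, which is zero when they point the same way. The exact loss used in RTGMFF may differ; this is a hedged sketch of the general idea.

```python
import numpy as np

def cosine_alignment_loss(text_emb, image_emb, eps=1e-8):
    """Alignment penalty: 1 - cosine similarity of the two embeddings.
    Near 0 when they agree, up to 2 when they point in opposite directions.
    The loss form itself is an assumption, not quoted from the paper."""
    t = np.asarray(text_emb, dtype=float)
    v = np.asarray(image_emb, dtype=float)
    cos = t @ v / (np.linalg.norm(t) * np.linalg.norm(v) + eps)
    return 1.0 - cos

aligned = cosine_alignment_loss([1.0, 0.0], [2.0, 0.0])      # same direction
misaligned = cosine_alignment_loss([1.0, 0.0], [0.0, 1.0])   # orthogonal
print(aligned, misaligned)
```

Minimizing this quantity during training nudges the two witnesses, the poet and the camera, toward the same account.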

The Result: A Better Diagnosis

When the researchers tested this new detective team on real-world data (ADHD and Autism datasets), it worked better than any previous method.

  • Accuracy: It correctly identified disorders more often than older models.
  • Reliability: It was better at spotting the disease when it was there (Sensitivity) and correctly saying "no disease" when it wasn't (Specificity).
  • Interpretability: Because it generates text, doctors can actually read why the AI made a diagnosis, making it trustworthy.

In Summary

RTGMFF is a brain diagnostic tool that doesn't just crunch numbers. It writes a story about what the brain is doing, listens to the rhythm of the brain waves, and forces the story and the rhythm to agree before making a final diagnosis. It's like upgrading from a calculator to a team of expert detectives who can read, listen, and reason all at once.