An Integrated Deep Learning Framework for Small-Sample Biomedical Data Classification: Explainable Graph Neural Networks with Data Augmentation for RNA sequencing Dataset

This study proposes an integrated deep learning framework that combines data augmentation, feature selection, and explainable graph neural networks to achieve high-accuracy, biologically interpretable classification of small-sample RNA-Seq datasets, demonstrating superior performance on chromophobe renal cell carcinoma and other diseases.

Guler, F., Goksuluk, D., Xu, M., Choudhary, G., Agraz, M.

Published 2026-02-24

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to teach a computer to spot a specific type of kidney cancer called Chromophobe Renal Cell Carcinoma (KICH). The computer needs to look at a massive library of genetic instructions (RNA) to make this diagnosis.

However, there are two huge problems:

  1. The Library is Too Big: There are nearly 20,000 genes to read, but the computer can only handle a few at a time. It's like trying to find a needle in a haystack the size of a mountain.
  2. The Library is Empty: We only have a tiny number of patient samples (about 91). It's like trying to learn how to drive a car by sitting in the driver's seat for only 10 minutes. The computer gets confused and makes mistakes because it hasn't seen enough examples.

This paper presents a clever "Integrated Deep Learning Framework" to solve these problems. Here is how they did it, explained with simple analogies:

1. Cleaning the Mess (Preprocessing)

Before the computer can learn, the data is messy. It's like a photo that is too bright, too dark, and full of static.

  • What they did: They filtered out the "noise" (genes that don't matter), normalized the brightness (so every gene is measured fairly), and transformed the data so the computer can understand it.
  • The Analogy: Imagine you are preparing a salad. You wash the lettuce, remove the wilted leaves, and chop everything into uniform sizes so the salad tastes consistent.
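The three preprocessing steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's exact pipeline: the function name, the `min_total` threshold, and the counts-per-million scaling are assumptions chosen to show the filter → normalize → transform pattern.

```python
import numpy as np

def preprocess(counts, min_total=10):
    """Sketch of RNA-Seq preprocessing: filter, normalize, log-transform.

    counts: (n_samples, n_genes) matrix of raw read counts.
    """
    # 1. Filter the "noise": drop genes with almost no reads in any sample.
    keep = counts.sum(axis=0) >= min_total
    counts = counts[:, keep]

    # 2. Normalize the "brightness": scale each sample to counts-per-million,
    #    so deeply sequenced samples don't dominate shallow ones.
    cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6

    # 3. Transform: log2(x + 1) squeezes the huge dynamic range of gene
    #    expression into values a neural network can learn from.
    return np.log2(cpm + 1.0), keep
```

The returned `keep` mask records which genes survived filtering, so predictions can later be mapped back to gene names.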

2. Making More Ingredients (Data Augmentation)

Since they didn't have enough patient samples, they had to create "fake" but realistic ones to teach the computer.

  • The Problem: If you only show a child 5 pictures of a cat, they might think all cats are orange. They need to see 50 pictures of different cats to learn what a cat really is.
  • The Solution: They used three different "recipe" tricks to create new samples:
    • Linear Interpolation: Like taking two photos of a cat (one orange, one black) and blending them to create a "muddy" orange-brown cat.
    • SMOTE: Like finding a group of similar cats and creating a new one by mixing their features.
    • MixUp: This was the star of the show. It takes two different samples and blends them together (like mixing red and blue paint to make purple) to teach the computer that the line between "healthy" and "sick" isn't always a hard wall; sometimes it's a gradient.
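The blending tricks above are simple to write down. Below is a minimal numpy sketch of MixUp (blend two samples *and* their labels with a Beta-distributed weight) plus a SMOTE-style interpolation between a sample and a same-class neighbour; hyperparameters like `alpha=0.2` are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """MixUp: blend two samples and their labels with the same weight."""
    lam = rng.beta(alpha, alpha)       # mixing coefficient in (0, 1)
    x = lam * x1 + (1 - lam) * x2      # blended gene-expression profile
    y = lam * y1 + (1 - lam) * y2      # soft label: a gradient, not a hard wall
    return x, y

def smote_like(x, neighbor):
    """SMOTE-style: a new point on the segment from x to a same-class neighbor."""
    u = rng.uniform()
    return x + u * (neighbor - x)
```

Note the key difference: plain interpolation and SMOTE mix samples *within* a class, while MixUp also mixes the labels, which is exactly the "healthy-to-sick gradient" idea described above.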

3. The Three Student Models (Deep Learning Architectures)

The researchers tested three different "students" (AI models) to see who could learn the best:

  • MLP (Multi-Layer Perceptron): The Traditional Student. It's a standard, reliable student who has been around for a long time. It does a good job but isn't the fastest or most creative.
  • KAN (Kolmogorov-Arnold Network): The Efficient Genius. This is a brand-new type of student. Instead of using a rigid textbook, it uses flexible, custom-made tools (splines) to solve problems. It learns faster, uses less energy, and its reasoning is easier to follow.
  • GNN (Graph Neural Network): The Social Networker. This is the winner. Unlike the others who look at genes in isolation, the GNN looks at how genes "talk" to each other. It builds a map of relationships (a graph) where genes are friends. If Gene A is friends with Gene B, and Gene B is sick, the GNN knows Gene A is likely involved too. It understands the context.
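The "genes talking to their friends" idea is a single matrix operation at its core. Here is a sketch of one graph-convolution layer in the style of Kipf and Welling's GCN; this is a generic illustration, not the paper's specific architecture, and the weight matrix `w` would normally be learned by training.

```python
import numpy as np

def gcn_layer(adj, h, w):
    """One graph-convolution step (GCN-style sketch).

    adj: (n, n) adjacency matrix of the gene "friendship" graph.
    h:   (n, d) current feature vector for each gene (node).
    w:   (d, k) learnable weight matrix.
    """
    a_hat = adj + np.eye(adj.shape[0])        # every gene also "talks" to itself
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # symmetric degree normalization
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt
    # Each gene averages its neighbours' features, mixes them through w,
    # then applies a ReLU. Stacking layers lets information travel further.
    return np.maximum(a_norm @ h @ w, 0.0)
```

One layer lets Gene A hear from its direct friends; two stacked layers let it hear from friends-of-friends, which is how "Gene B is sick, so Gene A is likely involved" propagates through the graph.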

4. The Results: Who Won?

When they combined the "MixUp" recipe (creating blended samples) with the "Social Networker" (GNN), the results were incredible.

  • Accuracy: The model achieved 99.47% accuracy. That's like a doctor diagnosing 100 patients and getting 99 of them right.
  • Why it won: The GNN understood the complex relationships between genes, and the MixUp data gave it enough practice to stop guessing and start knowing.

5. The "Why" (Explainable AI)

Usually, AI is a "Black Box." You put data in, and a result comes out, but you don't know why. In medicine, knowing why is crucial.

  • The Solution: They used XAI (Explainable AI) to open the box. They asked the GNN, "Which genes made you decide this patient has cancer?"
  • The Discovery: The AI pointed to 20 specific genes (like HNF4A, DACH2, NAT2).
  • The Validation: The researchers checked medical literature and found that these exact genes are already known to be involved in kidney cancer. This proved the AI wasn't just guessing; it had found real biological truths. It's like the AI saying, "I know this is a fire because I smell smoke, see flames, and feel heat," and the scientists saying, "Yes, those are exactly the signs of a fire."

6. The Takeaway

This paper shows that by:

  1. Cleaning the data,
  2. Creating smart "practice" samples (augmentation),
  3. Using a model that understands relationships (GNN), and
  4. Asking the model to explain its logic (XAI),

we can build powerful tools to diagnose rare cancers even when we have very little data. It turns a "small sample" problem into a "big breakthrough" opportunity, offering hope for faster, more accurate, and more trustworthy medical diagnoses in the future.
