CLADES - Contrastive Learning Augmented DifferEntial Splicing with Orthologous Positive Pairs

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Picture: Why Do We Need This?

Imagine your body is a massive construction site. You have a single blueprint (your DNA) that contains instructions for building everything from your heart to your brain. However, the construction crew doesn't just follow the blueprint blindly. They have a "cut-and-paste" editor that can rearrange the instructions. This is called Alternative Splicing.

By cutting and pasting different parts of the blueprint, the same gene can build a protein for a brain cell or a completely different protein for a skin cell.

The Problem:
Scientists want to predict how this editing happens. Specifically, they want to know: "If we look at a gene in a liver cell versus a brain cell, will the editor cut out a specific piece or keep it?"

The Challenge: We don't have enough labeled data. We know the DNA sequence, but we don't have enough "answers" (experimental data) for every single tissue or cell type to teach a computer how to predict these changes. It's like trying to learn a new language when you only have a dictionary but no conversation partners.

The Solution: CLADES (The Evolutionary Time Traveler)

The authors created a tool called CLADES. Instead of trying to learn from the limited human data we have, they decided to learn from evolution.

Here is the core idea, broken down with an analogy:

1. The "Twin" Analogy (Orthologous Pairs)

Imagine you have a twin who lives in a different country. You both grew up in the same family, so you share the same core personality and family rules, even if you wear different clothes or speak with a slight accent.

In biology, Orthologs are like these twins. A gene in a human and the same gene in a mouse (or a chicken, or a fish) are "twins." They have evolved from the same ancestor. Even though their DNA sequences might have changed slightly over millions of years, the job they do (the regulatory program) usually stays the same.

The Paper's Insight: If a specific DNA sequence tells a human cell to "cut this piece out," the equivalent sequence in a mouse likely tells the mouse cell to do the exact same thing.
The Strategy: CLADES treats the human sequence and the mouse sequence as a positive pair (two views of the same truth). It treats random, unrelated sequences as negatives (total strangers).

2. The "Language Learning" Analogy (Contrastive Learning)

How does the computer learn? It uses a method called Contrastive Learning.

Think of it like a teacher trying to teach a student to recognize a "Cat."

Old Way: Show the student 1,000 pictures of cats labeled "Cat" and 1,000 pictures of dogs labeled "Not Cat." (This requires a lot of labeled data).
CLADES Way: Show the student a picture of a cat in a hat and a picture of a cat in a sweater. Say, "These are the same animal." Then show a picture of a dog. Say, "This is different."
The Magic: The computer learns the essence of a "cat" (the regulatory rules) by realizing that the human version and the mouse version are "the same cat," even if they look slightly different. It learns the rules of the game rather than just memorizing the answers.

How It Works (Step-by-Step)

The Pre-Training (The Gym):
The model goes to the gym with a massive dataset of DNA sequences from many different species (humans, mice, dogs, etc.). It looks at a human gene and its "twin" in a mouse. It tries to push their digital fingerprints (embeddings) close together in a virtual space. It pushes unrelated genes far apart.
- Result: The model learns a "universal language" of how genes are regulated, based on what has survived millions of years of evolution.
The Fine-Tuning (The Specific Job):
Now, the model takes this general knowledge and applies it to a specific task: predicting how a gene behaves in a specific human tissue (like the liver). Because it already understands the deep rules of gene regulation, it only needs a tiny bit of human-specific data to get really good at the job.
The Prediction (The Crystal Ball):
The model predicts $\Delta\psi$ (Delta-Psi).
- Analogy: Imagine a volume knob on a radio. $\psi$ is the current volume. $\Delta\psi$ is how much the volume changes when you switch from "Jazz" (Brain) to "Rock" (Heart).
- CLADES predicts not just the volume, but the direction (does it get louder or quieter?) and the magnitude (does it go from a whisper to a shout?).

Why Is This a Big Deal?

It Works Where Data is Scarce: In many tissues or rare cell types, we don't have enough experimental data to train a normal AI. Because CLADES learned from evolution, it can make smart guesses even when human data is missing.
It Understands the "Why": The model didn't just memorize patterns; it learned that certain DNA "motifs" (like specific letter combinations) act as switches. When the researchers looked at what the model was paying attention to, they saw it focused on the exact spots where genes are cut and pasted (splice sites). This suggests the model is picking up real biological signal, not just arbitrary statistical patterns.
It's Better Than the Competition: When tested against the best existing models (like MTSplice), CLADES was more accurate at predicting how genes change between different tissues and cell types.

The Limitations (The Fine Print)

The authors are honest about the flaws:

Not All Twins Are Alike: Sometimes, a gene in a human and a gene in a fish have evolved to do totally different things. The model assumes they are the same, which isn't always true.
Zoom Level: The model looks at a specific window of DNA. It might miss a regulatory switch that is far away (like a remote control button that is far from the TV).
Noisy Data: Single-cell data (looking at individual cells) is very messy, like trying to hear a whisper in a crowded stadium.

The Takeaway

CLADES is like a student who learns the rules of grammar by reading thousands of books in different languages (evolution), rather than just memorizing a few sentences in English (human data). Because they understand the deep structure of the language, they can write perfect sentences in a new context (predicting splicing in new tissues) even if they've never seen that specific context before.

It turns the history of life on Earth into a powerful teacher for our AI.

1. Problem Statement

Alternative splicing (AS) is a critical mechanism for expanding transcript and protein diversity in eukaryotes. While RNA sequencing allows for the quantification of splicing events (typically reported as percent spliced-in, $\psi$), predicting the change in inclusion ($\Delta\psi$) between biological contexts (e.g., different tissues, cell types, or disease states) remains a significant challenge.
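
To make the two quantities concrete, here is a minimal Python sketch. It assumes the standard read-count definition of $\psi$ (inclusion reads divided by inclusion plus exclusion reads); the read counts and function names are hypothetical and chosen only for illustration.

```python
def psi(inclusion_reads: int, exclusion_reads: int) -> float:
    """Percent spliced-in: fraction of reads that support including the exon."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("no reads cover this splicing event")
    return inclusion_reads / total


def delta_psi(psi_a: float, psi_b: float) -> float:
    """Change in inclusion when moving from context A to context B."""
    return psi_b - psi_a


# Hypothetical read counts for one exon in two tissues.
psi_brain = psi(inclusion_reads=80, exclusion_reads=20)
psi_heart = psi(inclusion_reads=30, exclusion_reads=70)
change = delta_psi(psi_brain, psi_heart)
print(f"psi_brain={psi_brain:.2f} psi_heart={psi_heart:.2f} delta_psi={change:+.2f}")
```

The sign of the result gives the direction of the change (here the exon is included less in heart than in brain) and its absolute value gives the magnitude.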

Limitations of Current Methods: Existing deep learning models often rely on end-to-end architectures trained on limited, noisy, and context-specific experimental labels. They struggle to generalize across tissues and cell types due to:
- Scarcity of high-quality, context-specific RNA-seq data.
- Overfitting to protocol-specific artifacts (e.g., GC content bias).
- Inability to capture the complex, non-linear dependencies between sequence motifs, RNA-binding proteins (RBPs), and positional context that govern differential splicing.
The Core Challenge: Learning sequence-to-$\Delta\psi$ mappings without extensive tissue-specific supervision requires a method that can extract generalizable regulatory principles from sequence data alone.

2. Methodology

The authors propose CLADES, a framework that leverages Contrastive Learning (CL) grounded in evolutionary conservation to learn robust sequence representations.

A. Core Hypothesis: Evolution as Augmentation

The framework operates on the hypothesis that regulatory programs governing context-dependent exon inclusion are deeply conserved across species.

Orthologous Positive Pairs (OPPs): Exon-intron junction sequences from orthologous genes across different vertebrate species are treated as semantically consistent "views" of the same regulatory program. Despite sequence divergence, the regulatory function is preserved by stabilizing selection.
Contrastive Objective: The model is trained to pull embeddings of orthologous pairs closer together while pushing apart embeddings of non-homologous (negative) junctions. This forces the model to learn invariant features (conserved motifs, RBP binding sites) rather than species-specific noise.
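
The pull-together, push-apart objective can be sketched as a supervised-contrastive-style loss in plain Python. This is an illustrative sketch, not the authors' implementation: the `group_ids` labels (one per ortholog group), the temperature of 0.1, the L2 normalization, and the choice to contrast each anchor against every other sample in the batch are all assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(embeddings, group_ids, temperature=0.1):
    """SupCon-style loss: samples sharing a group id are treated as positives.

    embeddings: list of equal-length float vectors, one per junction sequence
    group_ids:  hypothetical ortholog-group label for each embedding
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    per_anchor = []
    for i in range(n):
        positives = [p for p in range(n) if p != i and group_ids[p] == group_ids[i]]
        if not positives:
            continue  # an anchor with no ortholog in the batch is skipped
        # Scaled similarity of the anchor to every other sample (assumed contrast set).
        logits = {c: sum(a * b for a, b in zip(z[i], z[c])) / temperature
                  for c in range(n) if c != i}
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits.values())
        log_denom = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        mean_log_prob = sum(logits[p] - log_denom for p in positives) / len(positives)
        per_anchor.append(-mean_log_prob)
    if not per_anchor:
        raise ValueError("batch contains no positive pairs")
    return sum(per_anchor) / len(per_anchor)


# Toy batch: two sequences per ortholog group, already well separated.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(supcon_loss(toy, ["exonA", "exonA", "exonB", "exonB"]))
```

The loss is small when embeddings from the same ortholog group sit close together and grows when the group labels disagree with the geometry, which is the pressure that drives the encoder toward conserved features.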

B. Data Processing

Pre-training Dataset: Constructed from the Multiz100way multiple sequence alignment of vertebrate genomes.
- Input: 300 bp of flanking intronic sequence + 100 bp of exonic sequence (upstream and downstream of the exon).
- Strategy: Human exons serve as "anchors," with orthologs from other species serving as positive pairs. Non-homologous sequences in the batch serve as negatives.
- Exclusion: Exons used in downstream fine-tuning (ASCOT/Tabula Sapiens) were strictly excluded from pre-training to prevent data leakage.
Fine-tuning Datasets:
- ASCOT: 56 human tissues (GTEx data) for tissue-specific $\Delta\psi$ prediction.
- Tabula Sapiens: 112 cell types (single-cell RNA-seq) for cell-type-specific $\Delta\psi$ prediction.
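
The pre-training input described above (300 bp of intron plus 100 bp of exon on each side of the exon) could be extracted roughly as follows. This is a sketch under stated assumptions: coordinates are 0-based, half-open, and on the plus strand; windows that run off the sequence are padded with `N`; exons shorter than 100 bp are not treated specially. The function name and padding choice are hypothetical, not taken from the paper.

```python
def junction_windows(seq, exon_start, exon_end, intron_bp=300, exon_bp=100):
    """Return (acceptor_window, donor_window) around one exon.

    acceptor window: upstream intron + start of the exon (3' splice site)
    donor window:    end of the exon + downstream intron (5' splice site)
    """
    def window(lo, hi):
        # Pad with N wherever the requested range leaves the sequence.
        left_pad = "N" * max(0, -lo)
        right_pad = "N" * max(0, hi - len(seq))
        return left_pad + seq[max(0, lo):min(len(seq), hi)] + right_pad

    acceptor = window(exon_start - intron_bp, exon_start + exon_bp)
    donor = window(exon_end - exon_bp, exon_end + intron_bp)
    return acceptor, donor


# Toy locus: intron ending in AG, a 120 bp exon, intron starting with GT.
toy_seq = "T" * 298 + "AG" + "C" * 120 + "GT" + "T" * 298
acceptor, donor = junction_windows(toy_seq, exon_start=300, exon_end=420)
print(len(acceptor), acceptor[298:300])  # intron/exon boundary sits at index 300
print(len(donor), donor[100:102])        # exon/intron boundary sits at index 100
```

Keeping the splice site at a fixed offset inside each window lets a position-aware encoder learn where motifs sit relative to the junction.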

C. Model Architecture

Encoder: Based on the MTSplice architecture, utilizing parallel Convolutional Neural Networks (CNN) and Spline transformation layers to capture position-dependent sequence motifs for both 5' and 3' contexts.
Contrastive Head: A two-layer projection head maps pooled feature maps into an embedding space ($d_{emb}=128$).
Loss Function: Supervised Contrastive Loss (SupCon). Unlike standard self-supervised CL (which uses two augmented views of one sample), CLADES uses multiple orthologous sequences as positive pairs per anchor.
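$L_{sup} = -\frac{1}{|I|} \sum_{i \in I} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i^\top z_p / \tau)}{\sum_{c \in C(i)} \exp(z_i^\top z_c / \tau)}$
Here $I$ is the set of anchors in a batch, $z_i$ is the embedding of anchor $i$, $P(i)$ is its set of orthologous positives, $C(i)$ is the set of sequences it is contrasted against in the denominator, and $\tau$ is a temperature that controls how sharply similarities are weighted.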
Fine-tuning: The pre-trained encoder is frozen or fine-tuned with a lightweight supervised head to predict $\Delta\psi$. The objective minimizes the Kullback–Leibler (KL) divergence between predicted and observed inclusion levels.
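
One way to make the KL objective concrete, assumed here and not confirmed by the text, is to treat each inclusion level as a Bernoulli distribution over "included" vs. "skipped" and compare observed to predicted. The function name and the clipping constant `eps` are hypothetical.

```python
import math


def bernoulli_kl(observed_psi, predicted_psi, eps=1e-7):
    """KL(observed || predicted) for two inclusion levels in [0, 1]."""
    # Clip away from 0 and 1 so the logarithms stay finite.
    p = min(max(observed_psi, eps), 1 - eps)
    q = min(max(predicted_psi, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


# The penalty grows as the prediction moves away from the observed level.
for predicted in (0.8, 0.7, 0.3):
    print(predicted, round(bernoulli_kl(0.8, predicted), 4))
```

The divergence is zero when prediction and observation agree and increases with the size of the miss, so minimizing it pushes predicted inclusion toward the measured value.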

D. Novel Classification Frameworks

To make predictions biologically interpretable, the authors introduce two classification tasks:

Tissue-Specific Regulation Classification (TSRC): Classifies exons as Up-regulated ($\Delta\psi^+$), Down-regulated ($\Delta\psi^-$), or Unchanged ($\Delta\psi^\emptyset$) relative to their mean inclusion.
Exon-Level Regulation Classification (ELRC): Specifically evaluates the model's ability to detect repression in highly included exons or activation in weakly included exons.
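
The TSRC labeling can be illustrated with a short sketch. The text says only that exons are classified relative to their mean inclusion; the cutoff of 0.1 and the use of an unweighted mean across tissues are assumptions made for this example, and the tissue values are invented.

```python
def tsrc_label(psi_tissue, psi_mean, threshold=0.1):
    """Label one exon in one tissue as up, down, or unchanged.

    threshold is a hypothetical cutoff on |delta psi|, not a value from the paper.
    """
    delta = psi_tissue - psi_mean
    if delta > threshold:
        return "up"
    if delta < -threshold:
        return "down"
    return "unchanged"


# Hypothetical inclusion levels for a single exon across three tissues.
psi_by_tissue = {"brain": 0.90, "heart": 0.55, "liver": 0.20}
psi_mean = sum(psi_by_tissue.values()) / len(psi_by_tissue)
labels = {tissue: tsrc_label(value, psi_mean) for tissue, value in psi_by_tissue.items()}
print(labels)
```

Turning the continuous $\Delta\psi$ into three classes makes it possible to score a model on direction of change with metrics such as AUPRC and AUROC.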

3. Key Contributions

Evolutionary Contrastive Learning: A novel pre-training strategy that uses orthologous sequences as positive pairs to learn invariant regulatory representations without requiring tissue-specific labels.
Generalization to Unseen Contexts: The model learns general regulatory principles that transfer effectively to diverse tissues and cell types, outperforming state-of-the-art (SOTA) models that rely on tissue-specific training data.
Interpretable Framework: Introduction of TSRC and ELRC tasks, shifting the focus from pure regression to biologically meaningful direction-of-change classification.
Saliency Analysis: Demonstration that the learned embeddings focus on canonical splice-site motifs (AG at acceptors, GT at donors) and conserved regulatory signals, validating the biological relevance of the learned features.

4. Results

The model was evaluated on the ASCOT (56 tissues) and Tabula Sapiens (112 cell types) datasets.

Regression Performance ($\Delta\psi$ Prediction):
- Tissue Level: CLADES achieved higher Spearman correlation ($\rho$) than the SOTA model (MTSplice) across nearly all 56 tissues. It showed particular robustness in tissues with limited data samples.
- Cell Level: CLADES outperformed the non-contrastive baseline in medium and high-sample cell type categories, achieving a $\rho$ of 0.75 in Basal cells.
Classification Performance:
- TSRC: CLADES showed significant improvements in AUPRC and AUROC for distinguishing up-regulated vs. unchanged exons (e.g., a 14% increase in AUPRC for the 300 bp intron+exon configuration).
- ELRC: The model improved precision and F1-scores for detecting context-dependent repression in highly included exons and activation in weakly included exons.
Ablation Studies:
- Input Window: 200 bp windows favored regression accuracy (magnitude), while 300 bp windows improved classification sensitivity (direction).
- Augmentation: Increasing the number of orthologous augmentations (from 5 to 10) consistently improved representation quality.
- Intron-only vs. Intron+Exon: While intron-only models benefited from CL, adding exonic sequence provided the best overall performance, highlighting the importance of splice-site context.
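
The regression results above are reported as Spearman correlation, which compares the rank order of predicted and observed $\Delta\psi$ values instead of their raw magnitudes. A minimal stdlib sketch of that metric follows; the $\Delta\psi$ values are invented for illustration and are not results from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented delta-psi values for five exons in one tissue.
observed = [-0.40, -0.10, 0.00, 0.20, 0.50]
predicted = [-0.20, 0.01, -0.05, 0.10, 0.30]
print(round(spearman(observed, predicted), 3))
```

Because only ranks matter, a model can score well on $\rho$ while still misjudging magnitudes, which is one reason the summary reports the classification tasks alongside it.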

5. Significance and Future Directions

Biological Insight: The study supports the view that evolutionary conservation is a powerful signal for learning splicing regulation. Its results suggest that "evolution-as-augmentation" can replace or augment expensive experimental labeling.
Scalability: The approach reduces dependence on noisy, context-specific annotations, making it feasible to predict splicing in rare cell types or disease states where data is scarce.
Limitations & Future Work:
- Assumes orthology implies regulatory equivalence (which may fail for rapidly evolving or lineage-specific exons).
- Current fixed windows may miss distal regulatory elements.
- Future work aims to integrate multimodal data (RBP binding, nucleosome positioning) and develop phylogeny-aware augmentation strategies to build a "splicing regulatory foundation model."

In conclusion, CLADES establishes a new paradigm for splicing prediction by leveraging evolutionary relationships to learn robust, transferable sequence representations, significantly advancing the ability to predict context-specific differential splicing.