Pan-cell-type prediction of splicing patterns from sequence and splicing factor expression

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine your DNA is a massive, ancient instruction manual for building a human. But here's the twist: the manual doesn't just have one set of instructions. It has a "Choose Your Own Adventure" feature called splicing.

In every cell of your body, the same DNA manual is open, but the cell decides which chapters to read and which to skip. This is how a skin cell knows to be a skin cell and a brain cell knows to be a brain cell, even though they share the exact same book. When a cell picks the wrong chapters, it can contribute to diseases like cancer or Alzheimer's.

For a long time, scientists have tried to build AI that can read this DNA manual and predict which chapters a cell will pick. But there was a big problem: Context.

The Problem: The "One-Size-Fits-All" AI

Previous AI models were like a rigid librarian. They said, "Okay, if you are a liver cell, I will use the Liver Rulebook. If you are a brain cell, I will use the Brain Rulebook."

The Flaw: This works fine if you only have a few known types of cells. But what if you have a sick cell? A cancer cell? A cell that's been zapped by a drug in a lab? These don't fit neatly into "Liver" or "Brain" boxes. The old AI couldn't handle them because it didn't know which "Rulebook" to pull off the shelf.

The Solution: PanExonNet (The "Smart Context" AI)

The researchers at GSK built a new AI called PanExonNet. Instead of having a separate rulebook for every cell type, PanExonNet has a universal translator that looks at the cell's current mood.

Here is how it works, using a simple analogy:

1. The DNA is the Script

Think of a gene as a script for a play. The script has lines, but some lines are optional.

2. The Splicing Factors are the Director

In a real theater, the Director decides which lines get cut and which scenes get added. In a cell, these "Directors" are proteins called Splicing Factors.

If the Director is tired, they might cut a whole scene.
If the Director is excited, they might add a solo.
The "mood" of the Director is determined by how many of these proteins are present in the cell.

3. The Old AI vs. The New AI

Old AI (Borzoi/Pangolin): It asked, "What kind of theater is this? Is it a Comedy Club or a Tragedy Hall?" It had to be told the cell type from a fixed list, then applied the rule for that type. If the theater was a weird mix (like a cancer cell), the AI had no rule to apply.
PanExonNet: It asks, "Who is the Director right now, and what is their energy level?" It looks at the list of proteins (the "mood") and says, "Ah, the Director is in a 'high-energy' mood, so let's keep the fast-paced scenes." It doesn't care what type of cell it is; it only cares about the current instructions the cell is giving.

Why This is a Big Deal

1. It Learns from "Weird" Cells
Because PanExonNet doesn't need to know the cell's name (e.g., "Liver"), it can learn from any cell. It can look at a cancer cell line, a cell that has been genetically tweaked in a lab, or a rare disease state, figure out the "Director's mood," and predict the outcome. It's like a translator who can follow a conversation in a dialect they have never heard before, as long as they can read the speakers' tone of voice.

2. It Reads the "Fine Print"
Previous models were good at reading the main text (the overall gene expression). PanExonNet is like a super-sleuth that reads the footnotes and marginalia. It predicts exactly where the "cuts" happen in the DNA script, down to the single letter. It can even predict complex "jump cuts" where two distant parts of the script are glued together, which is crucial for understanding diseases.

3. The "Contextualizable Convolution" (The Magic Goggles)
The paper introduces a new technical trick called "contextualizable convolution." Imagine the AI has a pair of smart glasses.

When the AI looks at the DNA, it puts on these glasses.
The glasses change the lens based on the "Director's mood" (the splicing factors).
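Suddenly, the same stretch of letters can carry a different weight: a signal the AI would play down under one Director might become important under another. The letters themselves never change, only how much attention the AI pays to them.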
This allows the AI to be flexible and adapt to any situation instantly, without needing to retrain itself for every new cell type.

The Real-World Impact

Why should you care?

Better Medicine: Models like this could help predict how a specific patient's unique DNA mutation will behave in their specific disease state.
Drug Design: They could guide the design of drugs that act like a "Director," telling the cell to cut out the bad scenes (disease-causing proteins) and keep the good ones.
Understanding the Unseen: They could help predict what's happening inside cells we can't easily reach (like deep in the brain), as long as we can measure or estimate the "Director's mood" there.

In short: PanExonNet stops asking "What is this cell?" and starts asking "What is this cell doing right now?" That lets it predict splicing in cell states that earlier, label-based models could not handle.

1. Problem Statement

Alternative splicing is a critical mechanism for generating cell-type-specific protein diversity, yet its dysregulation is linked to neurodegeneration, autoimmunity, and cancer. Current deep learning models for predicting RNA expression from DNA sequence face three major limitations:

Discrete Context Handling: State-of-the-art models (e.g., Borzoi, Pangolin) typically achieve tissue specificity by training separate "heads" or models for predefined, discrete tissue types. This rigid categorical approach prevents learning from pathological states, cell lines, or experimental perturbations that do not fit predefined categories.
Lack of Generalization: These models struggle to generalize to unseen cellular contexts because they treat cell type as a label rather than a continuous input derived from biological state.
Data Limitations: Most models rely on reference genomes, ignoring individual-specific variations (indels, copy number variations) and diploidy, which are crucial for accurate variant effect prediction.

The authors aim to develop a framework that predicts splicing patterns from DNA sequence conditioned on a continuous "splicing state" derived from gene expression, enabling generalization across any cellular context without predefined tissue labels.

2. Methodology: PanExonNet

The authors introduce PanExonNet, a deep learning framework that integrates cis-regulation (DNA sequence) and trans-regulation (splicing factor expression).

Core Architecture

Inputs:
- Sequence: Diploid gene sequences containing individual-specific variants and indels. For cancer cell lines, the model accounts for copy number variations (aneuploidy) by weighting alleles based on local ploidy.
- Context: Expression levels (TPM) of a panel of 277 splicing factors (RNA-binding proteins and spliceosome components).
Contextualizable Convolution:
- Instead of concatenating context embeddings at the end, the model uses Contextualizable ConvNeXt layers.
- A context encoder processes the splicing factor expression vector into a low-dimensional embedding.
- This embedding modulates the weights of the sequence encoder's convolutional layers (depthwise convolution, normalization, pointwise convolution) dynamically. This allows the sequence features to be interpreted differently based on the cellular environment.
Outputs:
- Tracks: Four single-nucleotide resolution tracks: Coverage, Donor Usage, Acceptor Usage, and Intron.
- Junctions: Explicit prediction of donor-acceptor junction usage (a $K \times K$ matrix), not just splice site strength.
- Integration: Predictions from the two alleles are projected to reference coordinates and combined (weighted by copy number for cell lines) to match standard RNA-seq alignment practices.
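The weight modulation described in the Contextualizable Convolution item can be illustrated with a small sketch. It is not the authors' implementation: this summary does not give the exact parameterization, so the code assumes a simple multiplicative scheme in which a per-channel scale, computed from the context embedding, rescales a shared depthwise kernel. The names `encode_context`, `contextual_depthwise_conv`, `proj`, and `mod` are hypothetical, the log1p/tanh encoder is an assumption, and only the depthwise step is shown (normalization and pointwise convolution are omitted).

```python
import math
import random


def encode_context(tpm, proj):
    """Map splicing factor TPMs to a low-dimensional embedding.

    Assumed encoder: log1p transform, one linear layer, tanh squashing.
    """
    logged = [math.log1p(v) for v in tpm]
    return [math.tanh(sum(w * x for w, x in zip(row, logged))) for row in proj]


def contextual_depthwise_conv(x, base_kernels, mod, embedding):
    """Depthwise 1D convolution whose kernels are rescaled by the context.

    x:            [channels][length] sequence features
    base_kernels: [channels][kernel_size] context-independent weights (odd size)
    mod:          [channels][embed_dim] maps the embedding to a per-channel scale
    embedding:    [embed_dim] output of the context encoder
    """
    out = []
    for c, (row, kernel) in enumerate(zip(x, base_kernels)):
        # Assumed modulation: scale = 1 + <mod[c], embedding>
        scale = 1.0 + sum(m * e for m, e in zip(mod[c], embedding))
        k = [w * scale for w in kernel]
        half = len(k) // 2
        padded = [0.0] * half + list(row) + [0.0] * half  # zero "same" padding
        out.append([sum(k[j] * padded[i + j] for j in range(len(k)))
                    for i in range(len(row))])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    n_factors, embed_dim, channels, length = 277, 4, 3, 8
    proj = [[rng.gauss(0, 0.02) for _ in range(n_factors)] for _ in range(embed_dim)]
    mod = [[rng.gauss(0, 0.5) for _ in range(embed_dim)] for _ in range(channels)]
    base = [[0.25, 0.5, 0.25] for _ in range(channels)]
    x = [[rng.random() for _ in range(length)] for _ in range(channels)]

    # Same sequence features, two different splicing factor profiles.
    tpm_a = [rng.uniform(0, 50) for _ in range(n_factors)]
    tpm_b = [rng.uniform(0, 50) for _ in range(n_factors)]
    y_a = contextual_depthwise_conv(x, base, mod, encode_context(tpm_a, proj))
    y_b = contextual_depthwise_conv(x, base, mod, encode_context(tpm_b, proj))
    diff = max(abs(a - b) for ra, rb in zip(y_a, y_b) for a, b in zip(ra, rb))
    print("largest output difference between contexts:", diff)
```

The point of the design is that the context acts on the filters themselves instead of being appended to the sequence features afterwards, so the same motif can contribute differently to the output under different splicing factor profiles.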

Training Strategy

Data: Trained on GTEx (healthy tissues) and KD-RNA-seq (knockdown of RNA-binding proteins in cancer cell lines).
Objective: The model predicts a "splicing profile" (the content of a sashimi plot) as a coherent object. The loss is based on a weighted cosine similarity between predicted and observed profiles. Because cosine similarity ignores overall scale, the model can learn relative isoform distributions without needing to predict absolute transcription initiation rates.
Augmentation: TPM values are augmented using Poisson sampling to account for read-count noise.
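A rough sketch of the objective and the augmentation follows. Several details are assumptions, since this summary does not specify them: the loss is taken to be one minus the weighted cosine similarity, the weights are a generic per-position vector, and the augmentation converts TPM to pseudo-counts at a hypothetical sequencing depth (`pseudo_depth`, ignoring transcript length) before Poisson sampling. The function names are illustrative, and the Poisson sampler uses a normal approximation for large means to stay within the standard library.

```python
import math
import random


def weighted_cosine_loss(pred, obs, weights):
    """Assumed loss: 1 - weighted cosine similarity of two flattened profiles."""
    dot = sum(w * p * o for w, p, o in zip(weights, pred, obs))
    norm_p = math.sqrt(sum(w * p * p for w, p in zip(weights, pred)))
    norm_o = math.sqrt(sum(w * o * o for w, o in zip(weights, obs)))
    if norm_p == 0.0 or norm_o == 0.0:
        return 1.0  # similarity undefined; treated as maximally dissimilar here
    return 1.0 - dot / (norm_p * norm_o)


def sample_poisson(lam, rng):
    """Draw one Poisson sample (Knuth's method; normal approximation if lam >= 30)."""
    if lam <= 0:
        return 0
    if lam < 30:
        threshold, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                return k
            k += 1
    return max(0, round(rng.gauss(lam, math.sqrt(lam))))


def augment_tpm(tpm, rng, pseudo_depth=2_000_000):
    """Resample TPM values as if re-sequencing at `pseudo_depth` reads (hypothetical)."""
    scale = pseudo_depth / 1_000_000  # expected reads per unit of TPM
    return [sample_poisson(v * scale, rng) / scale for v in tpm]


if __name__ == "__main__":
    rng = random.Random(0)
    observed = [4.0, 0.0, 12.0, 8.0]
    predicted = [2.1, 0.1, 5.9, 4.2]   # roughly half the scale of `observed`
    print("loss:", weighted_cosine_loss(predicted, observed, [1.0] * 4))
    print("augmented TPM:", augment_tpm([0.0, 3.5, 40.0, 900.0], rng))
```

Under these assumptions, a prediction that matches the observed profile up to a constant factor incurs a loss close to zero, which is the property that removes the need to model absolute expression. Lowly expressed factors receive proportionally more noise from the augmentation than highly expressed ones, mimicking read-count sampling.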

3. Key Contributions

Pan-Cell-Type Framework: A novel architecture that conditions sequence-based predictions on a continuous "splicing state" derived from trans-acting factors, eliminating the need for discrete tissue-specific heads.
Contextualizable Convolutions: Introduction of a modular layer that dynamically modulates convolutional weights based on input context. This is computationally efficient and biologically interpretable, modeling how the abundance of splicing factors changes the effective importance of sequence motifs.
Individual-Level Genomic Modeling: The model is the first to train on individual-level diploid genomes with indels and copy number variations, rather than relying solely on reference genomes.
Junction Prediction: Unlike most predecessors, PanExonNet explicitly predicts donor-acceptor junction usage, enabling the disambiguation of complex splicing patterns (e.g., mutually exclusive exons).
Generalization via Perturbation: Demonstrated that training on perturbation data (KD-RNA-seq) improves generalization to unseen cell types, suggesting that the model learns regulatory logic rather than just memorizing tissue categories.

4. Results

Superior Tissue Specificity: Evaluated on a cassette exon inclusion benchmark using $\Delta$PSI correlation, where PSI is percent spliced in and $\Delta$PSI is the deviation from the median inclusion. PanExonNet significantly outperformed multi-headed baselines (Borzoi, Pangolin) and matched multi-headed models trained by the authors.
- Metric: PanExonNet achieved a $\Delta$PSI correlation of ~0.2, compared to ~0.13 for concatenation-based approaches and lower for multi-headed models.
Importance of Split-Read Tracks: Models utilizing tracks derived from split-reads (donor/acceptor usage, intron) outperformed those relying solely on coverage (like Borzoi).
Synergy of Junction Heads: Adding a junction prediction head improved the performance of all other tracks, suggesting a synergistic learning effect rather than interference.
Generalization to Unseen Cell Types: When tested on held-out GTEx tissues and cell lines, PanExonNet maintained high performance. Crucially, models trained jointly with KD-RNA-seq data showed improved generalization to unseen cell types compared to models trained on GTEx alone.
Predictive Reliability: The model exhibits high positive predictive value; while it may miss small deviations (false negatives), large predicted deviations are highly likely to be true. Filtering out low-confidence predictions significantly boosts metric performance.
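To make the evaluation metric concrete, the following sketch computes a $\Delta$PSI correlation from per-exon inclusion levels (PSI) and optionally keeps only large predicted deviations. It rests on assumptions not stated in this summary: the correlation is taken to be Pearson, pooled over all exons and contexts, and "confidence" is approximated by the absolute predicted $\Delta$PSI (the hypothetical `min_abs_pred` threshold). The function names are illustrative.

```python
import math
from statistics import median


def delta_psi(psi):
    """Subtract each exon's median PSI across contexts. Input: [exons][contexts]."""
    result = []
    for row in psi:
        m = median(row)
        result.append([v - m for v in row])
    return result


def pearson(a, b):
    """Pearson correlation; returns NaN when it is undefined."""
    if len(a) < 2:
        return float("nan")
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:
        return float("nan")
    return cov / math.sqrt(va * vb)


def delta_psi_correlation(pred_psi, obs_psi, min_abs_pred=0.0):
    """Correlate predicted and observed deviations from the per-exon median.

    Pairs whose predicted |delta PSI| is below `min_abs_pred` are dropped,
    an assumed stand-in for filtering out low-confidence predictions.
    """
    pairs = [
        (p, o)
        for p_row, o_row in zip(delta_psi(pred_psi), delta_psi(obs_psi))
        for p, o in zip(p_row, o_row)
        if abs(p) >= min_abs_pred
    ]
    return pearson([p for p, _ in pairs], [o for _, o in pairs])


if __name__ == "__main__":
    # Two exons measured in three contexts (toy values).
    observed = [[0.10, 0.50, 0.90], [0.80, 0.70, 0.20]]
    predicted = [[0.30, 0.45, 0.70], [0.75, 0.72, 0.40]]
    print("all pairs:", delta_psi_correlation(predicted, observed))
    print("large predicted deviations only:",
          delta_psi_correlation(predicted, observed, min_abs_pred=0.2))
```

Subtracting the per-exon median means a model gets no credit for knowing that an exon is generally included or skipped; only the context-specific deviation counts, which is why the reported correlations are modest in absolute terms.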

5. Significance and Future Impact

Clinical Applications: The framework provides a scalable foundation for predicting variant effects in non-reference genomes, designing oligonucleotide therapeutics, and discovering biomarkers across diverse cellular contexts, including diseased tissues where specific cell types are not well-defined.
Causal Modeling: By integrating perturbation data, the model moves toward a causal understanding of splicing regulation, potentially allowing for the prediction of interventions that shift splicing states in specific disease contexts.
Single-Cell Integration: The inferred "splicing state" can serve as a bridge to single-cell RNA-seq data (which typically lacks splicing resolution), enabling the prediction of splicing patterns in single cells based solely on their splicing factor expression.
Modularity: The "Contextualizable ConvNeXt" layer is presented as a generalizable module for any genomic sequence modeling task requiring context specificity, offering a computationally efficient alternative to attention-based context injection.

In summary, PanExonNet represents a paradigm shift from discrete, tissue-specific modeling to a continuous, context-aware framework that successfully generalizes across healthy, diseased, and perturbed cellular states, leveraging individual genomic variation and trans-regulatory expression data.