Predicting Unseen Gene Perturbation Response Using Graph Neural Networks with Biological Priors

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a massive mystery inside a city called The Cell. This city is bustling with thousands of citizens (genes) who constantly talk to each other, send signals, and work together to keep the city running.

Sometimes, scientists want to see what happens if they "knock out" or mess with a specific citizen (a gene) to see how the whole city reacts. This is like pulling a specific thread in a sweater to see how the whole fabric changes.

The Problem:
There are thousands of citizens in this city. Testing every single one of them in a real lab would cost too much and take too long, so we can only test a few hundred. But what if we want to know what happens if we mess with the 5,000th gene? We can't just wait for the lab experiment. We need a way to predict the outcome without actually doing the experiment.

The Solution: PerturbGraph
The authors of this paper built a super-smart computer program called PerturbGraph. Think of it as a crystal ball powered by a map of the city's relationships.

Here is how it works, broken down into simple analogies:

1. The Map (The Biological Network)

Imagine you have a giant map of the city showing who talks to whom. In biology, this is called a Protein-Protein Interaction Network.

The Analogy: If Gene A is the "Mayor" and Gene B is the "Chief of Police," they talk a lot. If Gene C is a "Street Sweeper," they might not talk to the Mayor directly, but they talk to the Police Chief.
How the AI uses it: The AI knows that if you mess with the Mayor, the Police Chief will definitely react. But if you mess with the Street Sweeper, the Mayor might not notice immediately. The AI uses this map to guess how the reaction spreads.

2. The "Gossip" (Message Passing)

The core of the AI is a Graph Neural Network.

The Analogy: Imagine a game of "Telephone." If you whisper a secret to your neighbor, they tell their neighbor, and so on.
How the AI uses it: When the AI tries to predict what happens to a gene it has never seen before (an "unseen" gene), it doesn't guess in a vacuum. It looks at the gene's neighbors on the map. It asks, "What happened to your friends when they were messed with?" It gathers all that "gossip" from the network and uses it to figure out the likely reaction of the new gene.

3. The "Resume" (Biological Priors)

The AI doesn't just look at the map; it also looks at the gene's "resume."

The Analogy: Before making a prediction, the AI checks the gene's background:
- What is its job? (Gene Ontology: Is it a builder? A cleaner? A messenger?)
- How loud is it usually? (Baseline statistics: Is it a quiet gene or a loud one?)
- Who are its friends? (Network embeddings: Is it popular or a loner?)
How the AI uses it: By combining the map (who they know) with the resume (who they are), the AI makes a much smarter guess than if it just looked at the gene in isolation.

4. The Prediction (The Crystal Ball)

The goal is to predict the Transcriptional Response.

The Analogy: If you pull the Mayor's thread, the AI predicts exactly how the city's noise level, traffic, and mood will change. It predicts which other citizens will start shouting (up-regulated) and which will go silent (down-regulated).

Why is this a Big Deal?

The paper tested this AI against other methods:

Old School Math: Like guessing based on simple averages. (The AI crushed them).
Deep Learning without Maps: Like a smart student who knows the facts but doesn't know who knows whom. (The AI did better because it used the map).
Other AI Models: The AI was the best at predicting what happens to genes it had never seen before.

The Results:

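It scored about 6% higher than the strongest feature-based model (a Random Forest) and about 4% higher than the best competing deep learning model (CPA) at guessing the overall reaction.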
It was 20% more accurate than simple linear models.
It successfully predicted which specific genes would turn on or off, helping scientists find new drug targets or understand diseases without running thousands of expensive lab tests.

In a Nutshell

PerturbGraph is like a detective who has memorized the entire social network of a city. If you ask, "What happens if we arrest this one guy we've never met?" the detective doesn't guess randomly. They look at his friends, his job, and his history, and say, "Well, his best friend is the Police Chief, so the Chief will be furious, and the whole police department will shut down."

This allows scientists to simulate many experiments on a computer, which saves time and money and helps us understand how life works much faster.

1. Problem Statement

The central challenge addressed is predicting transcriptional responses to genetic perturbations that have not been experimentally observed.

Context: While CRISPR Perturb-seq technologies allow for high-resolution measurement of gene expression changes, it is infeasible to experimentally test every possible gene perturbation due to cost and biological constraints.
Gap: Existing computational models often struggle to generalize to "unseen" perturbations (genes not present in the training set). Many current approaches focus on generating cell-level profiles or rely on feature-based mappings without fully exploiting the complex propagation of perturbation effects through biological interaction networks (e.g., protein-protein interactions).
Goal: Develop a scalable framework that can infer the transcriptional shift (differential expression) of a gene when perturbed, based solely on its biological context and relationships with other genes, even if that specific gene was never perturbed during training.

2. Methodology: PerturbGraph

The authors propose PerturbGraph, a biologically informed graph-learning framework. The workflow consists of four main stages:

A. Data Preprocessing & Signature Construction

Input: Single-cell CRISPR perturbation data (e.g., Replogle and Norman datasets).
Signature Generation: Single-cell measurements are aggregated into "pseudo-bulk" profiles. A perturbation signature ($\Delta_i$) is calculated as the difference between the mean expression of cells carrying perturbation $i$ and the mean expression of control cells: $\Delta_i = \bar{x}^{\text{pert}}_i - \bar{x}^{\text{ctrl}}$.
Latent Space Projection: To handle high dimensionality and noise, signatures are projected into a low-dimensional latent space ($K$ dimensions) using Truncated Singular Value Decomposition (SVD). This yields latent perturbation programs ($H$) that capture stable transcriptional variations.
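The two steps above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names, the `"control"` label, and the toy data are all assumptions made for illustration, and the convention $\Delta \approx HV$ (with $V$ holding the top $K$ right singular vectors as rows) is one plausible reading of the notation.

```python
import numpy as np

def perturbation_signatures(expr, labels, control_label="control"):
    """Pseudo-bulk signatures: mean(perturbed cells) - mean(control cells)."""
    labels = np.asarray(labels)
    ctrl_mean = expr[labels == control_label].mean(axis=0)
    names = sorted(set(labels.tolist()) - {control_label})
    delta = np.stack([expr[labels == g].mean(axis=0) - ctrl_mean for g in names])
    return names, delta

def truncated_svd(delta, k):
    """Rank-k factorization delta ~= H @ V (H: n_pert x k, V: k x n_genes)."""
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    H = U[:, :k] * s[:k]   # latent perturbation programs
    V = Vt[:k]             # basis used to map latent programs back to genes
    return H, V

# Toy data (hypothetical): 60 cells x 8 genes, two perturbations plus controls.
rng = np.random.default_rng(0)
expr = rng.normal(size=(60, 8))
labels = ["control"] * 20 + ["geneA"] * 20 + ["geneB"] * 20

names, delta = perturbation_signatures(expr, labels)
H, V = truncated_svd(delta, k=2)
print(names, delta.shape, H.shape, V.shape)
```

With only two perturbations the rank-2 factorization reproduces the signatures exactly; on real data $K$ would be much smaller than the number of perturbations, so the projection discards low-variance directions.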

B. Biological Graph Construction

Graph Structure ($G$): A biological interaction graph is constructed where nodes represent genes and edges represent functional relationships derived from the STRING protein-protein interaction (PPI) database.
Node Features ($Z$): Each gene node is enriched with a multi-source feature vector combining:
1. Network Embeddings: Structural features learned via Node2Vec.
2. Baseline Statistics: Transcriptional characteristics (mean, variance) from control cells.
3. Functional Annotations: Gene Ontology (GO) embeddings.
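As a rough illustration of this stage, the sketch below builds a normalized adjacency matrix from an edge list and concatenates three feature blocks into $Z$. Everything specific here is an assumption: the symmetric normalization with self-loops is the standard GCN choice rather than something the summary confirms, the per-column z-scoring is a hypothetical preprocessing step, and random arrays stand in for real Node2Vec and GO embeddings.

```python
import numpy as np

def normalized_adjacency(n_nodes, edges):
    """Symmetric GCN-style normalization: D^{-1/2} (A + I) D^{-1/2}."""
    A = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0          # undirected interaction
    A += np.eye(n_nodes)                 # self-loops keep each node's own features
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def build_node_features(net_emb, baseline_stats, go_emb):
    """Z-score each feature block per column, then concatenate into Z."""
    def zscore(X):
        return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    return np.hstack([zscore(net_emb), zscore(baseline_stats), zscore(go_emb)])

# Toy example (hypothetical): 4 genes connected in a chain 0-1-2-3.
rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3)]
A_hat = normalized_adjacency(4, edges)

net_emb = rng.normal(size=(4, 3))         # placeholder for Node2Vec embeddings
baseline_stats = rng.normal(size=(4, 2))  # placeholder for control mean and variance
go_emb = rng.normal(size=(4, 5))          # placeholder for GO embeddings
Z = build_node_features(net_emb, baseline_stats, go_emb)
print(A_hat.shape, Z.shape)
```

Scaling each block before concatenation is one way to stop a single feature source from dominating; whether the authors do this is not stated.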

C. Graph Neural Network (GNN) Architecture

Model: A Graph Convolutional Network (GCN) is employed to propagate information across the interaction graph.
Mechanism: The GCN uses message passing to aggregate information from neighboring genes. The update rule is:
$H^{(l+1)} = \sigma(\hat{A}H^{(l)}W^{(l)})$
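Here $\hat{A}$ is the normalized adjacency matrix, $H^{(l)}$ is the matrix of node representations at layer $l$ (not to be confused with the latent programs $H$ from stage A), $W^{(l)}$ is the learnable weight matrix of that layer, and $\sigma$ is a non-linear activation.
Prediction: The final node embeddings are decoded to reconstruct the predicted latent perturbation program ($\hat{h}_i$), which is then mapped back to the transcriptional space to predict the expression shift ($\hat{\Delta}_i = \hat{h}_i V$), where $V$ is presumably the basis obtained from the truncated SVD.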
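The update rule and the decoding step can be sketched as a forward pass in NumPy. This is an untrained toy, not the authors' implementation: ReLU as $\sigma$, two layers, a linear decoder `W_dec`, starting the propagation from the node features $Z$, and all dimensions are assumptions chosen for illustration.

```python
import numpy as np

def gcn_forward(A_hat, Z, weights):
    """Apply H^{(l+1)} = relu(A_hat @ H^{(l)} @ W^{(l)}) once per weight matrix."""
    H = Z                                   # assumed starting point: node features
    for W in weights:
        H = np.maximum(A_hat @ H @ W, 0.0)  # aggregate neighbors, transform, ReLU
    return H

def predict_shift(A_hat, Z, weights, W_dec, V):
    """Decode node embeddings to latent programs, then map back to gene space."""
    h_hat = gcn_forward(A_hat, Z, weights) @ W_dec   # (n_nodes x K)
    return h_hat @ V                                 # (n_nodes x n_expr_genes)

rng = np.random.default_rng(0)
n_nodes, n_feat, n_hidden, K, n_expr = 4, 6, 5, 2, 8

# Normalized adjacency for a chain graph 0-1-2-3 with self-loops.
A = np.eye(n_nodes)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
deg = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(deg, deg))

Z = rng.normal(size=(n_nodes, n_feat))
weights = [rng.normal(size=(n_feat, n_hidden)),
           rng.normal(size=(n_hidden, n_hidden))]   # random, i.e. untrained
W_dec = rng.normal(size=(n_hidden, K))              # hypothetical linear decoder
V = rng.normal(size=(K, n_expr))                    # stand-in for the SVD basis

delta_hat = predict_shift(A_hat, Z, weights, W_dec, V)
print(delta_hat.shape)
```

In training, the weights would be fitted so that the decoded programs of training genes match their observed latent programs. Because every node receives an embedding through message passing, genes that were never perturbed still get a prediction.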

D. Evaluation Protocol

Strict Unseen Setting: The dataset is split into training, validation, and test sets based on perturbation genes. Crucially, test genes are disjoint from training genes (the model has never seen the perturbation of these specific genes).
Metrics: Cosine similarity (global agreement), Spearman rank correlation (gene ranking), Directional Accuracy (up/down regulation correctness), and Precision@k (recovery of top differentially expressed genes).
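A short sketch may make the protocol concrete. The metric definitions below are common conventions and may differ in detail from the paper's (for example, Precision@k is assumed here to compare the top-k genes by absolute predicted and true shift). The split fractions and all function names are hypothetical.

```python
import numpy as np

def cosine_similarity(pred, true):
    return float(pred @ true / (np.linalg.norm(pred) * np.linalg.norm(true)))

def _ranks(x):
    """Ranks 0..n-1 (no tie correction; adequate for continuous values)."""
    ranks = np.empty(len(x))
    ranks[np.argsort(x)] = np.arange(len(x))
    return ranks

def spearman(pred, true):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(_ranks(pred), _ranks(true))[0, 1])

def directional_accuracy(pred, true):
    """Fraction of genes whose predicted sign (up/down) matches the truth."""
    return float(np.mean(np.sign(pred) == np.sign(true)))

def precision_at_k(pred, true, k):
    """Overlap between the top-k genes by |predicted| and by |true| shift."""
    top_pred = set(np.argsort(-np.abs(pred))[:k].tolist())
    top_true = set(np.argsort(-np.abs(true))[:k].tolist())
    return len(top_pred & top_true) / k

def split_by_gene(genes, frac_train=0.7, frac_val=0.15, seed=0):
    """Split perturbation genes so train, validation, and test are disjoint."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(sorted(set(genes))).tolist()
    n_train = int(frac_train * len(shuffled))
    n_val = int(frac_val * len(shuffled))
    return (set(shuffled[:n_train]),
            set(shuffled[n_train:n_train + n_val]),
            set(shuffled[n_train + n_val:]))

true = np.array([2.0, -1.0, 0.5, -0.2, 0.1])
pred = np.array([1.5, -0.8, -0.3, -0.1, 0.4])
print(cosine_similarity(pred, true), spearman(pred, true),
      directional_accuracy(pred, true), precision_at_k(pred, true, k=2))

train, val, test = split_by_gene([f"g{i}" for i in range(20)])
```

Splitting on perturbation genes rather than on cells is what makes the setting "unseen": no cell carrying a test perturbation can leak into training.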

3. Key Contributions

Novel Framework: Introduction of PerturbGraph, which the authors present as the first framework to explicitly model perturbation effects as stable transcriptional programs propagated through a biologically enriched interaction graph.
Enriched Representation: Development of a composite node feature set integrating PPI topology, graph embeddings, baseline transcriptional statistics, and GO functional annotations.
Superior Generalization: Demonstration that propagating information through biological networks significantly outperforms models that rely solely on gene features or cell-level generative modeling.
Comprehensive Benchmarking: A rigorous evaluation against a wide range of baselines, including linear models (Ridge, Lasso), nonlinear models (Random Forest, MLP), state-of-the-art perturbation models (scGen, CPA), and alternative GNNs (GraphSAGE, GAT).

4. Results

The model was evaluated on large-scale CRISPR datasets (Replogle and Norman) under strict unseen-perturbation settings.

Performance vs. Baselines:
- Replogle Dataset: PerturbGraph achieved a Cosine Similarity of 0.592 and Spearman Correlation of 0.340.
- It outperformed the strongest feature-based baseline (Random Forest, Cosine 0.557) by ~6% in relative terms, and it also exceeded perturbation-specific deep learning models such as CPA (0.567) and scGen (0.557).
- It also outperformed other GNN architectures (GraphSAGE: 0.563, GAT: 0.551), suggesting that standard GCN aggregation is particularly effective for this task.
Generalization (Norman Dataset):
- PerturbGraph achieved a Cosine Similarity of 0.940 and Spearman Correlation of 0.815, surpassing Ridge regression (0.901) and CPA (0.918).
Biological Priors Impact:
- Ablation studies (Table 3) showed that while graph topology alone is effective, adding GO embeddings provided the most significant boost, confirming that functional annotations are critical for predicting unseen gene behaviors.
Interpretability:
- The model successfully recovered biologically coherent pathways (e.g., translation, rRNA processing for ribosomal gene perturbations).
- Prediction accuracy was found to correlate with network proximity: genes closer to training perturbations in the PPI network yielded higher prediction accuracy.
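The relative margins behind the Replogle comparison can be checked directly from the cosine similarities quoted above. The snippet below is plain arithmetic on those reported numbers; the helper name `relative_gain` is introduced only for illustration.

```python
def relative_gain(model, baseline):
    """Relative improvement of `model` over `baseline`, in percent."""
    return 100.0 * (model - baseline) / baseline

# Cosine similarities on the Replogle dataset, as quoted in this summary.
perturbgraph = 0.592
replogle_cosine = {
    "Random Forest": 0.557,
    "CPA": 0.567,
    "scGen": 0.557,
    "GraphSAGE": 0.563,
    "GAT": 0.551,
}

gains = {name: relative_gain(perturbgraph, score)
         for name, score in replogle_cosine.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
```

The gain over Random Forest works out to roughly 6.3%, matching the ~6% figure, while the gain over CPA, the closest competitor by cosine similarity, is roughly 4.4%.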

5. Significance

Scalable Functional Discovery: PerturbGraph enables in silico screening of candidate perturbations, guiding experimental design and accelerating the discovery of regulatory dependencies and therapeutic targets without the need for exhaustive wet-lab experiments.
Biological Inductive Bias: The study validates that integrating biological interaction networks (PPI) with graph representation learning provides a strong inductive bias, allowing models to learn the "rules" of transcriptional regulation that generalize to unseen genes.
Methodological Advancement: The results indicate that, on these benchmarks, modeling gene-gene relationships via message passing outperforms treating genes as independent features or relying solely on cell-state generative models for perturbation prediction.

Limitations & Future Work: The current framework focuses on perturbation-level programs rather than cell-type heterogeneity. Future iterations aim to integrate cell-type-specific networks and generative single-cell models to further enhance biological interpretability.