Inferring large networks with matrix factorisation to capture non-linear dependencies among genes using sparse single-cell profiles

The paper introduces NIRD, a matrix factorization-based method that infers non-linear gene regulatory networks from sparse single-cell transcriptomic data by combining internal imputation with tree-ensemble regression. The authors report superior performance, robustness to batch effects, and improved accuracy in predicting transcription factor targets when the method is integrated with RNA velocity.

Original authors: Jha, I. P., Meshran, A. G., Kumar, V., Natarajan, K. N., Kumar, V.

Published 2026-03-10

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Problem: The "Noisy Library"

Imagine you walk into a massive library with millions of books (genes). You want to figure out which books influence which others. For example, does Book A tell Book B to open its pages?

In the past, scientists looked at the library as a whole (like taking a photo of the whole room). But this is like looking at a blurry crowd; you can't see who is talking to whom because everyone is mixed together.

To get a better look, scientists started looking at individual people (single cells) instead of the whole crowd. This is great because it reveals the unique conversations happening in each person. However, there's a catch: The data is incredibly sparse.

Think of it like this: You have a library with 20,000 books, but for any single person, only 50 books are actually open. The rest are closed and dark. If you try to map the relationships between all 20,000 books based on just 50 open ones, it's like trying to solve a giant puzzle with 99% of the pieces missing. Old methods (like GENIE3 or GRNBoost2) try to guess the missing pieces by looking at patterns, but when the data is this sparse and noisy, they get confused, make mistakes, and take a very long time to compute.
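To see just how extreme this sparsity is, here is a toy illustration (not from the paper): a random count matrix in which the vast majority of entries are zero, mimicking single-cell "dropout". The matrix size and Poisson rate are made-up numbers chosen only to make the point.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy single-cell count matrix: 500 cells x 2000 genes.
# A low Poisson rate makes most entries zero, mimicking dropout.
X = rng.poisson(0.05, size=(500, 2000))

# Fraction of entries that are zero ("closed books")
sparsity = (X == 0).mean()
print(f"{sparsity:.1%} of the matrix is zeros")
```

With a rate of 0.05, roughly 95% of entries are zero, which is in the ballpark of real droplet-based scRNA-seq data and explains why methods that treat every zero as a measured value get confused.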

The Solution: NIRD (The "Smart Summarizer")

The authors propose a new method called NIRD (Network Inference in Reduced Dimension).

The Analogy: The "Abstract Art" Approach
Instead of trying to read every single word in every book (which is impossible with missing data), NIRD does something clever:

  1. Summarize the Room (Matrix Factorization): Imagine you take a photo of the library and compress it into a few "abstract art" images. These images capture the vibe of the room without needing every single detail. In math terms, they reduce the massive data into a smaller, cleaner set of "basis vectors" (the abstract images).
  2. Learn the Rules (Tree Ensembles): Now, instead of guessing how 20,000 books talk to each other, the computer learns how these few "abstract images" influence the books. It uses a smart decision-tree system (like a flowchart) to figure out which "vibe" causes a specific book to open.
  3. Project Back (The Magic Trick): Once the computer understands how the "vibes" control the books, it translates that knowledge back to the original 20,000 books. It can now say, "Ah, Book A influences Book B because they both respond to the same 'vibe'."
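The three steps above can be sketched in code. This is a minimal illustration of the general idea, not the authors' implementation: it assumes non-negative matrix factorization (NMF) for the "abstract art" step and random forests for the tree ensemble, and all sizes and parameter values are arbitrary.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy sparse expression matrix: 100 cells x 20 genes (counts, mostly small)
X = rng.poisson(0.3, size=(100, 20)).astype(float)

# Step 1: summarise the room -- compress 20 genes into k "abstract images"
k = 5
nmf = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X)   # cells x k: how strongly each cell shows each "vibe"
H = nmf.components_        # k x genes: how strongly each gene loads on each "vibe"

# Step 2: learn the rules -- for each gene, a tree ensemble learns
# which "vibes" (columns of W) predict that gene's expression
importances = np.zeros((k, X.shape[1]))
for g in range(X.shape[1]):
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(W, X[:, g])
    importances[:, g] = rf.feature_importances_

# Step 3: project back -- genes that respond to the same influential
# "vibes" get high mutual scores in a gene x gene adjacency matrix
A = H.T @ importances      # genes x genes edge scores
np.fill_diagonal(A, 0.0)   # ignore self-loops
```

The key trick is that the expensive regression runs over only `k` factors instead of all genes, and the gene-level network is recovered afterwards via the factorization's gene loadings.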

Why is this better?

  • Noise Reduction: By summarizing the data first, NIRD filters out the static and noise (the "closed books") that confuse other methods.
  • Speed: It's much faster because it's solving a smaller puzzle first.
  • Consistency: Even if you take the library photo on a rainy day (batch effect) or a sunny day, the "abstract art" looks the same, so the relationships you find remain consistent.

Real-World Tests: Does it Work?

The authors tested NIRD in three scenarios:

  1. The "Gold Standard" Test: They used known bacterial and yeast networks where the answers were already known. NIRD found the correct connections faster and more accurately than the old methods.
  2. The "Osteoarthritis" Detective: They looked at cells from people with knee arthritis (OA) vs. healthy people.
    • The Old Way: The methods got confused by the noise and couldn't agree on which genes were important.
    • The NIRD Way: It consistently found specific "villains" (genes like ZNF207 and MAX) that were driving the inflammation in arthritis. It even found new clues about how the body tries to heal wounds but gets stuck in the "inflammation phase."
  3. The "Time Travel" Test (RNA Velocity):
    • The Concept: Standard data is a photo (static). RNA Velocity is like a video; it shows which way a cell is moving (e.g., is it becoming a muscle cell or a skin cell?).
    • The Result: When NIRD combined the "photo" (gene expression) with the "video" (RNA velocity), it became a super-detective. It could accurately predict which genes a "boss" gene (the transcription factor ZIC3) was directly controlling to make stem cells change, while other methods performed little better than random guessing.
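The expression-plus-velocity idea can be sketched as well: regress each gene's velocity (its direction of change, the "video") on the current expression of candidate regulator TFs (the "photo") with a tree ensemble, then read off directed TF → target scores from the feature importances. This is an illustrative toy on synthetic data, not the paper's pipeline; the variable names and scoring scheme are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n_cells, n_tfs, n_genes = 200, 5, 10

# "Photo": current expression of 5 candidate TFs across 200 cells
tf_expr = rng.random((n_cells, n_tfs))

# "Video": each gene's velocity is driven by exactly one TF, plus noise
drivers = rng.integers(0, n_tfs, n_genes)
velocity = tf_expr[:, drivers] + 0.1 * rng.standard_normal((n_cells, n_genes))

# For each gene, ask which TF's current expression best predicts
# the gene's future change; importances give directed TF -> gene scores
link_scores = np.zeros((n_tfs, n_genes))
for g in range(n_genes):
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model.fit(tf_expr, velocity[:, g])
    link_scores[:, g] = model.feature_importances_
```

Because velocity points forward in time while expression is measured now, high scores here suggest direct regulation rather than mere co-expression, which is the intuition behind the paper's third test.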

The Takeaway

Think of NIRD as a smart translator.
Old methods try to translate a conversation by listening to every single word in a noisy room, getting lost in the static. NIRD first listens to the tone and rhythm of the room (the reduced dimension), figures out the main message, and then translates that back to the specific words.

This allows scientists to:

  • Build accurate maps of how genes talk to each other, even with messy data.
  • Find the specific genes causing diseases like arthritis.
  • Predict how cells will change in the future, helping us understand development and disease better.

In short, NIRD turns a chaotic, blurry picture of a cell into a clear, sharp map of its internal wiring.
