Integrating morphology and gene expression of neural cells in unpaired single-cell data using GeoAdvAE

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Problem: The "Form vs. Function" Mystery

Imagine you are trying to understand how a car works. You have two huge libraries of information:

Library A: Millions of photos showing the car's shape, size, and how its wheels are turned (its Morphology or "Form").
Library B: Millions of engine logs showing the fuel mix, spark timing, and speed (its Gene Expression or "Function").

The problem? You never have a photo and an engine log from the same car at the same time. You have a giant pile of photos and a separate giant pile of logs.

In biology, this is exactly what happens with brain cells (neurons and microglia). Scientists can take beautiful 3D pictures of a cell's shape, or they can read its genetic "recipe book" (RNA). But doing both for the same cell is incredibly hard, slow, and expensive. So, we have two separate datasets that don't talk to each other.

The big question is: Does a cell's shape tell us what it's doing?

If a cell looks "round and blob-like" (amoeboid), is it fighting an infection?
If it looks "branchy and calm" (ramified), is it just patrolling the neighborhood?

Often, similar shapes hide very different internal jobs, and different jobs can look the same from the outside. It's a confusing puzzle.

The Solution: GeoAdvAE (The "Universal Translator")

The authors created a new AI tool called GeoAdvAE. Think of it as a super-smart translator that can take the "Language of Shapes" and the "Language of Genes" and force them to speak the same dialect, even though they've never met.

Here is how it works, using a party analogy:

1. The Two Separate Rooms (The Input)

Imagine a massive party.

Room A is full of people holding pictures of their houses (Morphology).
Room B is full of people holding lists of their favorite hobbies (Gene Expression).
No one knows who is in the other room. They are unpaired.

2. The Goal: A Shared Dance Floor (The Latent Space)

The AI wants to get everyone onto one single dance floor where people with similar "vibes" stand next to each other, regardless of whether they came from Room A or Room B.

If a person in Room A has a "spiky house," they should end up standing next to a person in Room B who has "aggressive hobbies."

3. How the AI Does It (The Three Tricks)

To make this work without cheating, the AI uses three special rules:

The Adversarial Game (The "Blindfolded Judge"):
The AI tries to mix the two groups so well that a "Judge" (a discriminator) cannot tell if a person came from the House Room or the Hobby Room. If the Judge can't tell the difference, the two groups are successfully merged.
The Geometry Rule (The "Gromov-Wasserstein" Trick):
This is the cleverest part. Imagine the people in Room A are arranged in a circle based on how similar their houses are. The AI forces the people in Room B to arrange themselves in a similar circle based on their hobbies. It doesn't matter exactly where anyone stands; what matters is that the pattern of relationships (who is close to whom, and who is far apart) stays the same. It preserves the "shape" of the data.
The "Teacher's Hint" (The Prior):
Sometimes, the AI needs a little nudge. The scientists give it a rough map: "Hey, we know that 'Excitatory Neurons' usually look like 'Pyramids'." This isn't a strict rule for every single cell, but it helps orient the whole group so they don't get turned upside down.

The Results: What Did They Find?

The team tested this tool in two ways:

1. The Validation Test (Patch-seq Neurons)
They used a rare dataset where they did have matching photos and gene lists for some neurons (the "ground truth").

Result: GeoAdvAE matched photos to gene lists more accurately than the other methods it was compared against, although its accuracy was still modest (34%). This suggests that the "Universal Translator" approach works.

2. The Real Discovery (Alzheimer's Microglia)
They applied the tool to Microglia (the immune cells of the brain) from the 5xFAD mouse model of Alzheimer's disease.

The Discovery: They found a smooth, one-dimensional line (a continuum) connecting the cells.
- On one end: Calm, Branchy cells (Ramified). These cells were busy with DNA repair (fixing the neighborhood).
- On the other end: Blob-like cells with retracted branches (Amoeboid). These cells were busy with Cell Killing (attacking bad neurons).
The Surprise: They found some genes (like Ms4a6b) that changed perfectly with the shape. But they also found that some "Disease" genes (Complement markers) were active without the cell changing its shape.
- Translation: A cell can be screaming "I'm in danger!" internally (genes) while still looking calm on the outside (shape). This means looking at a cell's shape alone isn't enough to know its full story.

Why This Matters

Before this, scientists had to choose: "Do I look at the shape, or do I look at the genes?" They couldn't easily combine them.

GeoAdvAE allows us to:

Connect the dots: We can now predict what a cell is doing just by looking at its shape (or vice versa) in large datasets where we don't have both.
Find new biology: It revealed that some disease processes happen "under the hood" without changing the car's exterior.
Save time and money: We don't need to do the expensive, slow "double-measurement" experiments for every single cell. We can use this AI to infer the connection from the massive amounts of data we already have.

In short: GeoAdvAE is a bridge that finally lets us understand how the "outside" (shape) and the "inside" (genes) of our brain cells work together, helping us solve mysteries like Alzheimer's disease.

1. Problem Statement

The paper addresses a critical gap in systems biology: the difficulty of profiling cellular morphology (form) and transcriptomics (function) in the same cells, at single-cell resolution and at scale.

The Challenge: While technologies like Patch-seq exist, they are low-throughput. Most available data consists of large, unpaired (diagonal) datasets: massive imaging collections with reconstructed cell shapes and separate, massive single-cell RNA sequencing (scRNA-seq) atlases.
Specific Difficulties in Morphology-GEX Integration:
- Imbalanced Information: Only a small subset of genes (e.g., cytoskeleton, membrane dynamics) directly influences shape, creating a low signal-to-noise ratio and intrinsic asymmetry.
- Lack of Feature Correspondence: Unlike RNA-ATAC or RNA-protein integration, there are no direct feature-to-feature anchors (e.g., a specific gene does not map 1:1 to a specific morphological feature).
- Geometric Complexity: Morphology requires quantitative descriptors that respect geometric relationships; naive embeddings often distort these structures.
Goal: Develop a method to align unpaired morphology and gene expression data into a shared latent space to infer biological relationships without requiring paired measurements.

2. Methodology: GeoAdvAE

The authors propose GeoAdvAE (Geometry-aware Adversarial Autoencoder), a deep learning framework designed for diagonal integration. It learns a shared latent space using four complementary components:

A. Architecture

Modality-Specific VAEs: Two Variational Autoencoders (VAEs) process the data separately.
- Morphology Encoder: Takes CAJAL-quantified morphological vectors (30 dimensions) through two hidden layers.
- Gene Expression Encoder: Takes log-normalized highly variable genes (2000 dimensions) through three layers.
- Both project into a shared latent space ( $d=16$ ).
Discriminator: A 3-layer Multi-Layer Perceptron (MLP) attempts to distinguish whether a latent vector originates from the morphology or gene expression modality. The encoders are trained adversarially to fool the discriminator, forcing the latent spaces to overlap.
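
The encoder layout described above can be sketched in plain Python. This is a minimal, untrained illustration of the shapes involved, not the authors' implementation: the hidden-layer widths (64, 32, 256, 128), the ReLU activations, and the reading of "three layers" as three hidden layers are all assumptions, and the `Encoder` class and helper names are hypothetical. Only the input sizes (30 and 2000) and the latent size (16) come from the text.

```python
import math
import random

LATENT_DIM = 16   # shared latent size stated in the text
MORPH_DIM = 30    # CAJAL-derived morphology vector
GEX_DIM = 2000    # log-normalized highly variable genes


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu=True):
    """Apply one dense layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


class Encoder:
    """MLP encoder that outputs the mean and log-variance of q(z | x)."""

    def __init__(self, sizes, rng):
        # sizes = [input_dim, hidden_1, ..., hidden_k]
        self.hidden = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
        self.mu_head = init_layer(sizes[-1], LATENT_DIM, rng)
        self.logvar_head = init_layer(sizes[-1], LATENT_DIM, rng)

    def encode(self, x, rng):
        h = x
        for layer in self.hidden:
            h = dense(h, layer)
        mu = dense(h, self.mu_head, relu=False)
        logvar = dense(h, self.logvar_head, relu=False)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)
        z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]
        return z, mu, logvar


rng = random.Random(0)
morph_encoder = Encoder([MORPH_DIM, 64, 32], rng)      # two hidden layers (widths assumed)
gex_encoder = Encoder([GEX_DIM, 256, 128, 64], rng)    # three hidden layers (assumed reading)

z_morph, _, _ = morph_encoder.encode([rng.random() for _ in range(MORPH_DIM)], rng)
z_gex, _, _ = gex_encoder.encode([rng.random() for _ in range(GEX_DIM)], rng)
print(len(z_morph), len(z_gex))  # 16 16
```

Both encoders end in the same 16-dimensional space, which is what allows a single discriminator to receive embeddings from either modality and try to guess their origin.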

B. Loss Function

The total loss ( $L_{total}$ ) is a weighted sum of five terms, optimized under a curriculum learning schedule:

Reconstruction Loss ( $L_{recon}$ ): $L_1$ loss (Manhattan distance) to ensure the decoders can reconstruct the original inputs, preserving modality-specific fidelity.
KL Divergence ( $L_{KL}$ ): Regularizes the latent distributions to match a unit-normal prior.
Adversarial Loss ( $L_{GAN}$ ): Encourages the encoders to produce embeddings that the discriminator cannot distinguish by modality, ensuring mixing.
Gromov-Wasserstein Regularization ( $L_{GW}$ ): This is a key innovation. It minimizes the discrepancy between the intra-modality pairwise distances in the two latent spaces. This ensures that the geometric structure (relative distances between cells) is preserved and aligned uniformly across modalities, rather than just matching global distributions.
Prior-Guided Cluster Alignment ( $L_{prior}$ ): A biological prior term that aligns broad, coarse-grained cell clusters (e.g., Excitatory vs. Inhibitory neurons, or Homeostatic vs. Disease-associated microglia). This provides a "semantic orientation" to the latent space, preventing the model from aligning modalities in a biologically meaningless direction.
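
Written out, the weighted sum of the five terms listed above might look as follows. The weights $\lambda$ are notation introduced here for illustration (the summary does not give their values or names). The KL expression is the standard closed form for a diagonal Gaussian posterior against a unit-normal prior, and the GW expression is the standard squared-loss Gromov-Wasserstein discrepancy; the paper's exact formulation may differ.

$$
L_{total} = \lambda_{recon} L_{recon} + \lambda_{KL} L_{KL} + \lambda_{GAN} L_{GAN} + \lambda_{GW} L_{GW} + \lambda_{prior} L_{prior}
$$

$$
L_{KL} = -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)
$$

$$
L_{GW} = \min_{T \in \Pi(p, q)} \sum_{i,k} \sum_{j,l} \left( D^{M}_{ik} - D^{G}_{jl} \right)^2 T_{ij} T_{kl}
$$

Here $\mu_j$ and $\sigma_j^2$ are the encoder's latent mean and variance in dimension $j$ (with $d=16$ ), $D^{M}$ and $D^{G}$ are the pairwise distance matrices among morphology and gene expression embeddings, and $T$ is a coupling between the two sets of cells with marginals $p$ and $q$.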

3. Key Contributions

Novel Framework: The authors present GeoAdvAE as the first method specifically designed to handle the asymmetry and lack of feature correspondence between single-cell morphology and transcriptomics.
Geometry-Aware Alignment: By incorporating the Gromov-Wasserstein (GW) loss, the method preserves the geometric relationships of cells within each modality, addressing the issue that morphology is a high-dimensional geometric object.
Biological Priors: The integration of coarse cluster priors ( $P$ ) allows the model to orient the latent space correctly even without paired data, a crucial step for biological interpretability.
Scalability: The method is designed to handle large-scale unpaired datasets (tens of thousands of cells), unlike Patch-seq, which is limited to hundreds of cells.
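
To make the geometry-aware alignment idea concrete, the sketch below evaluates the squared-loss Gromov-Wasserstein objective for a fixed coupling between two small point sets. It is an illustration under stated assumptions: a real GW computation also optimizes over the coupling, the function names are hypothetical, and the toy coordinates are invented for the example rather than taken from the paper.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix within one set of embeddings."""
    return [[math.dist(a, b) for b in points] for a in points]


def gw_objective(dist_a, dist_b, coupling):
    """Squared-loss Gromov-Wasserstein objective for a fixed coupling.

    Sums (dist_a[i][k] - dist_b[j][q])**2 * coupling[i][j] * coupling[k][q]
    over all index quadruples. A full GW solver would also minimize over
    the coupling; here it is supplied by the caller.
    """
    n, m = len(dist_a), len(dist_b)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if coupling[i][j] == 0.0:
                continue
            for k in range(n):
                for q in range(m):
                    diff = dist_a[i][k] - dist_b[j][q]
                    total += diff * diff * coupling[i][j] * coupling[k][q]
    return total


# Toy embeddings: set B is set A rotated by 90 degrees and shifted, so the two
# sets have identical internal geometry even though no coordinates match.
set_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
set_b = [(5.0 - y, 5.0 + x) for x, y in set_a]
# Set C stretches one axis, which changes the pairwise distances.
set_c = [(x, 3.0 * y) for x, y in set_a]

n = len(set_a)
identity_coupling = [[1.0 / n if i == j else 0.0 for j in range(n)] for i in range(n)]

d_a = pairwise_distances(set_a)
print(gw_objective(d_a, pairwise_distances(set_b), identity_coupling))  # about 0: geometry preserved
print(gw_objective(d_a, pairwise_distances(set_c), identity_coupling))  # above 0: geometry distorted
```

Because the objective compares distances within each set rather than coordinates across sets, it needs no feature-to-feature correspondence between modalities, which is the property that makes it suitable for morphology and gene expression.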

4. Results

A. Validation on Simulated Data

Ablation Study: Removing any component (Adversarial, GW, Prior, or Reconstruction) led to distinct failures: modalities remained separated, geometric coherence broke, or clusters were misoriented.
Benchmarking: GeoAdvAE outperformed state-of-the-art methods (SCOT, UnionCom, scJoint, Crossmodal-AE) in cross-modal cell-type matching accuracy. Optimal transport methods failed to find the correct orientation, while latent alignment methods struggled with fine-grained structures.
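
One plausible way to compute a cross-modal cell-type matching accuracy is sketched below: each cell in one modality is matched to its nearest neighbor from the other modality in the shared latent space, and the match counts as correct when the two cell types agree. The paper's exact metric definition is not given in this summary, so this nearest-neighbor version, the `matching_accuracy` name, and the toy coordinates are all assumptions.

```python
import math


def matching_accuracy(query_emb, query_labels, ref_emb, ref_labels):
    """Fraction of query cells whose nearest reference cell shares their label."""
    hits = 0
    for z, label in zip(query_emb, query_labels):
        nearest = min(range(len(ref_emb)), key=lambda j: math.dist(z, ref_emb[j]))
        hits += ref_labels[nearest] == label
    return hits / len(query_emb)


# Toy 2-D latent coordinates (illustrative only, not data from the paper).
morph_emb = [(0.1, 0.0), (0.9, 1.0), (1.1, 0.9), (0.0, 0.2)]
morph_types = ["excitatory", "inhibitory", "inhibitory", "inhibitory"]
gex_emb = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1)]
gex_types = ["excitatory", "inhibitory", "excitatory"]

print(matching_accuracy(morph_emb, morph_types, gex_emb, gex_types))  # 0.75
```

In the toy example, three of the four morphology cells land next to a gene expression cell of the same type; the fourth sits in the wrong neighborhood and is counted as a miss.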

B. Validation on Patch-seq Neurons

Dataset: 645 neurons from the mouse motor cortex with paired GEX and morphology measurements (ground truth).
Performance: GeoAdvAE achieved 34% cross-modal cell-type matching accuracy, ahead of the graph-based baselines (ScDART: 28%, STACI: 27%) and the other methods tested.
Biological Interpretation: Using Integrated Gradients, the model identified genes associated with morphological transitions. It highlighted pathways known to regulate neuronal shape, such as axon guidance (e.g., Slit3, Rasgrp1) and Rho-family GTPases (actin remodeling), supporting the view that the latent space captures biologically meaningful signal.
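
Integrated Gradients attributes a model's output to its input features by averaging gradients along a straight path from a baseline input to the actual input. The sketch below approximates that path integral with a midpoint Riemann sum on a toy function with a hand-written gradient. The function `f` is a hypothetical stand-in for "one latent coordinate as a function of gene expression"; in practice the gradient would come from automatic differentiation of the trained encoder, and the all-zero baseline is an assumption.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn(point) must return the gradient of the model output with
    respect to each input feature at that point.
    """
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical toy model: f(x) = x0**2 + 3*x1, where x2 has no effect.
def f(x):
    return x[0] ** 2 + 3.0 * x[1]


def grad_f(x):
    return [2.0 * x[0], 3.0, 0.0]


x = [2.0, 1.0, 5.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
print(attributions)  # approximately [4.0, 3.0, 0.0]

# Completeness check: attributions sum to f(x) - f(baseline).
print(abs(sum(attributions) - (f(x) - f(baseline))) < 1e-9)  # True
```

The third feature receives zero attribution because the output does not depend on it, which mirrors how genes unrelated to a morphological transition would be scored as uninformative.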

C. Application to 5xFAD Microglia (Alzheimer's Disease)

Dataset: 98 microglial morphologies (CAJAL-quantified) and 31,948 scRNA-seq profiles from the 5xFAD mouse model.
Discovery:
- 1D Manifold: The integration revealed a continuous 1-dimensional axis spanning from ramified (homeostatic) to amoeboid (disease-associated) states.
- Gene-Morphology Links:
  - Ramified: Associated with DNA repair genes (tissue maintenance).
  - Amoeboid: Associated with cell killing genes (phagocytic/neurotoxic).
  - Novel Markers: Identified Ms4a6b (linked to ramified forms) and iron-loading genes Ftl1/Fth1 (linked to dystrophic states).
- Decoupled Signatures: Crucially, the model found that Complement/DAM signatures (e.g., C1qa, C3) were not correlated with the morphological axis. This suggests that complement activation is an upstream or orthogonal transcriptomic program that can occur without visible changes in cell shape, challenging the assumption that morphology is a perfect proxy for all functional states.
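
The difference between a gene that tracks the morphological axis and one that is decoupled from it can be illustrated with a rank correlation between expression and position along the axis. This sketch is not the authors' analysis: the use of Spearman correlation is an assumption, and the numbers are invented purely to show the two patterns.

```python
def ranks(values):
    """Rank positions (0 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Illustrative values only (not from the paper).
axis_position = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]          # ramified -> amoeboid
gene_tracking_shape = [5.1, 4.0, 3.2, 2.5, 1.1, 0.3]    # falls steadily along the axis
gene_decoupled = [2.0, 3.1, 1.0, 2.6, 1.5, 2.2]         # no consistent trend

print(spearman(axis_position, gene_tracking_shape))  # -1.0
print(spearman(axis_position, gene_decoupled))       # close to 0
```

A gene of the first kind would be a candidate morphology-linked marker, while a gene of the second kind behaves like the decoupled signatures described above: its expression varies, but not in step with cell shape.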

5. Significance

Bridging Form and Function: GeoAdvAE provides a scalable solution to connect cellular "form" (morphology) and "function" (transcriptomics) when joint profiling is impractical.
Mechanistic Insights: It identifies specific gene programs associated with morphological transitions (e.g., DNA repair vs. cell killing) and, conversely, identifies functional states that are invisible to morphology (e.g., complement activation).
Disease Relevance: In the context of Alzheimer's disease, the method helps dissect the continuum of microglial states, distinguishing between protective homeostatic mechanisms and pathological neurotoxicity, and suggesting candidate targets for therapeutic intervention.
Generalizability: The framework is applicable to other biological systems where multi-modal data exists but is unpaired, provided coarse biological priors can be defined.

Availability: The code is publicly available at https://github.com/turbodu222/GeoAdVAE.