Kernel spectral joint embeddings for high-dimensional noisy datasets using duo-landmark integral operators

Imagine you are a detective trying to solve a mystery, but you have two different sets of clues.

The Problem:
You have two notebooks filled with data.

Notebook A comes from a noisy, chaotic street market. It has thousands of entries, but many are scribbles, typos, or irrelevant chatter.
Notebook B comes from a quiet library. It also has thousands of entries, but they are written in a different handwriting, and while some are clear, others are faded or smudged.

Both notebooks are trying to tell you the same story (the "signal"), but they are buried under different types of noise. The old way of solving this was to either:

Look at them separately (and miss the big picture).
Glue them together into one giant, messy book (and get confused by the contradictions).
Try to force them to match perfectly (and end up with a fake story).

The Solution: The "Duo-Landmark" Method
The authors of this paper, Xiucai Ding and Rong Ma, invented a new way to read these notebooks together. They call it Kernel Spectral Joint Embeddings using Duo-Landmark Integral Operators. That's a mouthful, so let's break it down with a simple analogy.

The Analogy: The Two Tour Guides

Imagine you are in a massive, foggy city (the High-Dimensional Data). You want to find the hidden parks and landmarks (the Signal), but the fog is so thick you can barely see your hand in front of your face (the Noise).

You have two tour guides:

Guide A knows the city well but is shouting over a loud construction site (Noisy Dataset 1).
Guide B is whispering in a quiet library but is slightly lost and confused (Noisy Dataset 2).

The Old Way:
You'd ask Guide A for directions, then ask Guide B, then try to draw a map based on both. But because the fog is so thick, you might draw a map that leads you into a wall, or you might ignore the good parts of Guide B because Guide A is so loud.

The New "Duo-Landmark" Way:
Instead of treating them as separate voices, the authors propose a clever trick: They make the guides talk to each other.

The "Landmark" Concept:
Imagine Guide A is standing on a hill. Guide B is standing in a valley. They can't see each other clearly because of the fog.
The method asks: "If Guide A points to a landmark, where would Guide B point to the same thing?"
It creates a bridge between the two. It doesn't just look at Guide A's map; it uses Guide B to "clean up" Guide A's map, and vice versa.

The "Duo-Landmark" Operator (The Magic Bridge):
Think of this as a special translator.
- It takes a point from Notebook A.
- It asks, "Who in Notebook B looks like this?"
- It takes that connection and uses it to figure out the true shape of the city.
- Crucially, it doesn't force the two notebooks to be identical. It only connects the parts that actually match. If Notebook A talks about "Apples" and Notebook B talks about "Oranges," the bridge stays silent. It only builds a bridge where the fruit is the same.

The "Spectral" Part (The X-Ray Vision):
Once the bridge is built, the method uses math (specifically, something called "Spectral Analysis") to look through the fog.
- Imagine the noise is static on a radio.
- The "Spectral" method tunes the dial to the specific frequency where the two guides are singing in harmony.
- The static fades, and what remains is a clear, simplified (low-dimensional) map of the city's hidden parks. This is the Joint Embedding.

Why is this a Big Deal?

1. It handles "Messy" Data:
Real-world data (like genetic codes from cells) is incredibly noisy. Old methods often give up or produce garbage when the noise is too high. This method is like a noise-canceling headphone that works even if one ear is plugged with cotton and the other is in a windstorm. It uses the "good" parts of one dataset to fix the "bad" parts of the other.

2. It's Flexible:
Sometimes one dataset has 1,000 samples and the other has 10,000. Old methods get confused by this imbalance. This method says, "No problem! We'll just use the smaller group as a 'landmark' to help the bigger group, and the bigger group will help the smaller one."

3. It Knows When to Stop:
The method has a built-in "lie detector." Before it starts building the bridge, it checks: "Do these two datasets actually share any secrets?"
If you try to force a connection between a dataset about "Cooking" and a dataset about "Astrophysics," the method will say, "Nope, these don't match," and stop you from creating a fake map. This prevents scientists from drawing false conclusions.

The Real-World Impact

The authors tested this on Single-Cell Omics data (imagine trying to identify different types of cells in a drop of blood).

Before: Scientists would look at two different blood samples and struggle to see if the cell types matched because the "noise" (experimental errors) was so high.
After: Using this method, they could merge the two samples and separate the distinct cell types more clearly, even if one sample was much noisier than the other.

Summary

Think of this paper as inventing a super-powered pair of binoculars.

One lens is blurry (Dataset A).
The other lens is cracked (Dataset B).
By looking through both at the same time and using a special mathematical "bridge" (the Duo-Landmark Operator) to combine their views, you can see the landscape clearly, ignoring the blur and the cracks.

It allows scientists to combine different, messy sources of information to reveal the beautiful, hidden structures of the world that were previously invisible.

1. Problem Statement

The paper addresses the challenge of integrative analysis for two independently observed, high-dimensional, and noisy datasets, denoted as $\mathcal{X} = \{x_i\}_{i=1}^{n_1} \subset \mathbb{R}^p$ and $\mathcal{Y} = \{y_j\}_{j=1}^{n_2} \subset \mathbb{R}^p$ .

Key Characteristics of the Problem:

Independence: Unlike multi-view learning (where datasets share the same samples but different features), these datasets share the same feature space (dimension $p$ ) but consist of independent samples, with possibly different sample sizes $n_1$ and $n_2$ .
Partial Overlap: The datasets may share only partial underlying signal structures (e.g., shared biological pathways in single-cell omics) rather than identical manifolds.
High-Dimensionality & Noise: The data is subject to high-dimensional noise ( $p$ can be comparable to or larger than the sample sizes $n_1$ and $n_2$ ) and varying signal-to-noise ratios (SNR) between datasets.
Limitations of Existing Methods: Current approaches often lack theoretical guarantees in high dimensions, fail to adapt to imbalanced sample sizes or varying SNRs, and struggle to distinguish shared structures from dataset-specific noise or artifacts.

Goal: To learn a joint low-dimensional embedding for both datasets that captures their shared nonlinear structures, enabling downstream tasks like simultaneous clustering, denoising, and visualization.

2. Methodology: Kernel Spectral Joint Embeddings (KSJE)

The authors propose a novel algorithm (Algorithm 1) based on asymmetric kernel matrices and duo-landmark integral operators.

A. Algorithm Overview

Alignability Screening (Step 1):
- Before integration, the algorithm checks if the datasets share meaningful structures. It constructs a full symmetric kernel matrix on the merged data and evaluates the "purity" of local neighborhoods.
- If the datasets are not alignable (i.e., they share no common structure), the algorithm stops to prevent artificial alignment (distortion).
Duo-Landmark Kernel Matrix Construction (Step 2):
- Instead of merging data or using symmetric self-connections, the method constructs an asymmetric rectangular kernel matrix $K \in \mathbb{R}^{n_1 \times n_2}$ .
- Entries are defined as $K(i, j) = \exp(-\|x_i - y_j\|^2 / h_n)$ , using only cross-dataset distances.
- Adaptive Bandwidth: The bandwidth $h_n$ is selected adaptively based on the empirical distribution of cross-dataset distances (specifically the $\omega$ -percentile), ensuring robustness to unknown nonlinear structures and SNR levels.
Spectral Decomposition (Step 3):
- Perform Singular Value Decomposition (SVD) on the scaled matrix $(n_1 n_2)^{-1/2} K$ .
- The joint embeddings are derived from the leading left and right singular vectors, weighted by singular values and sample sizes.
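
Steps 2 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the function name `duo_landmark_embedding`, the defaults for `r` and `omega`, the choice to take the quantile over squared cross-dataset distances, and the exact weighting of the singular vectors are all assumptions made here for concreteness.

```python
import numpy as np

def duo_landmark_embedding(X, Y, r=2, omega=0.5):
    """Sketch of Steps 2-3: cross-dataset Gaussian kernel followed by an SVD.

    X is an (n1, p) array and Y an (n2, p) array; r and omega are
    illustrative defaults, not values taken from the paper.
    """
    n1, n2 = X.shape[0], Y.shape[0]
    # Squared Euclidean distances between every x_i and every y_j only;
    # no within-dataset distances are used.
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against round-off negatives
    # Assumption: bandwidth = omega-quantile of the squared cross distances.
    h = np.quantile(sq_dists, omega)
    K = np.exp(-sq_dists / h)  # rectangular n1 x n2 kernel matrix
    U, s, Vt = np.linalg.svd(K / np.sqrt(n1 * n2), full_matrices=False)
    # Assumption: weight the leading singular vectors by the singular value
    # and the square root of the sample size; the paper's weighting may differ.
    emb_X = np.sqrt(n1) * U[:, :r] * s[:r]
    emb_Y = np.sqrt(n2) * Vt[:r, :].T * s[:r]
    return emb_X, emb_Y, s

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Y = rng.normal(size=(50, 5))
emb_X, emb_Y, s = duo_landmark_embedding(X, Y, r=3)
print(emb_X.shape, emb_Y.shape)
```

Both datasets end up with `r` coordinates per sample, so rows of `emb_X` and `emb_Y` can be stacked and passed to a single clustering or visualization routine.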

B. Theoretical Framework: Duo-Landmark Integral Operators

The core theoretical innovation is the definition of Duo-Landmark Integral Operators ( $\mathcal{K}_1, \mathcal{K}_2$ ).

Joint Manifold Model: The clean signals are assumed to lie on two Riemannian manifolds ( $M_1, M_2$ ) that share a common low-dimensional substructure.
Convolutional Landmark Kernels: The method defines new kernels $k_1$ and $k_2$ via convolution. For example, $k_1(x_1, x_2)$ integrates the standard kernel over the other dataset's manifold ( $M_2$ ) acting as a "landmark."
$k_1(x_1, x_2) = \int_{M_2} k(x_1, z)k(z, x_2) d\mathbb{P}_2(z)$
Operator Properties: These operators map functions on one manifold to functions on the same manifold but incorporate information from the other. Crucially, $\mathcal{K}_1$ and $\mathcal{K}_2$ share the same non-zero eigenvalues, justifying the use of the asymmetric matrix's singular values to estimate the spectral properties of both datasets simultaneously.
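
The shared-eigenvalue property has a simple finite-sample counterpart that can be checked numerically. In the sketch below, $K K^\top$ and $K^\top K$ play the role of empirical versions of the two convolved kernels (each entry is a sum over the other dataset's points, a sample analogue of the integral above, up to normalization). The data, sizes, and fixed bandwidth are arbitrary assumptions chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 6, 9, 4
X = rng.normal(size=(n1, p))
Y = rng.normal(size=(n2, p))

# Cross-dataset Gaussian kernel with an arbitrary fixed bandwidth (assumption).
sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq / p) / np.sqrt(n1 * n2)

# Empirical analogues of the two landmark kernels.
A1 = K @ K.T  # n1 x n1: sums over the points of Y
A2 = K.T @ K  # n2 x n2: sums over the points of X

eig1 = np.sort(np.linalg.eigvalsh(A1))[::-1]
eig2 = np.sort(np.linalg.eigvalsh(A2))[::-1]
sing = np.linalg.svd(K, compute_uv=False)

shared = min(n1, n2)
# The leading eigenvalues agree and equal the squared singular values of K.
print(np.allclose(eig1[:shared], eig2[:shared]))
print(np.allclose(eig1[:shared], sing**2))
# The extra eigenvalues of the larger matrix are numerically zero.
print(np.allclose(eig2[shared:], 0.0, atol=1e-10))
```

This is why a single SVD of the rectangular matrix is enough: the left singular vectors are eigenvectors of `A1`, the right singular vectors are eigenvectors of `A2`, and both share the same non-zero spectrum.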

3. Key Contributions

Novel Theoretical Framework:
- Introduces the Joint Manifold Model and Duo-Landmark Integral Operators, providing a rigorous mathematical foundation for integrating independent datasets with partial overlap.
- Proves that the empirical singular vectors of the asymmetric kernel matrix converge to the eigenfunctions of these integral operators.
Robustness to High-Dimensional Noise:
- Establishes convergence rates under high-dimensional noise ( $p \to \infty$ ).
- Demonstrates that the method is robust even when one dataset has a low SNR, provided the combined signal strength dominates the noise.
- Identifies a phase transition: When noise dominates, the spectrum follows the free multiplicative convolution of two Marchenko-Pastur laws, allowing the method to detect when integration is futile.
Handling Imbalance and Heterogeneity:
- Theoretical guarantees hold regardless of the ratio between sample sizes ( $n_1$ vs $n_2$ ) or feature dimensions ( $p$ ).
- The adaptive bandwidth selection allows the method to handle datasets with vastly different noise levels.
Practical Screening Mechanism:
- Includes a built-in alignability screening step to prevent the integration of unrelated datasets, a common failure mode in existing methods.
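
The idea behind neighborhood "purity" can be illustrated with a rough sketch. The summary does not give the paper's exact statistic or decision rule, so the function `neighborhood_purity`, the choice of `k`, and the interpretation of the score are assumptions. Because a Gaussian kernel decreases monotonically with distance, nearest neighbors by kernel affinity coincide with nearest neighbors by Euclidean distance, so the sketch works with distances directly.

```python
import numpy as np

def neighborhood_purity(X, Y, k=10):
    """Mean fraction of each point's k nearest neighbours, in the merged
    data, that come from the point's own dataset (hypothetical statistic)."""
    Z = np.vstack([X, Y])
    labels = np.r_[np.zeros(len(X), dtype=int), np.ones(len(Y), dtype=int)]
    # Dense pairwise squared distances; fine for a small illustration only.
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq, np.inf)  # a point is not its own neighbour
    nn = np.argsort(sq, axis=1)[:, :k]
    same = labels[nn] == labels[:, None]
    return float(same.mean())

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Y_overlap = rng.normal(size=(100, 5))        # same distribution as X
Y_far = rng.normal(size=(100, 5)) + 50.0     # far away from X

purity_overlap = neighborhood_purity(X, Y_overlap)
purity_far = neighborhood_purity(X, Y_far)
print(purity_overlap, purity_far)
```

A purity near one half means the two datasets mix freely in local neighborhoods, while a purity near one means every point is surrounded only by its own dataset, which is the situation where integration should be refused. Where to draw the line between the two is left open here.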

4. Results

Theoretical Results

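Spectral Convergence: Theorem 3.8 shows that for clean signals, the empirical spectrum (singular values and singular vectors of the kernel matrix) converges to the spectrum (eigenvalues and eigenfunctions) of the duo-landmark integral operators at a rate of $O(n^{-1/2})$ .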
Noise Robustness: Theorem 3.12 proves that under high-dimensional noise, the algorithm remains consistent if the signal-to-noise ratio satisfies specific conditions ( $\sum \theta_i \gg p\sigma^2$ ).
Phase Transition: Theorem 3.14 characterizes the behavior when noise dominates, showing the spectrum converges to a deterministic limit derived from random matrix theory (free multiplicative convolution), distinct from the signal-driven regime.

Empirical Results

Simulations:
- Clustering: Outperformed PCA, Kernel PCA, Joint PCA, and other integration methods (like Roseland and LBDM) in simultaneous clustering tasks, especially when datasets had partial overlap or imbalanced noise.
- Manifold Learning: Successfully recovered low-dimensional structures (e.g., torus shapes) in noisy data by leveraging a cleaner external dataset, achieving higher Jaccard indices for neighborhood preservation.
Real-World Applications (Single-Cell Omics):
- Applied to scRNA-seq (human PBMCs) and scATAC-seq (mouse brain) datasets from different studies/conditions.
- Performance: Achieved superior clustering accuracy (Rand Index) compared to Seurat integration and other baselines.
- Robustness: Showed less variability across different embedding dimensions ( $r$ ) compared to competitors, indicating greater stability.
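
For readers unfamiliar with the two evaluation metrics mentioned above, here is a minimal sketch of how a Rand Index and a Jaccard-based neighborhood-preservation score are commonly computed. The function names, the value of `k`, and the averaging over points are assumptions; the paper's exact evaluation protocol may differ.

```python
import numpy as np
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which two labelings agree
    (both put the pair together, or both put it apart)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def knn_sets(Z, k):
    """Index set of the k nearest neighbours of each row of Z."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq, np.inf)  # exclude the point itself
    return [set(row) for row in np.argsort(sq, axis=1)[:, :k]]

def mean_jaccard(Z_ref, Z_emb, k=5):
    """Average Jaccard overlap between kNN sets computed in a reference
    space and in an embedding of the same points."""
    ref, emb = knn_sets(Z_ref, k), knn_sets(Z_emb, k)
    return float(np.mean([len(a & b) / len(a | b) for a, b in zip(ref, emb)]))

# Relabeling clusters does not change the Rand Index.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))

rng = np.random.default_rng(0)
Z = rng.normal(size=(40, 3))
# Uniform rescaling preserves every neighbourhood.
print(mean_jaccard(Z, 2.0 * Z))
```

Both scores lie between 0 and 1, with higher values indicating better agreement with the reference labels or the reference geometry.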

5. Significance and Impact

Bridging Theory and Practice: The paper provides the first rigorous theoretical justification for joint embedding of independent, high-dimensional noisy datasets, moving beyond heuristic approaches common in bioinformatics.
Handling "Real" Data Complexity: By explicitly modeling partial overlap, sample imbalance, and varying SNRs, the method addresses critical bottlenecks in modern data integration (e.g., merging datasets from different labs or technologies).
Artifact Prevention: The alignability screening is a crucial contribution, preventing the "hallucination" of shared structures where none exist, a significant risk in unsupervised integration.
Scalability: The algorithm scales as $O(n^2 p)$ , comparable to Kernel PCA, making it feasible for large-scale single-cell datasets (thousands to tens of thousands of cells).

In summary, Ding and Ma propose a mathematically grounded, robust, and practical framework for integrating heterogeneous high-dimensional data, leveraging the concept of "landmark" datasets to mutually enhance the representation of shared nonlinear structures.