Spectral Graph Filtering for Modality-Specific Representation Learning

Imagine you are trying to understand a complex story, but you only have two different cameras filming the same scene.

Camera A sees a Bulldog and a Yoda doll spinning on a turntable.
Camera B sees the same Bulldog spinning, but this time it's paired with a Rabbit doll.

Both cameras see the Bulldog spinning. That's the shared story (the common thread). But Camera A sees the Yoda spinning, and Camera B sees the Rabbit spinning. Those are the unique stories specific to each camera.

Most data analysis tools today are like detectives who only care about the shared story. They try to merge the two videos to figure out how fast the Bulldog is spinning, and they often ignore or "blur out" the Yoda and the Rabbit because they think those are just noise or distractions.

This paper introduces a new tool called DELVE (Differential Latent Variables Extraction). Instead of ignoring the unique parts, DELVE is designed specifically to find and highlight them. It asks: "What is happening in Camera A that Camera B doesn't see? And vice versa?"

How Does It Work? (The "Noise-Canceling" Headphones Analogy)

Think of the data from each camera as a song playing on a radio.

The Shared Song (the Bulldog) is playing loudly on both radios.
The Unique Songs (Yoda and Rabbit) are playing quietly in the background of only one radio each.

If you just listen to the radios, the loud shared song drowns out the quiet unique songs.

DELVE acts like a pair of high-tech, noise-canceling headphones:

It listens to Radio A to learn exactly what the "Shared Song" sounds like.
It then tunes Radio B and uses that knowledge to cancel out the Shared Song.
Suddenly, the quiet Rabbit song (the unique part) becomes crystal clear.

In technical terms, the authors build a "map" (a graph) of how the data points connect to each other for both cameras. They then use a mathematical filter to subtract the connections that look the same in both maps, leaving behind only the connections that are unique to one map.

Why Do We Need This?

In the real world, ignoring the "unique" parts can be a disaster.

In Medicine: Imagine studying cells. Two types of tests (Gene A and Gene B) might show that a group of cells looks the same. But if you only look at the "shared" result, you miss the fact that Gene B reveals a hidden, dangerous subtype of the cell that Gene A missed. DELVE finds that hidden danger.
In Robotics: A robot might have a camera and a microphone. The camera sees a door opening (shared with the sound of the motor), but the microphone hears a specific squeak that tells the robot the door is broken. DELVE helps the robot hear the squeak by ignoring the motor noise.

The "Magic" of the Method

The authors didn't just guess this would work; they proved it mathematically. They showed that, under the paper's modeling assumptions and with enough data, their method converges to these hidden, unique patterns, even if they are very subtle.

They tested DELVE on:

Toy Examples: Like the spinning dolls and geometric shapes, where they knew the answer beforehand. DELVE recovered the unique variables, such as the dolls' spinning angles, with high correlation to the ground truth.
Real Data: They used smartphone sensors (accelerometers) to track human movement.
- One sensor measured gravity (posture: sitting, standing, lying down).
- The other measured motion (walking, running).
- Standard methods mixed them up. DELVE successfully separated "how you are sitting" from "how you are walking," allowing for much better classification of activities.

The Bottom Line

For years, scientists have been obsessed with finding what different data sources have in common. This paper flips the script. It says, "Don't just look for the common ground; look for the differences."

DELVE is a new lens that allows us to see the unique, modality-specific secrets hidden in our data, turning what was once considered "noise" into valuable, actionable information. It's like finally hearing the solo instrument in a band that everyone else thought was just part of the background noise.

Here is a detailed technical summary of the paper "Spectral Graph Filtering for Modality-Specific Representation Learning" (DELVE).

1. Problem Statement

The paper addresses a critical gap in multimodal representation learning. While many existing methods focus on identifying shared latent structures (common factors) across different data modalities (e.g., gene expression and epigenetic markers from the same cell), they often fail to capture modality-specific (differential) latent variables.

The Setting: Consider two sensors, $A$ and $B$, observing the same set of $n$ objects. Each observation is generated by a set of latent variables:
- $\theta$: Shared latent variables (observable by both sensors).
- $\psi_A$: Variables unique to sensor $A$ (invisible to $B$).
- $\psi_B$: Variables unique to sensor $B$ (invisible to $A$).
The Goal: To learn low-dimensional embeddings that explicitly isolate and recover the differential variables ($\psi_A$ and $\psi_B$) while suppressing the shared signal ($\theta$). This is crucial for tasks like identifying cell subtypes that appear in one modality but not another, or distinguishing specific motion patterns in different camera views.
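As a rough illustration of this setting, the sketch below generates a torus-style toy dataset in which both sensors observe the shared angle and each sensor additionally observes one private angle. The function name `make_two_sensor_data`, the circle embedding, and the sample size are assumptions made for illustration, not the paper's exact construction.

```python
import numpy as np

def make_two_sensor_data(n=500, seed=0):
    """Toy two-sensor data: a shared angle theta plus one private angle per sensor.

    Hypothetical helper for illustration only (not taken from the paper).
    """
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)   # shared latent variable
    psi_a = rng.uniform(0.0, 2.0 * np.pi, n)   # visible to sensor A only
    psi_b = rng.uniform(0.0, 2.0 * np.pi, n)   # visible to sensor B only

    def circle(t):
        # Embed an angle as a point on the unit circle.
        return np.column_stack([np.cos(t), np.sin(t)])

    # Each sensor sees the shared circle and its own private circle (a torus).
    X_a = np.hstack([circle(theta), circle(psi_a)])
    X_b = np.hstack([circle(theta), circle(psi_b)])
    return X_a, X_b, theta, psi_a, psi_b

X_a, X_b, theta, psi_a, psi_b = make_two_sensor_data()
print(X_a.shape, X_b.shape)
```

In this toy, row `i` of `X_a` and row `i` of `X_b` describe the same object, which is the pairing the setting above relies on.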

2. Methodology: DELVE (Differential Latent Variables Extraction)

The authors propose DELVE, a spectral graph-based algorithm that leverages Graph Signal Processing (GSP) to filter out shared signals and preserve differential ones.

Core Concept: Graph Filtering via Connectivity Differences

The method constructs separate graphs for each modality ($G_A$ and $G_B$) based on the similarity of observations within that modality. The key insight is that the connectivity patterns of $G_A$ and $G_B$ agree where they are driven by the shared variable $\theta$ and differ where they are driven by the differential variables.

Algorithm Steps:

Graph Construction:
- Construct weight matrices $W^A$ and $W^B$ using kernel functions (e.g., Gaussian) on datasets $X^A$ and $X^B$.
- Compute symmetric normalized Laplacian matrices $L^A$ and $L^B$.
Spectral Filtering Design:
- The algorithm identifies the eigenvectors of $L^A$ corresponding to the shared variable $\theta$ (typically the low-frequency components).
- A high-pass filter $H(L^A)$ is designed based on the spectrum of $L^A$. This filter attenuates components correlated with the leading eigenvectors of $L^A$ (which represent $\theta$) while preserving high-frequency components.
Filtering and Extraction:
- Apply the filter to the operator of the other modality. For example, to extract $\psi_B$, the filter $H(L^A)$ is applied to the operator $P^B$ (derived from $G_B$):
  $\tilde{P}^B = H(L^A) P^B H(L^A)$
- The leading eigenvector of the filtered operator $\tilde{P}^B$ is the differential vector $\delta^B$, which encodes $\psi_B$.
Iterative Extension (Multiple Variables):
- To recover multiple differential variables (e.g., $\psi_B^1, \psi_B^2$), the method uses an iterative approach.
- After extracting the first differential vector, it is concatenated with the shared space representation.
- The filter is re-applied to the residual data to extract the next differential component, ensuring non-redundancy.
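The first three steps above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not specify the filter shape, so an ideal high-pass projector that zeroes the `k` lowest-frequency eigenvectors of $L^A$ stands in for $H(L^A)$; the operator is assumed to be $P = D^{-1/2} W D^{-1/2}$ with $L = I - P$; and the median bandwidth heuristic, the value `k=20`, and all function names are illustrative choices. The iterative extension is not covered.

```python
import numpy as np

def gaussian_affinity(X, scale=1.0):
    """Gaussian kernel weights W_ij = exp(-||x_i - x_j||^2 / eps)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    eps = scale * np.median(d2)  # bandwidth heuristic (an assumption)
    return np.exp(-d2 / eps)

def normalized_operator(W):
    """Assumed operator P = D^{-1/2} W D^{-1/2}, so that L = I - P."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def ideal_high_pass(L, k):
    """Stand-in for H(L): projector removing the k lowest-frequency eigenvectors."""
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    U = vecs[:, :k]
    return np.eye(L.shape[0]) - U @ U.T, U

def differential_vector(X_a, X_b, k=20):
    """Leading eigenvector of H(L^A) P^B H(L^A), a candidate for delta^B."""
    n = X_a.shape[0]
    P_a = normalized_operator(gaussian_affinity(X_a))
    P_b = normalized_operator(gaussian_affinity(X_b))
    H, U = ideal_high_pass(np.eye(n) - P_a, k)
    P_b_filt = H @ P_b @ H
    P_b_filt = 0.5 * (P_b_filt + P_b_filt.T)   # remove round-off asymmetry
    _, vecs = np.linalg.eigh(P_b_filt)
    return vecs[:, -1], H, U                    # eigenvector of largest eigenvalue

# Toy data: sensor A sees (theta, psi_a), sensor B sees (theta, psi_b).
rng = np.random.default_rng(0)
n = 200
theta, psi_a, psi_b = rng.uniform(0.0, 1.0, (3, n))
X_a = np.column_stack([theta, psi_a])
X_b = np.column_stack([theta, psi_b])

delta_b, H, U = differential_vector(X_a, X_b, k=20)
corr = abs(np.corrcoef(delta_b, np.cos(np.pi * psi_b))[0, 1])
print("abs. correlation of delta_b with cos(pi * psi_b):", corr)
```

By construction, `delta_b` is orthogonal to the removed eigenvectors of $L^A$, so whatever sensor A represents in its lowest frequencies cannot reappear in it. How well it tracks $\psi_B$ in practice depends on the bandwidth and on `k`, both of which would need tuning on real data.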

3. Key Contributions

Novel Algorithm (DELVE): A spectral method specifically designed to extract modality-specific latent variables, contrasting with the prevailing focus on shared structures.
Theoretical Guarantees:
- The authors establish asymptotic convergence under a Product Manifold Model.
- They prove that the differential vectors converge to the eigenfunctions of the Laplace-Beltrami operator associated with the unique manifold components ($\psi_A, \psi_B$).
- They demonstrate that eigenvectors associated with shared variables are nearly orthogonal to those of differential variables, justifying the filtering approach.
Iterative Framework: A procedure to recover multiple differential dimensions without redundancy, addressing the limitation of naive spectral decomposition where higher-order eigenvectors might mix shared and differential signals.
Comprehensive Evaluation: Extensive testing on synthetic data (lines, rectangles, tori) and real-world datasets (rotating dolls, accelerometer sensors) against state-of-the-art baselines.
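To make the product manifold assumption behind the theoretical guarantees more concrete, the block below recalls two standard facts about product manifolds. The notation ($\mathcal{M}_\theta$, $\mathcal{M}_{\psi_B}$, $f_i$, $g_j$, $\lambda_i$, $\mu_j$, $\mu_\theta$, $\mu_{\psi_B}$) is assumed here for illustration and may differ from the paper's.

```latex
% Assumed notation: sensor B samples a product of a shared and a private manifold.
\mathcal{M}_B = \mathcal{M}_\theta \times \mathcal{M}_{\psi_B}, \qquad
\Delta_{\mathcal{M}_B} = \Delta_{\mathcal{M}_\theta} + \Delta_{\mathcal{M}_{\psi_B}}.

% If \Delta_{\mathcal{M}_\theta} f_i = \lambda_i f_i and
% \Delta_{\mathcal{M}_{\psi_B}} g_j = \mu_j g_j, products are eigenfunctions:
\Delta_{\mathcal{M}_B}\bigl(f_i(\theta)\, g_j(\psi_B)\bigr)
  = (\lambda_i + \mu_j)\, f_i(\theta)\, g_j(\psi_B).

% Under a product measure, a function of \theta alone is orthogonal to any
% zero-mean function of \psi_B alone:
\int f(\theta)\, g(\psi_B)\, d\mu_\theta\, d\mu_{\psi_B}
  = \Bigl(\int f \, d\mu_\theta\Bigr)\Bigl(\int g \, d\mu_{\psi_B}\Bigr) = 0
  \quad \text{when } \int g \, d\mu_{\psi_B} = 0.
```

Eigenfunctions with constant $f_i$ depend on $\psi_B$ only. In the continuum limit these are the functions a differential vector is expected to approximate, and the orthogonality relation gives an intuition for why filtering out shared components need not destroy them.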

4. Experimental Results

The paper evaluates DELVE against two baselines: FKT (the Fukunaga-Koontz Transform) and the method of Shnitzer et al. (2019), an alternating diffusion extension referred to below as Shnitzer+.

Synthetic Data (Rectangle vs. Line, Tori):
- DELVE achieved near-perfect correlation ($>0.99$) with the ground-truth differential variables.
- Competing methods (Shnitzer+, FKT) failed to isolate the differential variables, often recovering the shared variable or producing noise.
Rotating Dolls (Real Image Data):
- DELVE successfully recovered the rotation angles of the unique dolls (Yoda and Rabbit) with high correlation ($>0.92$).
- Shnitzer+ showed negligible correlation, failing to separate the unique motion from the shared bulldog motion.
Accelerometer Sensors (Human Activity Recognition):
- In clustering tasks (walking vs. sitting vs. lying), using only the differential vectors from DELVE yielded higher Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) scores than using shared vectors alone.
- Combining shared and differential vectors provided the best performance, indicating that the differential components contain complementary, non-redundant information.
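For readers unfamiliar with the Adjusted Rand Index used in the clustering comparison, here is a minimal NumPy sketch of the standard pair-counting formula. The label arrays are made-up toy values and not results from the paper. The function name `adjusted_rand_index` is chosen here for illustration.

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    _, ti = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, pi = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)

    # Contingency table: C[i, j] = number of points in true class i and cluster j.
    C = np.zeros((ti.max() + 1, pi.max() + 1), dtype=np.int64)
    np.add.at(C, (ti, pi), 1)

    def comb2(x):
        # Number of unordered pairs, x choose 2.
        return x * (x - 1) / 2.0

    sum_ij = comb2(C).sum()
    sum_a = comb2(C.sum(axis=1)).sum()
    sum_b = comb2(C.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(len(ti))
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:   # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Toy labels (hypothetical): 0 = walking, 1 = sitting.
truth = [0, 0, 1, 1]
ari_relabelled = adjusted_rand_index(truth, [1, 1, 0, 0])  # same partition, renamed
ari_crossed = adjusted_rand_index(truth, [0, 1, 0, 1])     # partition cuts across truth
print(ari_relabelled, ari_crossed)
```

The index is invariant to how clusters are named and is corrected for chance agreement, which is why it can be negative when a clustering cuts across the true classes.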

5. Significance and Impact

Paradigm Shift: The paper challenges the assumption that shared structure is the only valuable signal in multimodal data. It demonstrates that differential structure is often the key to distinguishing fine-grained classes (e.g., specific cell subtypes or activity nuances).
Theoretical Foundation: By providing convergence guarantees under a product manifold model, the work bridges the gap between heuristic spectral methods and rigorous manifold learning theory.
Practical Utility: The method is computationally efficient ($O(n^2)$ with approximations) and applicable to diverse domains including computational biology (multi-omics), neuroscience, and computer vision.
Future Directions: The authors suggest extending the method to non-Euclidean kernels, integrating semi-supervised learning, and applying it to complex modalities like neuroimaging and genomics.

In summary, DELVE provides a principled, theoretically grounded framework for "unmixing" multimodal data to reveal hidden, modality-specific insights that are otherwise lost in standard joint embedding techniques.
