Imagine you are trying to teach a computer to predict how a new medicine will behave—will it dissolve in water? Will it cross the blood-brain barrier? To do this, the computer needs to "see" the molecule.
For decades, scientists have used two main ways to show molecules to computers. This paper stages a head-to-head race between the two to see which method works best, especially when you don't have a massive amount of data (a common situation in drug discovery).
Here is the breakdown of the study using simple analogies:
1. The Two Ways to Describe a Molecule
Think of a molecule like a complex city.
- The Old Way (Fingerprints/ML): Imagine you have a Wanted Poster. It lists specific details: "Has 5 red buildings, 2 bridges, and a park." This is a Molecular Fingerprint. It's a fixed list of facts created by human experts. It's great, but it's static. It doesn't tell you how the buildings connect, just that they exist.
- The New Way (GNNs): Imagine giving the computer a 3D Map of the city where the streets and buildings are connected in real-time. This is a Graph Neural Network (GNN). Instead of a list, the computer looks at the structure: "How does the park connect to the bridge? How does the traffic flow?" It learns the relationships automatically.
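To make the contrast concrete, here is a toy sketch of the two representations. This is not the paper's code: the feature names, bit-vector size, and the three-atom "molecule" are invented for illustration. The fingerprint is an order-free checklist of hashed features; the message-passing step lets each node blend in its neighbors' features, which is the structure-aware ingredient fingerprints lack.

```python
import hashlib

def fingerprint(features, n_bits=16):
    """'Wanted Poster': hash each known feature into a fixed bit vector."""
    bits = [0] * n_bits
    for f in features:
        # Deterministic hash so the same feature always sets the same bit.
        idx = int(hashlib.md5(f.encode()).hexdigest(), 16) % n_bits
        bits[idx] = 1  # records presence only -- no connectivity
    return bits

def message_pass(adjacency, node_feats):
    """'3D Map': each node averages its own and its neighbors' features."""
    new_feats = {}
    for node, neighbors in adjacency.items():
        group = [node_feats[node]] + [node_feats[n] for n in neighbors]
        new_feats[node] = sum(group) / len(group)
    return new_feats

# A 3-atom toy "molecule": a 0-1-2 chain, one scalar feature per atom.
adj = {0: [1], 1: [0, 2], 2: [1]}
feats = {0: 1.0, 1: 0.0, 2: 1.0}

print(fingerprint(["ring", "OH-group"]))  # order-free checklist
print(message_pass(adj, feats))           # structure-aware update
```

Note how swapping the two end atoms' features would change nothing in the fingerprint but would propagate through the graph update: that is the whole difference in one line.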
2. The Race: Who Wins?
The researchers took four different types of "3D Map" readers (called GCN, GAT, GIN, and GraphSAGE) and pitted them against the "Wanted Poster" readers (standard Machine Learning models) on four different types of chemical puzzles.
The Result:
- The "Wanted Poster" (Old ML) won the small race. When the dataset was small (only 1,000 molecules), the old method was more accurate.
- Why? Think of it like teaching a child. If you only show them 10 pictures of dogs, they learn best if you give them a simple checklist ("Has fur, has four legs"). If you try to teach them the complex 3D structure of a dog with only 10 pictures, they get confused. The "Wanted Poster" acts as a helpful cheat sheet that prevents the computer from guessing wrong.
- The "3D Map" (GNN) struggled alone. On their own, the new models were less accurate at predicting the answers; they needed more data before the complex structural patterns paid off.
3. The Winning Strategy: The "Super-Team"
Here is the paper's biggest discovery. Instead of choosing one or the other, the researchers built a Hybrid Team.
They took the 3D Map (GNN) and the Wanted Poster (Fingerprint) and glued them together.
- The Result: This "Super-Team" beat both the old method and the new method alone.
- The Analogy: It's like having a detective who is great at spotting physical clues (the fingerprint) and a detective who is great at understanding social connections and traffic patterns (the GNN). When they work together, they solve the case much faster and more accurately than either could alone.
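In practice, "gluing them together" usually means concatenating the GNN's learned embedding with the fingerprint vector before the final prediction layer. The sketch below assumes that design; the embedding values, bits, and weights are made-up placeholders, and the linear layer stands in for whatever predictor the paper trained.

```python
# Hedged sketch of the "Super-Team": join a learned graph embedding
# with a fixed fingerprint, then feed both to one predictor.

def hybrid_features(gnn_embedding, fingerprint_bits):
    """Joint representation: structure-aware half + expert-checklist half."""
    return list(gnn_embedding) + list(fingerprint_bits)

def linear_predict(features, weights, bias=0.0):
    """A stand-in for the final prediction layer."""
    return sum(f * w for f, w in zip(features, weights)) + bias

gnn_emb = [0.2, -0.5, 0.9]   # pretend GNN readout for one molecule
fp_bits = [1, 0, 1, 1]       # pretend fingerprint for the same molecule
joint = hybrid_features(gnn_emb, fp_bits)
print(len(joint))  # 7 features reach the predictor instead of 3 or 4
```

Because the two halves encode different information, the predictor can lean on the fingerprint when data is scarce and on the learned embedding when structure matters, which is exactly the complementarity the "two detectives" analogy describes.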
4. The "Brain Scan" Analysis (CKA)
The researchers didn't just look at who won; they looked at how the models thought. They used a tool called CKA (Centered Kernel Alignment) to see if the models were "thinking" the same way.
- The "Clone" Effect: They found that three of the four "3D Map" models (GCN, GraphSAGE, GIN) were essentially thinking in almost the exact same way. They were like three students who all memorized the same textbook. They were very similar to each other.
- The "Unique" Thinker: One model, called GAT, was different. It paid attention to specific connections (like a detective focusing on a specific suspect). It thought differently from the others.
- The "Alien" Language: Most importantly, they found that the "3D Map" models and the "Wanted Poster" models were speaking completely different languages. They were looking at the molecule from totally different angles.
- Why this matters: Because they were so different, combining them was like adding a new dimension to the problem. They didn't overlap; they filled in each other's blind spots.
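The "brain scan" itself has a simple formula. Linear CKA compares two representation matrices X and Y (one row per molecule, one column per learned feature) as CKA = ||YᵀX||²_F / (||XᵀX||_F · ||YᵀY||_F) after centering each column, giving 1.0 for models that "think" identically and values near 0 for unrelated ones. Below is a minimal pure-Python sketch of that formula (fine for tiny matrices only; the toy data is invented, and the paper may use the kernel-based variant instead):

```python
# Hedged sketch of linear CKA, the representation-similarity index
# the researchers use. Toy matrices only; real use needs NumPy.

def center_columns(X):
    """Subtract each column's mean (CKA requires centered features)."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]

def cross_fro(A, B):
    """Frobenius norm of A^T B for two column-centered matrices."""
    total = 0.0
    for i in range(len(A[0])):
        for j in range(len(B[0])):
            dot = sum(A[k][i] * B[k][j] for k in range(len(A)))
            total += dot * dot
    return total ** 0.5

def linear_cka(X, Y):
    """CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)."""
    Xc, Yc = center_columns(X), center_columns(Y)
    return cross_fro(Xc, Yc) ** 2 / (cross_fro(Xc, Xc) * cross_fro(Yc, Yc))

# Two "models" with identical representations score exactly 1.0 --
# the "Clone" effect seen for GCN, GraphSAGE, and GIN.
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0]]
print(round(linear_cka(X, X), 3))  # 1.0
```

CKA is also invariant to rescaling, so two models whose features differ only by a constant factor still score 1.0; low scores therefore signal a genuinely different "language", which is what the GNN-vs-fingerprint comparison revealed.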
The Bottom Line
If you are trying to predict chemical properties with a small amount of data:
- Don't rely on just the new "3D Map" models; they need more data to shine.
- Don't rely on just the old "Wanted Poster" lists; they miss the structural nuance.
- Combine them. The best approach is to let the computer look at the molecule's structure and its fixed features simultaneously.
In short: The paper proves that while new AI models are powerful, they aren't magic yet. The smartest move is to let the new AI learn from the old, trusted experts, creating a "Super-Team" that is better than the sum of its parts.