Imagine you are trying to teach a robot to recognize the difference between two houses. One house is made of red bricks, and the other is made of blue bricks. But here's the catch: the red house is always a castle, and the blue house is always a cottage.
If you ask the robot, "Is this a castle?" it might just look at the color. "Red? Castle! Blue? Cottage!" It gets the answer right, but it hasn't actually learned what a castle looks like. It's just memorized that "Red = Castle."
This is exactly the problem scientists face with AI models that predict how molecules behave. Every molecule can be described by two kinds of information:
- Ingredients (Composition): What atoms are inside? (Carbon, Hydrogen, Oxygen, etc.)
- Shape (Geometry): How are those atoms arranged in 3D space?
Most AI models are great at predicting properties (like energy or reactivity), but researchers didn't know if the models were actually learning the shape of the molecule or just cheating by looking at the ingredients.
This paper introduces a new way to "peel back the layers" of these AI models to see what they are really thinking. Here is the breakdown in simple terms:
1. The Magic Trick: "The Ingredient Eraser"
The researchers invented a method called Compositional Probe Decomposition (CPD). Think of it like a magic eraser.
- The Problem: If you ask an AI, "What is the energy of this molecule?" and it says "High," is it because of the shape or the ingredients? It's hard to tell because they are mixed together.
- The Solution: The researchers use math to forcibly remove all information about the ingredients from the AI's brain. They strip away the "Red/Blue" signal.
- The Test: Now, they ask the AI, "Okay, you don't know the ingredients anymore. Can you still tell me about the shape?"
If the AI can still answer correctly, it means it truly learned the geometry. If it fails, it was just cheating by looking at the ingredients.
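The core idea of the "eraser" can be sketched with plain linear algebra. This is a toy illustration, not the paper's actual CPD procedure: we build fake embeddings that linearly mix a composition signal and a geometry signal, project out the directions that track composition, and then check what a simple linear probe can still read out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (NOT the paper's exact math): each molecule's embedding Z
# linearly mixes a "composition" signal and a "geometry" signal.
n, d = 500, 16
composition = rng.normal(size=(n, 2))   # stand-in for ingredient features
geometry = rng.normal(size=(n, 3))      # stand-in for shape features
Z = composition @ rng.normal(size=(2, d)) + geometry @ rng.normal(size=(3, d))

# "Ingredient eraser": find the embedding directions that track
# composition (via least squares), then project them out of Z.
B, *_ = np.linalg.lstsq(composition, Z, rcond=None)  # (2, d) composition directions
Q, _ = np.linalg.qr(B.T)                             # orthonormal basis of that subspace
Z_erased = Z - Z @ Q @ Q.T

def linear_probe_error(Z, target):
    # Fraction of the target a linear probe CANNOT explain (1.0 = nothing left).
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    resid = target - Z @ coef
    return np.linalg.norm(resid) / np.linalg.norm(target)

ratio_c = linear_probe_error(Z_erased, composition)  # close to 1: ingredients gone
ratio_g = linear_probe_error(Z_erased, geometry)     # well below 1: shape survives
print(ratio_c, ratio_g)
```

After erasure, a linear probe can no longer recover the ingredients at all, but the shape signal is still almost fully readable, which is exactly the test described above.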
2. The Big Discovery: The "Training Goal" Matters Most
The researchers tested 10 different AI models. They found a huge gap in performance. Some models were amazing at seeing shapes after the ingredients were erased; others were terrible.
They discovered that what the model was trained to do mattered way more than how the model was built.
- The Analogy: Imagine two students taking a test.
- Student A studied for a test on "How to build a bridge."
- Student B studied for a test on "How to paint a picture."
- Now, you ask both students to solve a "Bridge Physics" problem.
- Student A (trained on bridges) solves it easily, even if they use a slightly different method.
- Student B (trained on painting) struggles, even if they have a "smarter" brain architecture.
The paper found that models trained specifically on electronic properties (which depend heavily on shape) were much better at understanding geometry than models trained just on total energy (which depends mostly on ingredients).
The Lesson: If you want an AI to understand the 3D shape of a molecule, don't just give it a fancy architecture; train it on a task that requires it to pay attention to the shape.
3. The "Specialized Mailroom" (Information Routing)
Some models, like MACE, have a special internal structure. They have different "channels" for different types of information, kind of like a mailroom with specific slots for "Letters" and "Packages."
- The researchers found that in MACE, Scalar information (like the HOMO-LUMO gap, which is a number) goes into the "Letter" slots.
- Vector information (like a Dipole Moment, which has a direction) goes into the "Package" slots.
It's as if the model has learned to sort its own mail perfectly. However, another model called ViSNet didn't do this; it threw everything into one big pile. This shows that just having a fancy structure isn't enough; the model has to learn to use it correctly.
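The "Letter vs. Package" distinction is really about symmetry: scalar features should not change when you rotate the molecule, while vector features should rotate along with it. MACE's real channels are built from spherical harmonics and are more elaborate, but the split can be sketched with two hand-made features (the functions below are illustrative, not MACE code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "molecule": 5 atoms with 3D positions and per-atom charges.
pos = rng.normal(size=(5, 3))
q = rng.normal(size=5)

def scalar_channel(pos):
    # Invariant ("Letter") feature: sum of all pairwise distances.
    diffs = pos[:, None, :] - pos[None, :, :]
    return np.linalg.norm(diffs, axis=-1).sum()

def vector_channel(pos, q):
    # Equivariant ("Package") feature: a dipole-like charge-weighted sum.
    return q @ pos

# A random 3D rotation: QR of a Gaussian matrix gives an orthogonal matrix.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))

s0, s1 = scalar_channel(pos), scalar_channel(pos @ R.T)
v0, v1 = vector_channel(pos, q), vector_channel(pos @ R.T, q)

print(np.isclose(s0, s1))         # scalar slot: unchanged by the rotation
print(np.allclose(v1, v0 @ R.T))  # vector slot: rotates with the molecule
```

A quantity like the HOMO-LUMO gap behaves like `scalar_channel` (rotation does nothing to it), while a dipole moment behaves like `vector_channel` (it turns when the molecule turns), which is why storing them in separate slots makes sense.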
4. The Trap: Don't Use "Over-Engineered" Detectors
One of the most important warnings in the paper is about how we test these models.
The researchers tried using a very powerful, complex detector (called a Gradient Boosted Tree) to check the AI's knowledge. It gave amazing scores! But when they used a simple, linear detector, the scores dropped to zero.
The Analogy: Imagine you are trying to see if a room is empty.
- You use a simple ruler (Linear Probe). It says, "The room is empty."
- You use a super-complex laser scanner (Non-linear Probe). It finds a tiny, invisible dust particle and says, "The room is full!"
The complex scanner was "hallucinating" information. It was so good at finding patterns that it reconstructed the "ingredients" the researchers had tried to erase. The paper warns: When testing what an AI has learned after removing a variable, always use simple, linear tests. Complex tests will lie to you.
Summary: What Should We Take Away?
- Training is King: If you want an AI to understand molecular shapes, train it on tasks that require shape awareness. A fancy architecture won't save a model trained on the wrong task.
- Data Diversity Helps: Training on a massive, diverse dataset helps, but it can't fully fix a model trained on the wrong goal.
- Keep It Simple: When trying to see what an AI knows, don't use overly complex tools to test it, or you might trick yourself into thinking it knows more than it does.
In short, this paper gave us a better "X-ray" to see inside AI brains, proving that how you teach a model is more important than what it's built out of.