This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are teaching a super-smart robot how to write a song. You show it millions of songs and ask it to learn the rules of music. Eventually, the robot gets really good at predicting the next note in a song. It sounds like a genius composer.
But here's the catch: Does the robot actually understand music theory, or is it just guessing based on how often certain notes appear together?
This paper is a "final exam" for a new generation of AI models that study DNA (the "language of life"). The researchers wanted to know: Do these AI models actually understand how genes work, or are they just cheating by looking for simple patterns?
The Setup: The "Promoter" Puzzle
To understand the test, you need a tiny bit of biology. Think of a gene as a light switch. To turn the light on, you need two specific things in the right order:
- The Switch (the -10 box): A specific sequence of letters (like TATAAT).
- The Partner (the -35 box): Another sequence nearby.
Crucially, these two must be a precise distance apart (about 17 letters). If they are too close or too far, the light won't turn on.
Sometimes, if the "Switch" is broken (weak), nature has a backup plan: a "Helper" element (called an UP element) that is very rich in the letters A and T. But this Helper only works if it is placed in a specific spot just before the Switch. If you put the exact same Helper in the wrong spot, it does nothing.
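This positional "grammar" can be sketched in a few lines of Python. This is a hypothetical toy for intuition, not the paper's code: the box sequences, the 17-letter spacer window, and the UP-element placement window are all illustrative assumptions.

```python
def promoter_works(seq, minus35_at, minus10_at, up_at=None):
    """Toy rule: the right parts must be present AND correctly spaced.

    All numbers here are illustrative assumptions, not the paper's code.
    """
    MINUS10 = "TATAAT"  # the "Switch" (-10 box) consensus
    MINUS35 = "TTGACA"  # the -35 box consensus
    # 1. Both boxes must match their expected letters.
    if seq[minus10_at:minus10_at + 6] != MINUS10:
        return False
    if seq[minus35_at:minus35_at + 6] != MINUS35:
        return False
    # 2. The gap between the boxes must be close to 17 letters.
    spacer = minus10_at - (minus35_at + 6)
    if not 15 <= spacer <= 19:
        return False
    # 3. A 20-letter UP element helps only if it ends just
    #    upstream of the -35 box (a small illustrative window).
    if up_at is not None and not 0 <= minus35_at - (up_at + 20) <= 5:
        return False
    return True
```

Notice that moving the same letters to a new position flips the answer from True to False; position, not composition, decides.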
The Test: The "Mechanistic Invariance Test" (MIT)
The researchers created a test with 650 DNA sequences to see if the AI models could tell the difference between Position and Composition.
They gave the models two types of puzzles:
- Puzzle A (The Real Deal): A broken switch with a Helper placed in the correct spot. (This should work).
- Puzzle B (The Scam): A broken switch with the exact same Helper letters, but placed in the wrong spot (far away). (This should fail).
If the AI truly understands biology, it should say: "Puzzle A is good, Puzzle B is bad."
If the AI is just cheating, it might say: "Both are good! They both have lots of A's and T's!"
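The "cheating" failure mode can be made concrete with a toy sketch (hypothetical sequences, not the paper's benchmark): both puzzles contain exactly the same letters, so any score based on letter counts alone cannot tell them apart.

```python
# Made-up illustrative sequences, not from the paper's 650-sequence test set.
UP = "AAATTATTTT" * 2                   # a toy A/T-rich Helper (UP element)
CORE = "TTGACA" + "G" * 17 + "TACAAT"   # toy core with a "broken" Switch

puzzle_a = UP + CORE   # Helper in the correct spot (just before the core)
puzzle_b = CORE + UP   # the exact same Helper letters, wrong spot

def at_fraction(seq):
    """What a 'letter counter' sees: overall A/T content."""
    return sum(base in "AT" for base in seq) / len(seq)

def helper_in_place(seq):
    """What a position-aware rule sees: does the Helper sit right before the core?"""
    i = seq.find(UP)
    return i != -1 and i + len(UP) == seq.find(CORE)
```

A letter counter gives both puzzles an identical score; only the positional check separates them.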
The Results: The AI is "Compositionally Blind"
The results were shocking. The AI models failed the test spectacularly.
- They are "Letter Counters": The models didn't care where the letters were. They just saw that the "Helper" sequences were full of A's and T's. Since A's and T's often appear in working genes, the models thought, "Oh, lots of A's and T's = Good Gene!"
- They got the position wrong: In some cases, the AI actually rated the wrong position higher than the correct one! It was like a music teacher saying, "This song is great because it has a lot of C notes," even if those C notes were played at the wrong time.
- Bigger isn't better: The researchers tested models with billions of parameters (the "smartest" AIs). Surprisingly, the bigger the model, the worse it got at this specific logic. They just got better at counting letters, not understanding the rules.
The "Simple" Solution
Here is the most embarrassing part for the AI industry:
The researchers built a tiny, simple model with only 100 parameters (basically a calculator) that used basic biological rules.
- The Giant AI (Billions of parameters): Failed.
- The Tiny Calculator (100 parameters): Got a perfect score.
This proves the problem isn't that the AI isn't "smart" enough or needs more data. The problem is that the AI is learning the wrong shortcuts. It's memorizing the "vibe" of the DNA rather than the "grammar" of how it works.
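For intuition, a rule-based scorer in that parameter range might look like two tiny position-weight tables plus a spacer penalty. This is a hypothetical sketch, not the authors' actual model: two 6-position-by-4-letter weight tables are 48 numbers, so the whole thing stays around a hundred parameters.

```python
BASES = "ACGT"

# Toy weight tables: +1 for the consensus letter at each position, 0 otherwise.
# (Illustrative values; a real model would learn these weights from data.)
PWM_35 = [{b: 1.0 if b == c else 0.0 for b in BASES} for c in "TTGACA"]
PWM_10 = [{b: 1.0 if b == c else 0.0 for b in BASES} for c in "TATAAT"]

def rule_score(seq, minus35_at, minus10_at):
    """Sum the box matches, then penalize deviation from a 17-letter spacer."""
    s = sum(PWM_35[i][seq[minus35_at + i]] for i in range(6))
    s += sum(PWM_10[i][seq[minus10_at + i]] for i in range(6))
    spacer = minus10_at - (minus35_at + 6)
    return s - 0.5 * abs(spacer - 17)
```

Because position is built into the scoring rule itself, this kind of model cannot be fooled by the right letters in the wrong place.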
The Analogy: The "Red Car" vs. The "Traffic Light"
Imagine you are trying to teach a self-driving car to stop at a red light.
- The AI's current method: It learns that "red things usually mean stop." So, if it sees a red stop sign, it stops. If it sees a red fire truck, it stops. If a glitch makes a green light glow red, it stops. It is just reacting to the color red.
- What it should learn: It needs to understand that Red is a specific signal in a specific context (a traffic light) that means Stop.
The AI models in this paper are like the car that stops at every red object. They see the "Red" (the A/T rich DNA) and think it's a working gene, even if the "traffic light" is in the wrong place.
Why Does This Matter?
If we use these AI models to design new medicines or edit genes (gene therapy), we are in trouble.
- If the AI thinks a gene works just because it has the right "letters," even when those letters are in the wrong place, the medicine might fail or cause harm.
- The paper argues that before we trust these AIs with human health, we need to redesign them to understand position and rules, not just patterns and statistics.
In short: The AI is a brilliant mimic that can copy the sound of a symphony, but it doesn't know how to conduct the orchestra. We need to teach it the conductor's baton, not just the sheet music.