This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
The Big Picture: The "Fake" Champion
Imagine a video game tournament where the goal is to predict the layout of a complex maze (the RNA structure) just by looking at a list of ingredients (the RNA sequence).
For a while, the "champions" of this tournament were Foundation Models (massive AI systems trained on huge amounts of data). They were beating everyone else, getting near-perfect scores on the official test maps. Everyone thought, "Wow, AI has finally cracked the code of RNA!"
This paper says: "Wait a minute. They aren't actually that good."
The authors argue that the previous tests were rigged. The AI wasn't learning the rules of the maze; it was just memorizing the specific maps it had seen before. When you gave it a new type of maze it had never seen, it got lost.
The Problem: Cheating on the Test
The researchers found three main ways the old tests were "too easy":
- The "Copycat" Problem: The test maps were too similar to the training maps. It's like studying for a driving test by practicing on the exact same parking lot you'll be tested on, rather than learning how to drive on a rainy highway.
- The "Family Secret" Problem: The test included RNA molecules from the same "family" as the training data. It's like a student taking a math test where the questions use the same numbers as the homework, just shuffled around.
- The "Batch" Glitch: The way computers processed the data was flawed. If you put a short RNA and a long RNA together in a batch, the computer's "padding" (filler space) would accidentally change the answer for the short one. It's like a chef changing the taste of a small soup because they are cooking it in the same giant pot as a huge stew.
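The padding glitch above can be illustrated with a toy sketch (this is an illustrative example, not the paper's actual code): when sequences of different lengths are batched together, the short one is "padded" out to the longest length, and if the model forgets to mask out that padding, the filler values leak into the short sequence's result.

```python
import numpy as np

def batch_score(batch, lengths, mask_padding):
    # Toy "model": mean-pool the per-position values of each sequence.
    # With mask_padding=False, the padded zeros are wrongly included in
    # the mean, so a short sequence's score depends on how much padding
    # it happened to receive in that batch.
    scores = []
    for seq, n in zip(batch, lengths):
        if mask_padding:
            scores.append(seq[:n].mean())  # ignore the padding
        else:
            scores.append(seq.mean())      # padding dilutes the answer
    return scores

short = np.array([1.0, 3.0])                          # true mean = 2.0
long_ = np.array([2.0, 2.0, 2.0, 2.0, 2.0, 2.0])

# Pad the short sequence with zeros to the batch's max length.
max_len = max(len(short), len(long_))
padded_short = np.pad(short, (0, max_len - len(short)))

batch, lengths = [padded_short, long_], [len(short), len(long_)]
print(batch_score(batch, lengths, mask_padding=False))  # short seq's score is wrong
print(batch_score(batch, lengths, mask_padding=True))   # correct regardless of padding
```

With masking, the short sequence scores 2.0 no matter what it shares a batch with; without it, the score changes with the batch composition, which is exactly the "small soup in a giant pot" effect.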
The Solution: CHANRG (The "Hard Mode" Benchmark)
The authors created a new, stricter testing ground called CHANRG. Think of it as a "Survival Mode" for AI.
- Structure-Aware Deduplication: They didn't just remove identical sequences; they removed sequences that looked structurally the same, even if the letters were different. This ensures the AI can't cheat by recognizing a "look-alike."
- The Three "Out-of-Distribution" (OOD) Challenges: Instead of just testing on familiar data, they tested the AI on three terrifying scenarios:
- GenA: A completely new architecture of RNA the AI has never seen.
- GenC: RNA from a completely different evolutionary "clan" (like training on cats, then testing on dogs).
- GenF: Rare RNA families where the AI has very little data to learn from.
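The structure-aware deduplication idea can be sketched as follows. This is a simplified illustration, not the paper's pipeline: the helper `structure_aware_dedup` and its position-match similarity are made up for this example (real pipelines use alignment-based or tree-edit distances on secondary structures). The point is that two RNAs with different letters but the same dot-bracket structure count as duplicates.

```python
def structure_aware_dedup(entries, max_similarity=0.9):
    """Keep only entries whose secondary structure (dot-bracket string)
    is not too similar to any structure already kept.

    Toy similarity: fraction of matching positions; equal-length
    structures only, as a stand-in for a proper structural alignment.
    """
    def similarity(a, b):
        if len(a) != len(b):
            return 0.0  # toy shortcut; real tools align first
        return sum(x == y for x, y in zip(a, b)) / len(a)

    kept = []
    for seq, struct in entries:
        # Drop this entry if its structure looks like one we already kept,
        # even when the sequence letters are completely different.
        if all(similarity(struct, s) < max_similarity for _, s in kept):
            kept.append((seq, struct))
    return kept

entries = [
    ("GGGAAACCC",  "(((...)))"),   # a hairpin
    ("CCCUUUGGG",  "(((...)))"),   # different letters, same structure -> dropped
    ("GGGAAACCCA", "(((...)))."),  # genuinely different structure -> kept
]
print(structure_aware_dedup(entries))  # the "look-alike" second entry is gone
```

Sequence-level deduplication would keep all three entries, because no two sequences are identical; structure-aware deduplication catches the look-alike.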
The Results: The Leaderboard Flips
When they ran the old "champions" (the massive Foundation Models) through this new, hard test, the results were shocking:
- The Giants Fell: The massive AI models, which were the stars of the old leaderboard, crashed hard. Their accuracy dropped by roughly 50-70% on these new challenges. They were great at memorizing, but terrible at adapting.
- The Underdogs Won: The "old school" methods (Structured Decoders) and simpler neural networks, which use strict biological rules and logic, actually performed much better. They didn't score as high on the easy tests, but they were robust. They could handle the new, weird mazes because they understood the principles of folding, not just the patterns.
The Analogy:
Imagine two students taking a test.
- Student A (Foundation Model): Memorized the answers to 1,000 practice questions. On the practice test, they got 99%. On the real test, where the questions are slightly different, they got only 20% because they don't understand the logic.
- Student B (Structured Decoder): Learned the math formulas behind the questions. On the practice test, they got 85%. On the real test, with new numbers, they still got 80% because they know how to solve the problem.
Why Does This Matter?
- We Were Wrong: We thought AI was ready to design new medicines and understand RNA biology. This paper says, "Not yet. We need to fix how we test them."
- Better Tools: The authors also fixed the computer code used to run these tests. They removed the "padding" glitch, making the tests faster and fairer (like removing the giant pot so the small soup tastes right).
- The Future: To build AI that can truly design RNA drugs, we need models that can generalize—models that can handle the "unknown" and not just the "familiar."
The Takeaway
The paper flips the script: The biggest, flashiest AI models are currently the most fragile. To make real progress in RNA science, we need to stop praising models for memorizing the past and start building models that can survive the future. The "Fair Splits" of CHANRG are the new standard for finding out who is actually the smartest.