Imagine you are a detective trying to solve a massive mystery: Which of the thousands of chemicals in our world are dangerous to our bodies' "control centers" (hormone receptors), and which are safe?
This paper is essentially a report card for a team of digital detectives (Artificial Intelligence models) trying to solve this mystery using a giant database called Tox21.
Here is the breakdown of their investigation, explained simply:
1. The Crime Scene: The Tox21 Database
Think of the Tox21 database as a massive library containing test results for nearly 10,000 different chemicals. Scientists have already tested these chemicals to see if they mess with 18 specific "control centers" in our body (called Nuclear Receptors, which regulate things like growth, reproduction, and metabolism).
The researchers took this library and organized it into 43 different case files. Some files had lots of "guilty" chemicals (active), while others had very few.
2. The Suspects: The AI Models
The researchers didn't just use one detective; they pitted three different types of AI against each other to see who was the best at spotting the dangerous chemicals:
- The Traditional Detectives (Machine Learning): These are like old-school detectives who rely on a checklist of physical clues (fingerprints, height, weight). In the AI world, these are "descriptors" and "fingerprints": mathematical summaries of a molecule's structure and properties (see the first sketch after this list).
- The Deep Thinkers (Deep Learning): These models look at the chemical structure like a complex 3D puzzle, trying to understand how the atoms connect and fit together.
- The Super-Readers (Transformers): These are the new kids on the block. They treat a molecule's text notation (its SMILES string) like a sentence in a language, reading the "story" of the molecule to guess its behavior, much as a language model predicts the next word in a sentence (see the tokenization sketch after this list).
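To make the "checklist" idea concrete, here is a minimal sketch, assuming the open-source RDKit library, of how descriptors and fingerprints are typically computed. The example molecule (bisphenol A, a well-known hormone disruptor) and the parameters (radius 2, 2048 bits) are illustrative assumptions, not the paper's exact featurization pipeline.

```python
# A minimal sketch of the "checklist" inputs used by traditional ML models.
# Assumes the open-source RDKit library; parameters are illustrative,
# not the paper's exact setup.
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

# Bisphenol A, a well-known endocrine disruptor, as an example molecule
mol = Chem.MolFromSmiles("CC(C)(c1ccc(O)cc1)c1ccc(O)cc1")

# "Descriptors": global numeric summaries of the molecule
checklist = {
    "mol_weight": Descriptors.MolWt(mol),          # molecular weight
    "logp": Descriptors.MolLogP(mol),              # fat/water solubility
    "h_bond_donors": Descriptors.NumHDonors(mol),
    "h_bond_acceptors": Descriptors.NumHAcceptors(mol),
}

# "Fingerprint": a fixed-length bit vector encoding local substructures
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

print(checklist)
print(f"{fp.GetNumOnBits()} of {fp.GetNumBits()} fingerprint bits set")
```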
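And here is how a Super-Reader turns the same molecule into a "sentence". A sketch assuming the Hugging Face transformers library; the checkpoint seyonec/ChemBERTa-zinc-base-v1 is one publicly available ChemBERTa model, not necessarily the one benchmarked in the paper.

```python
# A sketch of how a transformer treats a SMILES string as a sentence.
# Assumes the Hugging Face "transformers" library; the checkpoint is one
# public ChemBERTa model, not necessarily the paper's exact model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

smiles = "CC(C)(c1ccc(O)cc1)c1ccc(O)cc1"  # bisphenol A again
tokens = tokenizer.tokenize(smiles)

# The "words" of the molecular sentence: atoms, bonds, ring markers
print(tokens)
print(tokenizer(smiles)["input_ids"])  # the integer IDs fed to the model
```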
3. The Investigation: Who Won?
The researchers ran these detectives through all 43 case files. Here is what they found:
- The "Crowded Room" Scenario (High Activity): When a case file had a decent number of dangerous chemicals (more than 10%), the Traditional Detectives (Machine Learning) won. Specifically, models like Random Forest and XGBoost using "descriptors" (the checklist approach) were the most accurate. They were great at finding patterns when there was enough data to learn from.
- The "Sparse Room" Scenario (Low Activity): When a case file had very few dangerous chemicals (between 5% and 10%), the Deep Thinkers (Deep Learning) stepped up. They were more robust and didn't get confused by the lack of data as easily as the others.
- The "Desert Island" Scenario (Very Low Activity): When there were almost no dangerous chemicals (less than 5%), the game became chaotic. No single detective was consistently good. It depended entirely on the specific quirks of that case file.
The Big Surprise: The "Super-Readers" (Transformers like ChemBERTa and MolRAG) didn't win as often as the researchers hoped. While they are powerful, they struggled to beat the simpler, checklist-based models on this specific type of data.
4. The "Why" Behind the Mistakes
Why did the AI sometimes get it wrong? The researchers looked at the "guilty" chemicals that the AI missed.
They found that about 40% of the missed chemicals were "Islands."
- The Analogy: Imagine a map of the world. Most chemicals are in big cities (clusters of similar molecules). The AI learns by looking at the city.
- The Problem: The chemicals the AI missed were like people living on isolated islands with no bridges to the mainland. Because the AI had never seen anything like them in its training data, it couldn't guess they were dangerous. They were too unique.
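A common way to spot such islands, sketched here with RDKit, is to check each new chemical's closest "cousin" in the training set using Tanimoto similarity between fingerprints. The 0.3 cutoff and the toy molecules are illustrative assumptions, not values taken from the paper.

```python
# A sketch of flagging "island" chemicals: molecules whose nearest
# neighbor in the training set is too dissimilar to trust a prediction.
# The 0.3 cutoff is an illustrative assumption, not the paper's value.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles), radius=2, nBits=2048
    )

# Toy training set: small alcohols, phenol, aspirin
train_smiles = ["CCO", "CCCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
train_fps = [fingerprint(s) for s in train_smiles]

def is_island(query_smiles, cutoff=0.3):
    """True if no training molecule is similar enough to the query."""
    query_fp = fingerprint(query_smiles)
    best = max(DataStructs.TanimotoSimilarity(query_fp, fp) for fp in train_fps)
    return best < cutoff

print(is_island("CCCCO"))  # butanol: a close cousin of the training set
print(is_island("Clc1ccc(C(c2ccc(Cl)cc2)C(Cl)(Cl)Cl)cc1"))  # DDT: an island
```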
5. The Real-World Test (External Validation)
To make sure their detectives weren't just good at passing tests but actually useful, they sent them out to the real world. They tested the AI against real-life data on Androgen (male hormone) and Estrogen (female hormone) receptors.
- The Result: The AI did a great job predicting male hormone disruptors (Androgen).
- The Struggle: It struggled a bit more with female hormone disruptors (Estrogen), especially in living animals (in vivo).
- The Reason: The AI was trained on "lab dish" (in vitro) data. Real life is messier; the body has metabolism and other processes that the lab dish doesn't capture. It's like training a driver only on a video game and then asking them to drive in a snowstorm.
6. The Final Verdict
This study is a massive benchmarking report. It tells us:
- Don't use a sledgehammer to crack a nut: When there is enough active data, simpler, well-organized models (classical Machine Learning) work best.
- Watch out for the "Islands": If a chemical is totally unique and has no "cousins" in the training data, even the smartest AI might miss it.
- Context matters: The best model depends on how much data you have and what kind of chemical you are looking at.
In short: This paper helps scientists choose the right tool for the job. It shows that while fancy AI is cool, sometimes the old-school, well-organized checklist approach is still the most reliable way to predict if a chemical will disrupt our hormones. This helps us build better, safer tools to screen chemicals before they ever reach our environment.