q-bio.GN papers | Gist.Science

Quantifying Memorization and Privacy Risks in Genomic Language Models

This paper introduces a comprehensive multi-vector privacy evaluation framework that quantifies memorization risks in Genomic Language Models by integrating perplexity-based detection, canary sequence extraction, and membership inference, revealing that these models exhibit measurable data leakage dependent on architecture and training dynamics.

Alexander Nemecek, Wenbiao Li, Xiaoqian Jiang, Jaideep Vaidya, Erman Ayday | Wed, 11 Ma | 🤖 cs.LG
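
As a rough illustration of the perplexity-based detection mentioned in the summary above, the sketch below computes sequence perplexity from per-token log-probabilities. It is not the paper's implementation: the function name `perplexity` and the toy log-probabilities are assumptions made for illustration, and a real evaluation would obtain the log-probabilities from the genomic language model under test.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities (hypothetical helper)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Toy inputs (assumed, not from the paper):
# a model that guesses uniformly over A/C/G/T versus one that is nearly certain.
uniform_guess = [math.log(0.25)] * 10
near_certain = [math.log(0.99)] * 10

ppl_uniform = perplexity(uniform_guess)
ppl_certain = perplexity(near_certain)
```

An unusually low perplexity on a training sequence, relative to comparable held-out sequences, is the kind of signal such a detector would look for; the threshold that counts as "unusually low" is left open here.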

Controllable Sequence Editing for Biological and Clinical Trajectories

This paper introduces CLEF, a controllable sequence editing framework that learns temporal concepts to precisely target the timing and scope of interventions in longitudinal data, significantly outperforming state-of-the-art baselines in generating accurate and realistic counterfactual trajectories for biological and clinical applications.

Michelle M. Li, Kevin Li, Yasha Ektefaie, Ying Jin, Yepeng Huang, Shvat Messica, Tianxi Cai, Marinka Zitnik | Tue, 10 Ma | 🤖 cs.LG

How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences

This study demonstrates that DNA foundation models (DNABERT-2, Evo 2, and NTv2) are vulnerable to model inversion attacks, where adversaries can reconstruct sensitive genomic sequences from shared embeddings with high accuracy, particularly for shorter sequences and per-token representations, thereby highlighting critical privacy risks in Embeddings-as-a-Service frameworks.

Sofiane Ouaari, Jules Kreuer, Nico Pfeifer | Tue, 10 Ma | 🤖 cs.LG

Adversarial Domain Adaptation Enables Knowledge Transfer Across Heterogeneous RNA-Seq Datasets

This study proposes an adversarial deep learning framework that enables effective knowledge transfer across heterogeneous RNA-seq datasets by learning a domain-invariant latent space, thereby significantly improving cancer and tissue type classification accuracy, especially in low-data scenarios.

Kevin Dradjat, Massinissa Hamidi, Blaise Hanczar | Tue, 10 Ma | 🤖 cs.LG

Controlling the joint local false discovery rate is more powerful than meta-analysis methods in joint analysis of summary statistics from multiple genome-wide association studies

This paper proposes a novel summary-statistics-based joint analysis method that controls the joint local false discovery rate (Jlfdr), demonstrating through simulations and empirical data that it offers superior power over traditional meta-analysis methods, particularly when analyzing heterogeneous genome-wide association study datasets.

Wei Jiang, Weichuan Yu | Thu, 12 Ma | 📊 stat
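
A local false discovery rate is conventionally the posterior probability that a test is null given its observed statistic. The sketch below applies that idea jointly to z-scores from several studies under a two-group mixture. It is an illustration only, not the authors' method: the function names, the independent unit-variance Gaussian alternative with known means `alt_means`, and the null proportion `pi0` are all assumptions, whereas in practice these quantities would be estimated from the data.

```python
import math

def normal_pdf(x, mean=0.0):
    """Density of a unit-variance normal distribution at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def joint_lfdr(z_scores, alt_means, pi0):
    """Posterior null probability for one SNP given z-scores from several studies.

    Hypothetical two-group model: under the null every z-score is N(0, 1);
    under the alternative the z-score of study k is N(alt_means[k], 1).
    Studies are assumed independent.
    """
    if len(z_scores) != len(alt_means):
        raise ValueError("one alternative mean is needed per study")
    f0 = math.prod(normal_pdf(z) for z in z_scores)
    f1 = math.prod(normal_pdf(z, m) for z, m in zip(z_scores, alt_means))
    return pi0 * f0 / (pi0 * f0 + (1.0 - pi0) * f1)

# Toy values (assumed): two studies, 90% of SNPs null.
weak = joint_lfdr([0.0, 0.0], [2.0, 2.0], pi0=0.9)
midway = joint_lfdr([1.0, 1.0], [2.0, 2.0], pi0=0.9)
strong = joint_lfdr([4.0, 4.0], [2.0, 2.0], pi0=0.9)
```

SNPs with the smallest joint values would be reported first. Because the score is computed from the full vector of z-scores, it does not have to collapse the studies into a single pooled statistic the way a meta-analysis does.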

Estimating Reproducibility in Genome-Wide Association Studies

This paper proposes two probabilistic measures, Reproducibility Rate (RR) and False Irreproducibility Rate (FIR), to quantitatively evaluate the behavior of primary positive associations in replication studies, offering tools to guide study design and identify potentially true associations among irreproducible findings.

Wei Jiang, Jing-Hao Xue, Weichuan Yu | Thu, 12 Ma | 📊 stat

pHapCompass: Probabilistic Assembly and Uncertainty Quantification of Polyploid Haplotype Phase

The paper introduces pHapCompass, a probabilistic algorithm for assembling and quantifying uncertainty in polyploid haplotypes that addresses read assignment ambiguity through graph-theoretic inference, while also providing a realistic simulation workflow and generalized evaluation metrics to demonstrate its competitive performance against existing assemblers.

Marjan Hosseini (School of Computing, University of Connecticut), Ella Veiner (School of Computing, University of Connecticut), Thomas Bergendahl (School of Computing, University of Connecticut), Tala Yasenpoor (School of Computing, University of Connecticut), Zane Smith (Department of Entomology and Plant Pathology, University of Tennessee), Margaret Staton (Department of Entomology and Plant Pathology, University of Tennessee), Derek Aguiar (School of Computing, University of Connecticut; Institute for Systems Genomics, University of Connecticut) | Thu, 12 Ma | 🧬 q-bio

Continuous Diffusion Transformers for Designing Synthetic Regulatory Elements

This paper introduces a parameter-efficient Diffusion Transformer (DiT) with a 2D CNN encoder that generates high-quality, cell-type-specific synthetic regulatory DNA sequences with significantly faster convergence, reduced memorization, and enhanced regulatory activity compared to existing U-Net-based models.

Jonathan Liu, Kia Ghods | Thu, 12 Ma | 🧬 q-bio

SNPgen: Phenotype-Supervised Genotype Representation and Synthetic Data Generation via Latent Diffusion

SNPgen is a two-stage conditional latent diffusion framework that generates privacy-preserving, phenotype-aligned synthetic genotype data, enabling machine learning models trained on synthetic samples to achieve predictive performance comparable to those trained on real data while maintaining strict privacy guarantees and preserving key genetic structures.

Andrea Lampis, Michela Carlotta Massi, Nicola Pirastu, Francesca Ieva, Matteo Matteucci, Emanuele Di Angelantonio | Thu, 12 Ma | 🧬 q-bio

Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals

This paper introduces a novel three-stage mechanistic interpretability method that extracts a compact, high-performing hematopoietic algorithm directly from the internal attention weights of the scGPT foundation model, achieving superior zero-shot classification and pseudotime ordering on independent datasets with significantly fewer parameters and training time than standard probing or retraining approaches.

Ihor Kendiukhov | Thu, 12 Ma | 🧬 q-bio

Omics Data Discovery Agents

This paper presents an agentic framework leveraging large language models and containerized tools to automatically retrieve, extract, and re-analyze omics data from biomedical literature, thereby transforming static publications into a scalable, executable resource for automated data reuse and cross-study discovery.

Alexandre Hutton, Jesse G. Meyer | Thu, 12 Ma | 🧬 q-bio

TrinityDNA: A Bio-Inspired Foundational Model for Efficient Long-Sequence DNA Modeling

TrinityDNA is a novel, bio-inspired foundational model that integrates structural feature capture, symmetry handling, multi-scale attention, and evolutionary training to efficiently model long DNA sequences, significantly advancing gene function prediction and regulatory discovery while introducing a new long-sequence CDS annotation benchmark.

Qirong Yang, Yucheng Guo, Zicheng Liu, Yujie Yang, Qijin Yin, Siyuan Li, Shaomin Ji, Linlin Chao, Xiaoming Zhang, Stan Z. Li | Mon, 09 Ma | 💻 cs

What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses

Through an autonomous AI-driven screening of 141 hypotheses, this study demonstrates that biological foundation models like scGPT and Geneformer learn genuine, shared geometric and topological structures in their internal representations that are biologically meaningful yet more localized to specific tissues like immune cells than previously assumed.

Ihor Kendiukhov | Mon, 09 Ma | 🤖 cs.LG

Validating Interpretability in siRNA Efficacy Prediction: A Perturbation-Based, Dataset-Aware Protocol

This paper introduces a perturbation-based validation protocol to ensure the faithfulness of saliency maps in siRNA efficacy prediction, revealing critical failure modes across datasets and proposing a biology-informed regularizer to enhance the reliability of explanation-guided therapeutic design.

Zahra Khodagholi, Niloofar Yousefi | Mon, 09 Ma | 🤖 cs.LG

Machine Learning for analysis of Multiple Sclerosis cross-tissue bulk and single-cell transcriptomics data

This study presents an end-to-end machine learning pipeline utilizing XGBoost and SHAP explainability to integrate bulk and single-cell transcriptomic data from multiple sclerosis patients, successfully identifying high-performance biomarkers and novel mechanistic pathways involving immune activation, non-canonical checkpoints, and Epstein-Barr virus-related processes.

Francesco Massafra, Samuele Punzo, Silvia Giulia Galfré, Alessandro Maglione, Simone Pernice, Stefano Forti, Simona Rolla, Marco Beccuti, Marinella Clerico, Corrado Priami, Alina Sîrbu | Mon, 09 Ma | 🤖 cs.LG

LA-MARRVEL: A Knowledge-Grounded, Language-Aware LLM Framework for Clinically Robust Rare Disease Gene Prioritization

LA-MARRVEL is a knowledge-grounded, language-aware LLM framework that significantly improves rare disease gene prioritization accuracy by using structured, phenotype-rich prompts to generate clinically robust, ACMG-aligned reasoning without disrupting existing diagnostic pipelines.

Jaeyeon Lee, Lin Yao, Hyun-Hwan Jeong, Zhandong Liu | Mon, 09 Ma | 🤖 cs.AI

Identifying genes associated with phenotypes using machine and deep learning

This paper proposes a machine and deep learning pipeline that classifies individuals based on genotype data and utilizes feature importance to identify phenotype-associated genes, demonstrating that SNPs selected by high-performing models effectively prioritize disease-related genetic markers with a mean gene identification ratio of 0.84.

Muhammad Muneeb, David B. Ascher, YooChan Myung | 2026-03-10 | 🧬 q-bio

Benchmarking 80 binary phenotypes from the openSNP dataset using deep learning algorithms and polygenic risk score tools

This study benchmarks the performance of 29 machine learning algorithms, 80 deep learning models, and 3 polygenic risk score tools across 80 binary phenotypes from the openSNP dataset, revealing that machine learning approaches outperformed traditional tools for 44 phenotypes while polygenic risk scores were superior for the remaining 36.

Muhammad Muneeb, David B. Ascher, YooChan Myung + 2 more | 2026-03-10 | 🧬 q-bio

DeeDeeExperiment: Building an infrastructure for integrating and managing omics data analysis results in R/Bioconductor

The paper introduces DeeDeeExperiment, a new S4 class within the Bioconductor ecosystem that extends SingleCellExperiment to provide a standardized, reproducible infrastructure for storing, managing, and contextualizing differential expression and functional enrichment analysis results alongside their metadata.

Najla Abassi, Lea Schwarz, Edoardo Filippi + 1 more | 2026-03-10 | 🧬 q-bio

Partial domain adaptation enables cross domain cell type annotation between scRNA-seq and snRNA-seq

The paper introduces ScNucAdapt, a partial domain adaptation method that enables robust and accurate cross-domain cell type annotation between scRNA-seq and snRNA-seq datasets, outperforming existing approaches by addressing distributional and compositional differences.

Xiran Chen, Quan Zou, Qinyu Cai + 3 more | 2026-03-10 | 🧬 q-bio