Quantifying Memorization and Privacy Risks in Genomic Language Models

This paper introduces a comprehensive multi-vector privacy evaluation framework that quantifies memorization risks in Genomic Language Models by integrating perplexity-based detection, canary sequence extraction, and membership inference, revealing that these models exhibit measurable data leakage dependent on architecture and training dynamics.

Alexander Nemecek, Wenbiao Li, Xiaoqian Jiang, Jaideep Vaidya, Erman Ayday | Wed, 11 Ma | 🤖 cs.LG

Controllable Sequence Editing for Biological and Clinical Trajectories

This paper introduces CLEF, a controllable sequence editing framework that learns temporal concepts to precisely target the timing and scope of interventions in longitudinal data, significantly outperforming state-of-the-art baselines in generating accurate and realistic counterfactual trajectories for biological and clinical applications.

Michelle M. Li, Kevin Li, Yasha Ektefaie, Ying Jin, Yepeng Huang, Shvat Messica, Tianxi Cai, Marinka Zitnik | Tue, 10 Ma | 🤖 cs.LG

How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences

This study demonstrates that DNA foundation models (DNABERT-2, Evo 2, and NTv2) are vulnerable to model inversion attacks, where adversaries can reconstruct sensitive genomic sequences from shared embeddings with high accuracy, particularly for shorter sequences and per-token representations, thereby highlighting critical privacy risks in Embeddings-as-a-Service frameworks.

Sofiane Ouaari, Jules Kreuer, Nico Pfeifer | Tue, 10 Ma | 🤖 cs.LG

Controlling the joint local false discovery rate is more powerful than meta-analysis methods in joint analysis of summary statistics from multiple genome-wide association studies

This paper proposes a novel summary-statistics-based joint analysis method that controls the joint local false discovery rate (Jlfdr), demonstrating through simulations and empirical data that it offers superior power over traditional meta-analysis methods, particularly when analyzing heterogeneous genome-wide association study datasets.

Wei Jiang, Weichuan Yu | Thu, 12 Ma | 📊 stat

pHapCompass: Probabilistic Assembly and Uncertainty Quantification of Polyploid Haplotype Phase

The paper introduces pHapCompass, a probabilistic algorithm for assembling and quantifying uncertainty in polyploid haplotypes that addresses read assignment ambiguity through graph-theoretic inference, while also providing a realistic simulation workflow and generalized evaluation metrics to demonstrate its competitive performance against existing assemblers.

Marjan Hosseini (School of Computing, University of Connecticut), Ella Veiner (School of Computing, University of Connecticut), Thomas Bergendahl (School of Computing, University of Connecticut), Tala Yasenpoor (School of Computing, University of Connecticut), Zane Smith (Department of Entomology and Plant Pathology, University of Tennessee), Margaret Staton (Department of Entomology and Plant Pathology, University of Tennessee), Derek Aguiar (School of Computing, University of Connecticut; Institute for Systems Genomics, University of Connecticut) | Thu, 12 Ma | 🧬 q-bio

SNPgen: Phenotype-Supervised Genotype Representation and Synthetic Data Generation via Latent Diffusion

SNPgen is a two-stage conditional latent diffusion framework that generates privacy-preserving, phenotype-aligned synthetic genotype data, enabling machine learning models trained on synthetic samples to achieve predictive performance comparable to those trained on real data while maintaining strict privacy guarantees and preserving key genetic structures.

Andrea Lampis, Michela Carlotta Massi, Nicola Pirastu, Francesca Ieva, Matteo Matteucci, Emanuele Di Angelantonio | Thu, 12 Ma | 🧬 q-bio

Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals

This paper introduces a novel three-stage mechanistic interpretability method that extracts a compact, high-performing hematopoietic algorithm directly from the internal attention weights of the scGPT foundation model, achieving superior zero-shot classification and pseudotime ordering on independent datasets with significantly fewer parameters and training time than standard probing or retraining approaches.

Ihor Kendiukhov | Thu, 12 Ma | 🧬 q-bio

TrinityDNA: A Bio-Inspired Foundational Model for Efficient Long-Sequence DNA Modeling

TrinityDNA is a novel, bio-inspired foundational model that integrates structural feature capture, symmetry handling, multi-scale attention, and evolutionary training to efficiently model long DNA sequences, significantly advancing gene function prediction and regulatory discovery while introducing a new long-sequence CDS annotation benchmark.

Qirong Yang, Yucheng Guo, Zicheng Liu, Yujie Yang, Qijin Yin, Siyuan Li, Shaomin Ji, Linlin Chao, Xiaoming Zhang, Stan Z. Li | Mon, 09 Ma | 💻 cs

Machine Learning for analysis of Multiple Sclerosis cross-tissue bulk and single-cell transcriptomics data

This study presents an end-to-end machine learning pipeline utilizing XGBoost and SHAP explainability to integrate bulk and single-cell transcriptomic data from multiple sclerosis patients, successfully identifying high-performance biomarkers and novel mechanistic pathways involving immune activation, non-canonical checkpoints, and Epstein-Barr virus-related processes.

Francesco Massafra, Samuele Punzo, Silvia Giulia Galfré, Alessandro Maglione, Simone Pernice, Stefano Forti, Simona Rolla, Marco Beccuti, Marinella Clerico, Corrado Priami, Alina Sîrbu | Mon, 09 Ma | 🤖 cs.LG

Benchmarking 80 binary phenotypes from the openSNP dataset using deep learning algorithms and polygenic risk score tools

This study benchmarks the performance of 29 machine learning algorithms, 80 deep learning models, and 3 polygenic risk score tools across 80 binary phenotypes from the openSNP dataset, revealing that machine learning approaches outperformed traditional tools for 44 phenotypes while polygenic risk scores were superior for the remaining 36.

Muhammad Muneeb, David B. Ascher, YooChan Myung + 2 more | 2026-03-10 | 🧬 q-bio

DeeDeeExperiment: Building an infrastructure for integrating and managing omics data analysis results in R/Bioconductor

The paper introduces DeeDeeExperiment, a new S4 class within the Bioconductor ecosystem that extends SingleCellExperiment to provide a standardized, reproducible infrastructure for storing, managing, and contextualizing differential expression and functional enrichment analysis results alongside their metadata.

Najla Abassi, Lea Schwarz, Edoardo Filippi + 1 more | 2026-03-10 | 🧬 q-bio