cs.AI papers | Gist.Science

Expert Evaluation of LLM World Models: A High-$T_c$ Superconductivity Case Study

This study evaluates the ability of six LLM-based systems to answer expert-level questions about high-temperature superconductivity using a curated database of 1,726 papers. It finds that retrieval-augmented generation (RAG) systems outperform closed models in providing comprehensive, well-supported answers, and it highlights both the potential and the current limitations of LLMs in specialized scientific domains.

Haoyu Guo, Maria Tikhanovskaya, Paul Raccuglia + 20 more | 2026-03-12 | 🤖 cs.AI

DeepEyesV2: Toward Agentic Multimodal Model

This paper introduces DeepEyesV2, an agentic multimodal model that employs a two-stage training pipeline combining cold-start data curation and reinforcement learning to effectively integrate external tools like code execution and web search for complex real-world reasoning tasks.

Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu | 2026-03-12 | 🤖 cs.AI

What We Don't C: Manifold Disentanglement for Structured Discovery

The paper introduces "What We Don't C," a novel latent flow matching approach that disentangles latent subspaces by explicitly removing information from conditional guidance to generate meaningful residual representations, thereby enabling the discovery and analysis of factors of variation not captured in the conditioning variables.

Brian Rogers, Micah Bowles, Chris J. Lintott, Steve Croft, Oliver N. F. King, James Kostas Ray | 2026-03-12 | 🤖 cs.AI

D-GAP: Improving Out-of-Domain Robustness via Dataset-Agnostic and Gradient-Guided Augmentation in Frequency and Pixel Spaces

The paper proposes D-GAP, a dataset-agnostic and gradient-guided augmentation method that adaptively blends frequency amplitudes and pixel values to reduce domain-specific learning biases and restore spatial details, thereby significantly improving out-of-domain robustness in computer vision models.

Ruoqi Wang, Haitao Wang, Shaojie Guo, Qiong Luo | 2026-03-12 | 🤖 cs.AI

STREAM-VAE: Dual-Path Routing for Slow and Fast Dynamics in Vehicle Telemetry Anomaly Detection

This paper introduces STREAM-VAE, a dual-path variational autoencoder that separates slow drifts and fast spikes in vehicle telemetry data to overcome the limitations of standard reconstruction-based methods and achieve robust anomaly detection across diverse operating modes.

Kadir-Kaan Özer, René Ebeling, Markus Enzweiler | 2026-03-12 | 🤖 cs.LG

REMSA: Foundation Model Selection for Remote Sensing via a Constraint-Aware Agent

This paper introduces REMSA, a constraint-aware agent built upon the newly constructed RSFM Database (RS-FMD) that automates the selection of suitable remote sensing foundation models from natural language queries by integrating structured metadata retrieval with task-driven decision workflows, achieving superior performance over baselines in a novel expert-verified benchmark.

Binger Chen, Tacettin Emre Bök, Behnood Rasti, Volker Markl, Begüm Demir | 2026-03-12 | 🤖 cs.AI

Hierarchical Dual-Strategy Unlearning for Biomedical and Healthcare Intelligence Using Imperfect and Privacy-Sensitive Medical Data

This paper proposes a hierarchical dual-strategy framework that achieves precise selective unlearning of privacy-sensitive medical knowledge in large language models while preserving fundamental competencies, demonstrated by high forgetting and preservation rates on clinical datasets with minimal parameter modification.

Yi Zhang, Chao Zhang, Zijian Li, Tianxiang Xu, Kunyu Zhang, Zhan Gao, Meinuo Li, Xiaohan Zhang, Qichao Qi, Bing Chen | 2026-03-12 | 🤖 cs.LG

CostNav: A Navigation Benchmark for Real-World Economic-Cost Evaluation of Physical AI Agents

This paper introduces CostNav, the first physics-grounded navigation benchmark that evaluates autonomous agents using real-world economic data. It finds that current methods, despite varying in hardware and architecture, all fail to achieve economic viability because their contribution margins are negative.

Haebin Seong, Sungmin Kim, Yongjun Cho, Myunchul Joe, Geunwoo Kim, Yubeen Park, Sunhoo Kim, Yoonshik Kim, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Jinmyung Kwak, Sunghee Ahn, Jaemin Lee, Younggil Do, Seungyeop Yi, Woojin Cheong, Minhyeok Oh, Minchan Kim, Seongjae Kang, Samwoo Seong, Youngjae Yu, Yunsung Lee | 2026-03-12 | 🤖 cs.AI

IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch

This paper introduces IndiMathBench, a human-verified benchmark of 312 formal Lean 4 theorems derived from Indian Mathematics Olympiads, which utilizes an AI-powered human-assisted pipeline to address the scarcity of high-quality training data and reveals significant challenges in current autoformalization and theorem proving capabilities.

Param Biyani, Shashank Kirtania, Yasharth Bajpai, Sumit Gulwani, Ashish Tiwari | 2026-03-12 | 🤖 cs.AI

World Models That Know When They Don't Know - Controllable Video Generation with Calibrated Uncertainty

This paper proposes C3, a novel uncertainty quantification method that trains controllable video models to generate high-resolution, calibrated confidence heatmaps at the subpatch level by estimating uncertainty in latent space and using strictly proper scoring rules, thereby enabling reliable hallucination detection and out-of-distribution identification for robotics applications.

Zhiting Mei, Tenny Yin, Micah Baker, Ola Shorinwa, Anirudha Majumdar | 2026-03-12 | 🤖 cs.AI

Toward Closed-loop Molecular Discovery via Language Model, Property Alignment and Strategic Search

The paper introduces Trio, a closed-loop molecular generation framework that integrates fragment-based language modeling, reinforcement learning, and Monte Carlo tree search to produce chemically valid, diverse, and pharmacologically optimized ligands with significantly improved binding affinity, drug-likeness, and synthetic accessibility compared to state-of-the-art methods.

Junkai Ji, Zhangfan Yang, Dong Xu, Ruibin Bai, Jianqiang Li, Tingjun Hou, Zexuan Zhu | 2026-03-12 | 🤖 cs.AI

Maximum Risk Minimization with Random Forests

This paper introduces computationally efficient and statistically consistent Random Forest variants that minimize the maximum risk across diverse environments to improve out-of-distribution generalization, offering novel guarantees for mean squared error, negative reward, and regret-based risks.

Francesco Freni, Anya Fries, Linus Kühne, Markus Reichstein, Jonas Peters | 2026-03-12 | 📊 stat

GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

GTR-Turbo is a highly efficient training method for multi-modal agents that eliminates the need for costly external teacher models by using merged checkpoints from ongoing reinforcement learning as a "free" teacher, thereby improving accuracy by 10–30% while reducing training time and compute costs by 50% and 60%, respectively.

Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye | 2026-03-12 | 🤖 cs.AI

Pretrained battery transformer (PBT): A foundation model for universal battery life prediction

This paper introduces the Pretrained Battery Transformer (PBT), a foundation model that leverages battery-knowledge-encoded mixture-of-experts layers to overcome data scarcity and heterogeneity, achieving state-of-the-art universal battery life prediction across diverse chemistries and conditions.

Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang | 2026-03-12 | 🤖 cs.LG

Enhancing Tree Species Classification: Insights from YOLOv8 and Explainable AI Applied to TLS Point Cloud Projections

This paper presents a framework using YOLOv8 and Finer-CAM to achieve 96% accuracy in classifying seven European tree species from TLS 3D point clouds while demonstrating that the model's decisions are interpretable and primarily rely on crown features for most species, with stems playing a more significant role for others.

Adrian Straker, Paul Magdon, Marco Zullich, Maximilian Freudenberg, Christoph Kleinn, Johannes Breidenbach, Stefano Puliti, Nils Noelke | 2026-03-12 | 🤖 cs.AI

The Bayesian Geometry of Transformer Attention

This paper introduces "Bayesian wind tunnels" to rigorously demonstrate that small transformers perform exact Bayesian inference through a specific geometric mechanism involving residual streams as belief substrates and attention-based routing, a capability that capacity-matched MLPs fundamentally lack.

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra | 2026-03-12 | 📊 stat

Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

This paper provides a first-order analysis demonstrating that cross-entropy training in transformers induces a coupled specialization of attention routing and value updates—functioning as a two-timescale EM procedure—that sculpts low-dimensional Bayesian manifolds, thereby explaining how gradient-based optimization enables precise probabilistic reasoning.

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra | 2026-03-12 | 📊 stat

Geometric Scaling of Bayesian Inference in LLMs

This paper demonstrates that production-grade language models preserve a low-dimensional geometric substrate, specifically an entropy-aligned axis in their last-layer value representations, which encodes Bayesian posterior structures and serves as a privileged readout for uncertainty, even though it is not a singular computational bottleneck for inference.

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra | 2026-03-12 | 🤖 cs.LG

Over-Searching in Search-Augmented Large Language Models

This paper systematically evaluates the phenomenon of "over-searching" in search-augmented large language models, where unnecessary tool invocation harms efficiency and accuracy, and proposes the Tokens Per Correctness (TPC) metric along with mitigation strategies to address this issue.

Roy Xie, Deepak Gopinath, David Qiu, Dong Lin, Haitian Sun, Saloni Potdar, Bhuwan Dhingra | 2026-03-12 | 🤖 cs.LG

Burn-After-Use for Preventing Data Leakage through a Secure Multi-Tenant Architecture in Enterprise LLM

This paper proposes a Secure Multi-Tenant Architecture (SMTA) combined with a novel Burn-After-Use (BAU) mechanism to effectively prevent data leakage in enterprise LLMs by enforcing strict instance isolation and ephemeral context destruction, achieving high defense success rates against both semantic leakage attacks and post-session persistence threats in experimental evaluations.

Qiang Zhang, Elena Emma Wang, Jiaming Li, Xichun Wang | 2026-03-12 | 🤖 cs.AI

