CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

This paper introduces CCR-Bench, a novel benchmark designed to rigorously evaluate large language models on complex, real-world industrial tasks involving entangled content-format requirements and intricate logical workflows, revealing significant performance gaps in even state-of-the-art models.

Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng · 2026-03-10 · cs.CL

Reject, Resample, Repeat: Understanding Parallel Reasoning in Language Model Inference

This paper introduces a particle filtering framework to rigorously analyze the accuracy-cost tradeoffs of parallel inference methods in large language models, establishing theoretical guarantees and identifying fundamental limits while demonstrating that sampling error alone does not fully predict final model accuracy.

Noah Golowich, Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich, Dylan J. Foster, Akshay Krishnamurthy · 2026-03-10 · cs.LG

Designing probabilistic AI monsoon forecasts to inform agricultural decision-making

This paper presents a decision-theory framework and a blended AI-statistical forecasting system that successfully delivered skillful, tailored monsoon onset predictions to 38 million Indian farmers in 2025, enabling better agricultural decision-making under uncertainty.

Colin Aitken, Rajat Masiwal, Adam Marchakitus, Katherine Kowal, Mayank Gupta, Tyler Yang, Amir Jina, Pedram Hassanzadeh, William R. Boos, Michael Kremer · 2026-03-10 · cs.LG

EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records

EveryQuery is a novel electronic health record foundation model that achieves efficient, zero-shot clinical prediction by directly estimating outcome likelihoods through task-conditioned pre-training, thereby outperforming computationally expensive autoregressive baselines—particularly for rare events—while currently facing limitations in complex disjunctive reasoning tasks.

Payal Chandak, Gregory Kondas, Isaac Kohane, Matthew McDermott · 2026-03-10 · cs

Rel-MOSS: Towards Imbalanced Relational Deep Learning on Relational Databases

This paper introduces Rel-MOSS, a relation-centric deep learning framework that addresses class imbalance in relational databases. It employs a relation-wise gating controller and a relation-guided minority synthesizer to improve the representation and over-sampling of minority entities, significantly outperforming existing methods on entity classification tasks.

Jun Yin, Peng Huo, Bangguo Zhu, Hao Yan, Senzhang Wang, Shirui Pan, Chengqi Zhang · 2026-03-10 · cs.LG

IMSE: Intrinsic Mixture of Spectral Experts Fine-tuning for Test-Time Adaptation

The paper proposes IMSE, a test-time adaptation method that fine-tunes only the singular values of Vision Transformer linear layers via a spectral mixture of experts and a diversity maximization loss to prevent feature collapse, achieving state-of-the-art performance with significantly fewer trainable parameters.

Sunghyun Baek (Korea Advanced Institute of Science and Technology), Jaemyung Yu (Korea Advanced Institute of Science and Technology), Seunghee Koh (Korea Advanced Institute of Science and Technology), Minsu Kim (LG Energy Solution), Hyeonseong Jeon (LG Energy Solution), Junmo Kim (Korea Advanced Institute of Science and Technology) · 2026-03-10 · cs

ELLMob: Event-Driven Human Mobility Generation with Self-Aligned LLM Framework

This paper introduces ELLMob, a self-aligned Large Language Model framework that leverages Fuzzy-Trace Theory to reconcile habitual patterns with event constraints, addressing the lack of event-annotated datasets and significantly improving the generation of human mobility trajectories during major societal events like typhoons, pandemics, and the Olympics.

Yusong Wang, Chuang Yang, Jiawei Wang, Xiaohang Xu, Jiayi Xu, Dongyuan Li, Chuan Xiao, Renhe Jiang · 2026-03-10 · cs.LG

Advancing Automated Algorithm Design via Evolutionary Stagewise Design with LLMs

This paper introduces EvoStage, an evolutionary paradigm that leverages large language models in a stagewise, multi-agent approach with real-time feedback to overcome the limitations of black-box modeling, generating algorithm designs that outperform both human experts and existing methods on complex industrial tasks such as chip placement and black-box optimization.

Chen Lu, Ke Xue, Chengrui Gao, Yunqi Shi, Siyuan Xu, Mingxuan Yuan, Chao Qian, Zhi-Hua Zhou · 2026-03-10 · cs

Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

This paper introduces HILA, a Human-In-the-Loop Multi-Agent Collaboration framework that employs Dual-Loop Policy Optimization to train agents with metacognitive policies for dynamically deferring to human experts and continuously improving their reasoning capabilities, thereby overcoming the static knowledge limitations of purely autonomous systems.

Wei Yang, Defu Cao, Jiacheng Pang, Muyan Weng, Yan Liu · 2026-03-10 · cs

VORL-EXPLORE: A Hybrid Learning Planning Approach to Multi-Robot Exploration in Dynamic Environments

VORL-EXPLORE is a hybrid learning and planning framework for multi-robot exploration in dynamic environments that couples task allocation with motion execution via a shared navigability fidelity signal, enabling adaptive arbitration between global and reactive policies to prevent bottlenecks and ensure robust, collision-free coverage.

Ning Liu, Sen Shen, Zheng Li, Sheng Liu, Dongkun Han, Shangke Lyu, Thomas Braunl · 2026-03-10 · cs

OSExpert: Computer-Use Agents Learning Professional Skills via Exploration

The paper introduces OSExpert, a computer-use agent that leverages a GUI-based depth-first search exploration algorithm to discover action primitives and self-construct a skill curriculum, thereby significantly improving performance and efficiency on complex tasks to approach human expert levels.

Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, Heng Ji · 2026-03-10 · cs