cs.AI papers | Gist.Science

Meta-RL Induces Exploration in Language Agents

The paper introduces LaMer, a Meta-RL framework that enhances language agents' ability to actively explore and adapt to novel environments at test time through cross-episode training and in-context policy reflection, significantly outperforming standard RL baselines across diverse tasks.

Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic2026-03-10🤖 cs.LG

ReDepth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting

Re-Depth Anything is a test-time self-supervised framework that enhances monocular depth estimation by fusing foundation models with large-scale 2D diffusion priors to perform label-free refinement via generative re-lighting and Score Distillation Sampling, achieving state-of-the-art results without direct depth tensor optimization.

Ananta R. Bhattarai, Helge Rhodin2026-03-10🤖 cs.LG

Cost Trade-offs of Reasoning and Non-Reasoning Large Language Models in Text-to-SQL

This paper demonstrates that reasoning Large Language Models significantly reduce cloud query execution costs and data consumption compared to non-reasoning models in Text-to-SQL tasks, while revealing that execution time is a poor proxy for cost efficiency and highlighting the substantial financial risks posed by non-reasoning models' tendency to generate inefficient queries.

Saurabh Deochake, Debajyoti Mukhopadhyay2026-03-10💻 cs

Physics-Informed Neural Networks for Device and Circuit Modeling: A Case Study of NeuroSPICE

This paper introduces NeuroSPICE, a physics-informed neural network framework that solves circuit differential-algebraic equations via backpropagation to provide a flexible alternative to conventional SPICE for simulating emerging nonlinear devices and addressing inverse problems, despite not surpassing traditional solvers in raw speed or accuracy.

Chien-Ting Tung, Chenming Hu2026-03-10🔬 physics.app-ph

Toward a Physical Theory of Intelligence

This paper introduces the Conservation-Congruent Encoding (CCE) framework, a unified physical theory that defines intelligence as an irreversible process of extracting work while minimizing dissipation, thereby deriving universal computational bounds and linking thermodynamic measurement, quantum decoherence, and spacetime geometry to establish substrate-neutral constraints for both natural and artificial intelligence.

Peter David Fagan2026-03-10💻 cs

Reliable Grid Forecasting: State Space Models for Safety-Critical Energy Systems

This paper introduces an operator-legible evaluation framework centered on under-prediction risk to demonstrate that standard accuracy metrics fail to capture safety-critical grid forecasting needs, revealing that while explicit weather integration improves reliability, unconstrained probabilistic models often induce "fake safety" through excessive inflation, a problem solved by new Bias/OPR-constrained objectives.

Sunki Hong, Jisoo Lee2026-03-10⚡ eess

DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

This paper introduces DrivingGen, the first comprehensive benchmark for generative driving world models that addresses the lack of rigorous evaluation by combining a diverse dataset with a novel suite of metrics to assess visual realism, trajectory plausibility, temporal coherence, and controllability, thereby revealing critical trade-offs in current state-of-the-art models.

Yang Zhou, Hao Shao, Letian Wang, Zhuofan Zong, Hongsheng Li, Steven L. Waslander2026-03-10💻 cs

Batch-of-Thought: Cross-Instance Learning for Enhanced LLM Reasoning

This paper introduces Batch-of-Thought (BoT), a training-free method that enhances Large Language Model reasoning by jointly processing related queries to leverage cross-instance signals, thereby improving accuracy, calibration, and computational efficiency through a multi-agent reflection architecture.

Xuan Yang, Furong Jia, Roy Xie, Xiong Xi, Hengwei Bian, Jian Li, Monica Agrawal2026-03-10💻 cs

NC-Bench: An LLM Benchmark for Evaluating Conversational Competence

NC-Bench introduces a theory-grounded benchmark that evaluates the conversational competence of large language models by assessing their ability to manage the form and structure of natural interactions across basic, retrieval-augmented, and complex multi-turn scenarios, revealing that while models excel at basic answering, they struggle significantly with repair and complex sequence management tasks.

Robert J. Moore, Sungeun An, Farhan Ahmed, Jay Pankaj Gala2026-03-10💬 cs.CL

The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor

This paper audits the LAION-Aesthetics Predictor to reveal how its algorithmic gaze reinforces Western, male, and imperial biases by disproportionately filtering content and prioritizing specific cultural aesthetics, ultimately urging a shift toward pluralistic evaluation methods in AI development.

Jordan Taylor, William Agnew, Maarten Sap, Sarah E. Fox, Haiyi Zhu2026-03-10💻 cs

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

This paper introduces "Single-Shot Planning," a secure architecture for Computer Use Agents that generates a complete, trusted execution graph before observing untrusted UI states to effectively mitigate prompt injection and branch steering attacks while maintaining competitive task performance.

Hanna Foerster, Tom Blanchard, Kristina Nikolic, Ilia Shumailov, Cheng Zhang, Robert Mullins, Nicolas Papernot, Florian Tramèr, Yiren Zhao2026-03-10💻 cs

BoxMind: Closed-loop AI strategy optimization for elite boxing validated in the 2024 Olympics

This paper introduces BoxMind, a closed-loop AI system that transforms unstructured boxing footage into hierarchical tactical indicators and predictive gradients to generate expert-level strategic recommendations, which were validated during the 2024 Paris Olympics by contributing to the Chinese National Team's historic medal success.

Kaiwen Wang, Kaili Zheng, Rongrong Deng, Qingmin Fan, Milin Zhang, Zongrui Li, Xuesi Zhou, Bo Han, Liren Chen, Chenyi Guo, Ji Wu2026-03-10💻 cs

Multifaceted Scenario-Aware Hypergraph Learning for Next POI Recommendation

This paper proposes the Multifaceted Scenario-Aware Hypergraph Learning (MSAHG) framework, which addresses the limitations of existing methods in handling mobility variations across distinct contexts by constructing scenario-specific disentangled sub-hypergraphs and employing a parameter-splitting mechanism to resolve inter-scenario conflicts, thereby significantly improving next POI recommendation performance.

Yuxi Lin, Yongkang Li, Jie Xing, Zipei Fan2026-03-10💻 cs

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a realistic, telemetry-driven benchmark comprising 1,800 instances across six languages that evaluates LLMs on code completion tasks with a focus on ecological validity, contamination-free assessment, and detailed diagnostic insights to guide practical model selection and development.

Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu2026-03-10🤖 cs.LG

MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

This paper introduces MAS-Orchestra, a training-time framework that optimizes multi-agent system orchestration via function-calling reinforcement learning, alongside the MASBENCH benchmark, to demonstrate that multi-agent benefits are task-dependent and to achieve significant performance gains with over 10x efficiency on complex reasoning tasks.

Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen, Prathyusha Jwalapuram, Jiayu Wang, Semih Yavuz, Caiming Xiong, Shafiq Joty2026-03-10💬 cs.CL

Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents

This paper introduces the Determinism-Faithfulness Assurance Harness (DFAH), a framework and set of financial benchmarks demonstrating that decision determinism and task accuracy in LLM agents are uncorrelated, thereby necessitating independent measurement to ensure reliable regulatory audit replay in financial services.

Raffi Khatchadourian2026-03-10💬 cs.CL

Continuous-Flow Data-Rate-Aware CNN Inference on FPGA

This paper proposes a novel data-rate-aware continuous-flow architecture for CNN inference on FPGAs that mitigates hardware underutilization caused by data reduction in pooling and strided convolution layers by interleaving signals and sharing resources, thereby enabling the high-throughput implementation of complex models like MobileNet on a single device.

Tobias Habermann, Michael Mecik, Zhenyu Wang, César David Vera, Martin Kumm, Mario Garrido2026-03-10🤖 cs.LG

MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference

MeanCache is a training-free framework that accelerates Flow Matching inference by replacing instantaneous velocity caching with an average-velocity approach using cached Jacobian-vector products and a trajectory-stability scheduling strategy, achieving significant speedups (up to 4.56X) while maintaining high generation quality across models like FLUX.1 and HunyuanVideo.

Huanlin Gao, Ping Chen, Fuyuan Shi, Ruijia Wu, Li YanTao, Qiang Hui, Yuren You, Ting Lu, Chao Tan, Shaoan Zhao, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian2026-03-10🤖 cs.LG

BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

This paper introduces BioAgent Bench, a comprehensive evaluation suite and dataset for assessing AI agents in bioinformatics, which reveals that while frontier models can reliably construct multi-step pipelines, they lack robustness against perturbations and may be unsuitable for privacy-sensitive applications compared to open-weight alternatives.

Dionizije Fa, Marko Čuljak, Bruno Pandža, Mateo Čupic2026-03-10💻 cs

RedSage: A Cybersecurity Generalist LLM

The paper introduces RedSage, an open-source, locally deployable cybersecurity LLM trained on a massive curated dataset and agentic augmentation pipeline, which achieves state-of-the-art performance on specialized cybersecurity benchmarks while also improving general reasoning capabilities.

Naufal Suryanto, Muzammal Naseer, Pengfei Li, Syed Talal Wasim, Jinhui Yi, Juergen Gall, Paolo Ceravolo, Ernesto Damiani2026-03-10💬 cs.CL

← Previous Next →