cs.AI papers | Gist.Science

A Consequentialist Critique of Binary Classification Evaluation: Theory, Practice, and Tools

This paper critiques the prevalent reliance on fixed-threshold metrics in machine learning evaluation by advocating for a consequentialist framework that prioritizes proper scoring rules like the Brier score, supported by a new decision-theoretic mapping, a practical Python package called `briertools`, and a clipped Brier score variant to bridge the gap between theoretical utility and current practices.

Gerardo Flores, Abigail Schiff, Alyssa H. Smith, Julia A Fukuyama, Ashia C. Wilson2026-03-11🤖 cs.AI

MCP Bridge: A Lightweight, LLM-Agnostic RESTful Proxy for Model Context Protocol Servers

This paper introduces MCP Bridge, a lightweight, LLM-agnostic RESTful proxy that enables Model Context Protocol servers to run in resource-constrained environments with enhanced security, while also presenting a fine-tuned Qwen3 model that achieves state-of-the-art performance on the MCPToolBench++ benchmark through advanced reinforcement learning techniques.

Arash Ahmadi, Sarah Sharif, Yaser M. Banad2026-03-11🤖 cs.AI

Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO

This paper introduces Stepwise Guided Policy Optimization (SGPO), a framework that enhances Group Relative Policy Optimization (GRPO) by utilizing a step-wise judge model to provide learning signals from all-negative sample groups, thereby enabling large language models to learn from incorrect reasoning and improving performance across various reasoning benchmarks.

Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin2026-03-11🤖 cs.AI

Let's Verify Math Questions Step by Step

This paper introduces MathQ-Verify, a novel five-stage pipeline that rigorously filters ill-posed or under-specified mathematical questions through format validation, formalization, contradiction detection, and completeness checks, achieving state-of-the-art performance in curating reliable datasets for training large language models.

Chengyu Shen, Zhen Hao Wong, Runming He, Hao Liang, Meiyi Qiang, Zimo Meng, Zhengyang Zhao, Bohan Zeng, Zhengzhou Zhu, Bin Cui, Wentao Zhang2026-03-11🤖 cs.AI

UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Language Models

The paper introduces UltraEdit, a training-, subject-, and memory-free approach for lifelong language model editing that achieves unprecedented scalability and efficiency by computing parameter shifts in a single step, enabling 7B models to be edited on consumer GPUs with over 2 million updates while outperforming existing methods in speed, memory usage, and accuracy.

Xiaojie Gu, Ziying Huang, Jia-Chen Gu, Kai Zhang2026-03-11🤖 cs.AI

SATURN: SAT-based Reinforcement Learning to Unleash LLMs Reasoning

The paper introduces Saturn, a reinforcement learning framework that leverages Boolean Satisfiability (SAT) problems to overcome scalability, verifiability, and difficulty control limitations in training large language models, resulting in significant reasoning improvements across SAT, math, and programming benchmarks.

Huanyu Liu, Ge Li, Jia Li, Hao Zhu, Kechi Zhang, Yihong Dong2026-03-11🤖 cs.AI

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

The paper introduces Daily-Omni, a comprehensive audio-visual benchmark with 1,197 questions designed to evaluate cross-modal temporal reasoning, revealing that current multimodal large language models still struggle with alignment-critical tasks despite strong unimodal performance.

Ziwei Zhou, Rui Wang, Zuxuan Wu, Yu-Gang Jiang2026-03-11🤖 cs.AI

Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review

This paper presents the first systematic review of integrating foundation models into mobile service robotics, analyzing how these technologies address core challenges in perception and control, enabling applications in domestic and healthcare settings while discussing ethical implications and outlining future directions for safe, scalable, and trustworthy deployment.

Matthew Lisondra, Beno Benhabib, Goldie Nejat2026-03-11💬 cs.CL

Rating Quality of Diverse Time Series Data by Meta-learning from LLM Judgment

The paper proposes TSRating, a novel meta-learning framework that leverages LLMs to generate quality comparisons across diverse domains and trains an efficient TSRater model to accurately and adaptively evaluate time series data quality without requiring extensive hypergradient computations.

Shunyu Wu, Dan Li, Wenjie Feng, Haozheng Ye, Jian Lou, See-Kiong Ng2026-03-11🤖 cs.AI

Cooperative Game-Theoretic Credit Assignment for Multi-Agent Policy Gradients via the Core

This paper proposes CORA, a cooperative game-theoretic credit assignment method that utilizes core allocation and coalition sampling to effectively distribute global advantages among agents in multi-agent reinforcement learning, thereby overcoming the limitations of uniform sharing and enhancing coordinated optimal behavior.

Mengda Ji, Genjiu Xu, Keke Jia, Zekun Duan, Yong Qiu, Jianjun Ge, Mingqiang Li2026-03-11🤖 cs.AI

Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness

The paper introduces ChannelTokenFormer, a unified Transformer-based framework that simultaneously addresses the challenges of complex inter-channel dependencies, asynchronous sampling, and missing values to achieve robust real-world multivariate time series forecasting.

Jinkwan Jang, Hyungjin Park, Jinmyeong Choi, Taesup Kim2026-03-11🤖 cs.AI

ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

The paper proposes ConLID, a supervised contrastive learning approach that learns domain-invariant representations to significantly improve language identification performance for low-resource languages on out-of-domain data while maintaining accuracy for high-resource languages.

Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut2026-03-11🤖 cs.AI

OPENXRD: A Comprehensive Benchmark Framework for LLM/MLLM XRD Question Answering

The paper introduces OPENXRD, a comprehensive benchmark framework featuring 217 expert-curated X-ray diffraction questions that evaluates how large language and multimodal models assimilate domain-specific context, revealing that mid-sized models benefit most from high-quality reference materials while very large models often exhibit saturation or interference.

Ali Vosoughi, Ayoub Shahnazari, Yufeng Xi, Zeliang Zhang, Griffin Hess, Chenliang Xu, Niaz Abdolrahim2026-03-11🤖 cs.AI

On the mechanical creation of mathematical concepts

The paper proposes a model of mathematical problem-solving as a belief-update loop that distinguishes between implicit concept formation, which optimizes search within a fixed vocabulary, and explicit concept creation, which introduces new moves to resolve unsolvable problems and argues that while current AI excels at the former, achieving the latter is essential for machines to replicate the distinctive nature of mathematical discovery.

Asvin G2026-03-11🤖 cs.AI

QSpark: Towards Reliable Qiskit Code Generation

The paper introduces QSpark, a fine-tuned Qwen2.5-Coder-32B model leveraging GRPO and ORPO training on synthetic data to significantly outperform general-purpose LLMs in generating reliable Qiskit code, achieving a 56.29% Pass@1 on Qiskit HumanEval while revealing that advanced quantum programming tasks remain unsolved.

Kiana Kheiri, Aamna Aamir, Andriy Miranskyy + 1 more2026-03-11🤖 cs.AI

Latent Policy Steering with Embodiment-Agnostic Pretrained World Models

This paper introduces Latent Policy Steering (LPS), a method that leverages embodiment-agnostic optical flow to pretrain a World Model on diverse datasets, which is then fine-tuned with limited target-embodiment demonstrations to steer and significantly improve visuomotor policies in low-data regimes.

Yiqi Wang, Mrinal Verghese, Jeff Schneider2026-03-11🤖 cs.AI

MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs

MMGraphRAG addresses the limitations of text-centric GraphRAG by introducing a novel framework that integrates visual scene graphs with text knowledge graphs via spectral clustering-based entity linking (SpecLink) and a new CMEL dataset, achieving state-of-the-art performance in multimodal reasoning and reducing hallucinations.

Xueyao Wan, Hang Yu2026-03-11🤖 cs.AI

Debiasing International Attitudes: LLM Agents for Simulating US-China Perception Changes

This study introduces an LLM-agent framework to simulate U.S. citizens' attitudes toward China from 2005 to 2025, demonstrating that while subjective news framing has a modest impact on negative attitudes, a "devil's advocate" agent is the most effective mechanism for debiasing opinions and producing more human-like cognitive outcomes.

Nicholas Sukiennik, Yichuan Xu, Yuqing Kan, Jinghua Piao, Yuwei Yan, Chen Gao, Yong Li2026-03-11🤖 cs.AI

Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method

This paper proposes SFDA-PFT, a lightweight source-free domain adaptation method that utilizes a pretrained translator to map subject-specific style features in the latent space, enabling effective facial expression recognition on unlabeled neutral target data without requiring source data or unstable image synthesis.

Masoumeh Sharafi, Soufiane Belharbi, Muhammad Osama Zeeshan, Houssem Ben Salem, Ali Etemad, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger2026-03-11🤖 cs.AI

AI Blob! LLM-Driven Recontextualization of Italian Television Archives

Inspired by the iconic Italian television program *Blob*, the paper presents AI Blob!, an experimental LLM-driven system that utilizes semantic cataloging, speech recognition, and retrieval-augmented generation to dynamically retrieve, recontextualize, and algorithmically montage archival Italian television footage into new narrative sequences.

Roberto Balestri2026-03-11💬 cs.CL

← Previous Next →