Dual-Metric Evaluation of Social Bias in Large Language Models: Evidence from an Underrepresented Nepali Cultural Context

This study evaluates seven state-of-the-art large language models in the underrepresented Nepali cultural context using a Dual-Metric Bias Assessment framework, revealing that while explicit agreement with biased statements is measurable, implicit generative bias is distinct, follows a non-linear relationship with temperature, and is poorly predicted by agreement metrics, thereby highlighting the critical need for culturally grounded datasets and evaluation strategies.

Ashish Pandey, Tek Raj Chhetri · 2026-03-10 · cs.CL

Gradient Iterated Temporal-Difference Learning

This paper introduces Gradient Iterated Temporal-Difference (GTD) learning, a novel algorithm that modifies iterated TD by computing gradients over moving targets to achieve the stability of gradient methods while matching the competitive learning speed of semi-gradient methods across diverse benchmarks like Atari games.

Théo Vincent, Kevin Gerhardt, Yogesh Tripathi, Habib Maraqten, Adam White, Martha White, Jan Peters, Carlo D'Eramo · 2026-03-10 · cs.LG

AI Steerability 360: A Toolkit for Steering Large Language Models

The paper introduces AI Steerability 360, an open-source, Hugging Face-native Python toolkit that provides a unified interface for composing, evaluating, and comparing diverse large language model steering methods across input, structural, state, and output control surfaces.

Erik Miehling, Karthikeyan Natesan Ramamurthy, Praveen Venkateswaran, Irene Ko, Pierre Dognin, Moninder Singh, Tejaswini Pedapati, Avinash Balakrishnan, Matthew Riemer, Dennis Wei, Inge Vejsbjerg, Elizabeth M. Daly, Kush R. Varshney · 2026-03-10 · cs.CL

SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans

The paper introduces SynPlanResearch-R1, a framework that synthesizes tool-use trajectories to encourage deeper exploration during supervised fine-tuning, thereby overcoming the limitations of reinforcement learning with verifiable rewards and significantly improving research agent performance across multiple benchmarks.

Hansi Zeng, Zoey Li, Yifan Gao, Chenwei Zhang, Xiaoman Pan, Tao Yang, Fengran Mo, Jiacheng Lin, Xian Li, Jingbo Shang · 2026-03-10 · cs.CL

Hospitality-VQA: Decision-Oriented Informativeness Evaluation for Vision-Language Models

This paper introduces a formal framework for "informativeness" and a corresponding hospitality-specific VQA dataset to evaluate Vision-Language Models, revealing that while current models struggle with decision-oriented reasoning, their performance improves significantly with modest domain-specific fine-tuning.

Jeongwoo Lee, Baek Duhyeong, Eungyeol Han, Soyeon Shin, Gukin Han, Seungduk Kim, Jaehyun Jeon, Taewoo Jeong · 2026-03-10 · cs.LG

CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

This paper introduces CCR-Bench, a novel benchmark designed to rigorously evaluate large language models on complex, real-world industrial tasks involving entangled content-format requirements and intricate logical workflows, revealing significant performance gaps in even state-of-the-art models.

Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng · 2026-03-10 · cs.CL

Reject, Resample, Repeat: Understanding Parallel Reasoning in Language Model Inference

This paper introduces a particle filtering framework to rigorously analyze the accuracy-cost tradeoffs of parallel inference methods in large language models, establishing theoretical guarantees and identifying fundamental limits while demonstrating that sampling error alone does not fully predict final model accuracy.

Noah Golowich, Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich, Dylan J. Foster, Akshay Krishnamurthy · 2026-03-10 · cs.LG

Designing probabilistic AI monsoon forecasts to inform agricultural decision-making

This paper presents a decision-theory framework and a blended AI-statistical forecasting system that successfully delivered skillful, tailored monsoon onset predictions to 38 million Indian farmers in 2025, enabling better agricultural decision-making under uncertainty.

Colin Aitken, Rajat Masiwal, Adam Marchakitus, Katherine Kowal, Mayank Gupta, Tyler Yang, Amir Jina, Pedram Hassanzadeh, William R. Boos, Michael Kremer · 2026-03-10 · cs.LG

EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records

EveryQuery is a novel electronic health record foundation model that achieves efficient, zero-shot clinical prediction by directly estimating outcome likelihoods through task-conditioned pretraining, outperforming computationally expensive autoregressive baselines, particularly for rare events, while currently facing limitations in complex disjunctive reasoning tasks.

Payal Chandak, Gregory Kondas, Isaac Kohane, Matthew McDermott · 2026-03-10 · cs