Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

This paper presents a practical blueprint for building and optimizing production-scale conversational shopping assistants: it introduces a structured evaluation rubric paired with an LLM-as-judge pipeline, and demonstrates two complementary prompt-optimization strategies, Sub-agent and MAMuT GEPA, that improve multi-agent system performance.

Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu +5 more · 2026-03-05 · cs.AI

ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

ByteFlow introduces a tokenizer-free hierarchical architecture that dynamically learns adaptive byte-level segmentation through compression-driven coding rates and Top-K selection, outperforming traditional subword tokenization by letting models self-organize semantically meaningful units directly from raw byte streams.

Chunyuan Deng, Sanket Lokegaonkar, Colin Lockard +3 more · 2026-03-05 · cs.LG

Order Is Not Layout: Order-to-Space Bias in Image Generation

This paper identifies and quantifies "Order-to-Space Bias" (OTS), a systematic flaw in modern image generation models where the textual order of entities incorrectly dictates their spatial layout, and demonstrates that this data-driven issue can be effectively mitigated through targeted fine-tuning and early-stage interventions without compromising generation quality.

Yongkang Zhang, Zonglin Zhao, Yuechen Zhang +3 more · 2026-03-05 · cs.AI

MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier

MOOSE-Star is a unified framework that overcomes the mathematical intractability of directly training scientific discovery models by decomposing the generative reasoning process into tractable subtasks and employing motivation-guided hierarchical search, thereby enabling scalable training and continuous test-time scaling while reducing complexity from exponential to logarithmic.

Zonglin Yang, Lidong Bing · 2026-03-05 · cs.LG

T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

This paper introduces Structure-of-Thought (SoT), a prompting technique that enhances model performance by guiding the construction of intermediate text structures, and presents T2S-Bench, the first comprehensive benchmark for evaluating and improving text-to-structure reasoning capabilities across diverse scientific domains and tasks.

Qinsi Wang, Hancheng Ye, Jinhee Kim +12 more · 2026-03-05 · cs.AI