cs.CL papers | Gist.Science

Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models

This paper introduces "Model Medicine," a comprehensive clinical research program that treats AI models as organism-like systems, establishing a taxonomy of subdisciplines, a behavioral genetics framework, and novel diagnostic tools such as Neural MRI to systematically understand, diagnose, and treat model disorders.

Jihoon Jeong | 2026-03-06 | 💻 cs

Solving an Open Problem in Theoretical Physics using AI-Assisted Discovery

This paper presents a neuro-symbolic AI system that autonomously solved an open problem in theoretical physics by deriving novel, exact analytical solutions for the gravitational radiation power spectrum of cosmic strings, thereby surpassing previous partial results and demonstrating AI's potential to accelerate mathematical discovery.

Michael P. Brenner, Vincent Cohen-Addad, David Woodruff | 2026-03-06 | 💻 cs

Interactive Benchmarks

This paper proposes "Interactive Benchmarks," a unified evaluation paradigm that assesses model intelligence through active information acquisition and reasoning under budget constraints in interactive proofs and games, demonstrating that current models still have significant room for improvement in these dynamic scenarios.

Baoqing Yue, Zihan Zhu, Yifan Zhang + 3 more | 2026-03-06 | 💻 cs

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

The paper introduces IF-RewardBench, a comprehensive meta-evaluation benchmark that utilizes a listwise preference graph paradigm to more accurately assess judge models for instruction-following, demonstrating its superior correlation with downstream task performance compared to existing benchmarks.

Bosi Wen, Yilin Niu, Cunxiang Wang + 5 more | 2026-03-06 | 💻 cs

DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

The paper introduces DARE, a lightweight retrieval model that integrates data distribution information with function metadata to significantly improve R package retrieval and LLM agent performance in statistical analysis tasks, supported by a new knowledge base and evaluation framework.

Maojun Sun, Yue Wu, Yifei Xie + 5 more | 2026-03-06 | 💻 cs

HiMAP-Travel: Hierarchical Multi-Agent Planning for Long-Horizon Constrained Travel

HiMAP-Travel is a hierarchical multi-agent framework that addresses long-horizon travel planning with hard constraints by employing a Coordinator and parallel Day Executors linked through transactional monitoring and bargaining protocols, achieving state-of-the-art performance and reduced latency on TravelPlanner and FlexTravelBench benchmarks.

The Viet Bui, Wenjun Li, Yong Liu | 2026-03-06 | 💻 cs

Stacked from One: Multi-Scale Self-Injection for Context Window Extension

The paper proposes SharedLLM, a novel framework that extends the context window of large language models to over 128K tokens using a multi-scale self-injection architecture with stacked short-context models and a tree-based retrieval structure, achieving superior performance and efficiency without requiring costly long-context pre-training.

Wei Han, Pan Zhou, Shuicheng Yan | 2026-03-06 | 💻 cs

TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings

TSEmbed is a universal multimodal embedding framework that addresses task conflict by synergizing Mixture-of-Experts with Low-Rank Adaptation and introducing Expert-Aware Negative Sampling to achieve state-of-the-art performance on both benchmark and industrial datasets.

Yebo Wu, Feng Liu, Ziwei Xie + 4 more | 2026-03-06 | 💻 cs

Privacy-Aware Camera 2.0 Technical Report

This paper proposes a novel privacy-preserving perception framework that utilizes an AI Flow-based edge-cloud architecture to transform raw images into mathematically irreconstructible abstract feature vectors at the source, thereby enabling secure behavior recognition and semantic reconstruction via dynamic contours while completely eliminating visual data leakage in sensitive environments.

Huan Song, Shuyu Tian, Ting Long + 5 more | 2026-03-06 | 💻 cs

Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction

This paper introduces RLSTA, a reinforcement learning framework that leverages single-turn capabilities as stable anchors to overcome "Contextual Inertia," thereby enabling large language models to effectively integrate new information and maintain robust reasoning in multi-turn interactions.

Xingwu Chen, Zhanqiu Zhang, Yiwen Guo + 1 more | 2026-03-06 | 💻 cs

Beyond Linear LLM Invocation: An Efficient and Effective Semantic Filter Paradigm

This paper proposes Clustering-Sampling-Voting (CSV), a novel framework that significantly reduces the linear latency and token costs of semantic filtering in large language models by embedding tuples into semantic clusters, sampling subsets for evaluation, and inferring cluster-level labels through voting strategies, thereby achieving sublinear complexity with strong error guarantees.

Nan Hou, Kangfei Zhao, Jiadong Xie + 1 more | 2026-03-06 | 💻 cs

Attention's Gravitational Field: A Power-Law Interpretation of Positional Correlation

This paper introduces the Attention Gravitational Field (AGF) concept, which decouples positional encodings from semantic embeddings to optimize Large Language Model architecture, demonstrating that positional correlations follow a power-law distribution consistent with Newton's Law of Universal Gravitation and yielding superior accuracy and interpretability.

Edward Zhang | 2026-03-06 | 💻 cs

Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

This paper compares fact-based memory systems against long-context LLMs for persistent agents, finding that while long-context models generally offer superior factual recall, fact-based memory provides a more cost-effective solution for long-term interactions by maintaining fixed per-turn costs after an initial write phase, with a specific break-even point determined by context length and interaction volume.

Natchanon Pollertlam, Witchayut Kornsuwannawit | 2026-03-06 | 💬 cs.CL

Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses

This meta-analysis of 890 results reveals that AI models struggle with automated short-answer scoring due to architectural limitations, vocabulary constraints, and biases, often performing worse on tasks humans find easy and exhibiting racial discrimination in educational contexts.

Michael Hardy | 2026-03-06 | 💬 cs.CL

From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

This paper proposes GDS, a novel method for detecting pre-training data in Large Language Models by analyzing systematic gradient deviations—specifically update magnitudes, locations, and neuron activation patterns—that distinguish familiar samples from unfamiliar ones, achieving state-of-the-art performance and superior cross-dataset transferability compared to existing likelihood-based or heuristic approaches.

Ruiqi Zhang, Lingxiang Wang, Hainan Zhang + 2 more | 2026-03-06 | 💬 cs.CL

An Approach to Simultaneous Acquisition of Real-Time MRI Video, EEG, and Surface EMG for Articulatory, Brain, and Muscle Activity During Speech Production

This paper presents a novel framework for the simultaneous acquisition of real-time MRI, EEG, and surface EMG to capture brain, muscle, and articulatory activity during speech, featuring a specialized artifact suppression pipeline to overcome technical challenges and enable unprecedented insights into speech neuroscience.

Jihwan Lee, Parsa Razmara, Kevin Huang + 16 more | 2026-03-06 | 🤖 cs.AI

Why Is RLHF Alignment Shallow? A Gradient Analysis

This paper proves that standard RLHF alignment is inherently shallow because gradient signals vanish once a sequence's harmfulness is determined, and it proposes a recovery penalty objective to ensure alignment gradients persist throughout the entire generation process.

Robin Young | 2026-03-06 | 🤖 cs.LG

SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts

SinhaLegal presents a high-quality, 2-million-word corpus of 1,206 Sinhala legislative documents (Acts and Bills) that has been rigorously processed and evaluated to serve as a foundational resource for advancing natural language processing tasks like information extraction, summarization, and legal analysis in the Sinhala language.

Minduli Lasandi, Nevidu Jayatilleke | 2026-03-06 | 💬 cs.CL

HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents

HACHIMI is a scalable, multi-agent framework that generates a theory-aligned and distribution-controllable corpus of 1 million synthetic student personas, demonstrating strong fidelity to human responses on educational constructs while providing a standardized resource for educational LLM benchmarking and social science simulations.

Yilin Jiang, Fei Tan, Xuanyu Yin + 2 more | 2026-03-06 | 💬 cs.CL

FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications

This paper introduces FireBench, a new open-source benchmark comprising over 2,400 real-world enterprise and API-driven samples across six capability dimensions, designed to evaluate and improve instruction following in LLMs beyond traditional chat-based constraints.

Yunfan Zhang, Yijie Bei, Jetashree Ravi + 1 more | 2026-03-06 | 💬 cs.CL
