What Is Missing: Interpretable Ratings for Large Language Model Outputs
This paper introduces the "What Is Missing" (WIM) rating system, which uses sentence-embedding similarity to convert natural-language feedback about output deficiencies into interpretable scalar ratings. Compared with traditional discrete numerical ratings, WIM provides a richer preference-learning signal and supports qualitative debugging.
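As a rough illustration of the embedding-similarity idea only (not the paper's actual implementation), the sketch below substitutes a toy bag-of-words embedding for a pretrained sentence encoder and treats similarity between feedback and a hypothetical set of known deficiency descriptions as the scalar rating signal; all names and the scoring rule are assumptions:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would presumably use
    # a pretrained sentence encoder (assumption, not the paper's method).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Standard cosine similarity over sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def wim_rating(feedback: str, deficiency_anchors: list[str]) -> float:
    # Hypothetical WIM-style scoring: the closer the feedback is to a
    # known deficiency description, the lower (worse) the scalar rating.
    sims = [cosine(embed(feedback), embed(a)) for a in deficiency_anchors]
    return 1.0 - max(sims, default=0.0)

anchors = [
    "the answer is missing key details",
    "the response omits the question's constraints",
]
# Feedback matching a deficiency anchor yields a rating near 0 (deficient);
# unrelated feedback yields a rating near 1 (no known deficiency flagged).
print(wim_rating("the answer is missing key details", anchors))
print(wim_rating("totally unrelated words here", anchors))
```

The anchor set makes the rating interpretable: one can inspect which deficiency description the feedback matched, rather than debugging an opaque numeric score.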