Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

This paper demonstrates that narrow finetuning leaves distinct, interpretable biases in LLM activations, and that these traces can be extracted via model diffing to reconstruct characteristics of the finetuning data and to aid interpretability. The authors caution that such narrowly finetuned models may not accurately represent broader finetuning scenarios, and suggest that mixing in pretraining data mitigates these overfitting traces.
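The core idea of reading finetuning traces out of activation differences can be illustrated with a toy sketch. The code below is not the paper's method: it uses synthetic stand-in activations (NumPy arrays in place of real residual-stream activations) and assumes a narrow finetune shifts activations along a roughly constant direction, then shows that the mean base-vs-finetuned activation difference recovers that direction.

```python
import numpy as np

def activation_difference(base_acts, ft_acts):
    """Mean per-dimension difference between finetuned and base
    activations over a batch of tokens (shape: (n_tokens, d_model))."""
    return (ft_acts - base_acts).mean(axis=0)

rng = np.random.default_rng(0)
d_model, n_tokens = 16, 100

# Toy stand-in for base-model activations on some generic text.
base = rng.normal(size=(n_tokens, d_model))

# Hypothetical "narrow finetuning" bias: a constant shift along one direction.
bias_dir = np.zeros(d_model)
bias_dir[3] = 1.0
ft = base + 0.5 * bias_dir

diff = activation_difference(base, ft)
dominant = int(np.argmax(np.abs(diff)))
print(dominant)  # the difference vector is dominated by the injected direction
```

In practice the analogous quantity would come from comparing hidden states of the base and finetuned model on the same prompts; the point of the sketch is only that a systematic finetuning shift shows up directly in the averaged activation difference.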

Julian Minder, Clément Dumas, Stewart Slocum, + 4 more · 2026-03-06 · cs