Complexity-Regularized Proximal Policy Optimization

This paper introduces Complexity-Regularized Proximal Policy Optimization (CR-PPO), a novel algorithm that replaces standard entropy regularization with a self-regulating complexity term, defined as the product of Shannon entropy and disequilibrium. The term maintains beneficial stochasticity while reducing sensitivity to hyperparameter tuning and preventing the regularizer from overriding the reward signal.

Luca Serfilippi, Giorgio Franceschelli, Antonio Corradi + 1 more · 2026-03-06 · cs
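
Entropy times disequilibrium is the classic López-Ruiz-Mancini-Calbet (LMC) statistical complexity, and the summary's definition reads naturally that way. Below is a minimal sketch assuming that reading, with disequilibrium taken as the squared distance from the uniform distribution; the function names and the `beta` coefficient are illustrative, not from the paper.

```python
import numpy as np

def lmc_complexity(probs: np.ndarray) -> float:
    """Statistical complexity C = H * D (LMC-style) of an action distribution.

    Disequilibrium D is taken here as the squared Euclidean distance from
    the uniform distribution; the paper's exact normalization may differ.
    """
    n = probs.size
    eps = 1e-12
    entropy = -np.sum(probs * np.log(probs + eps))     # Shannon entropy H
    disequilibrium = np.sum((probs - 1.0 / n) ** 2)    # distance from uniform D
    return float(entropy * disequilibrium)

def cr_ppo_loss(ratio, advantage, probs, clip_eps=0.2, beta=0.01):
    """Hypothetical per-sample objective: PPO's clipped surrogate with the
    complexity bonus substituted for the usual entropy bonus."""
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = -min(ratio * advantage, clipped * advantage)
    return surrogate - beta * lmc_complexity(probs)
```

Note that the product vanishes for both a uniform policy (zero disequilibrium) and a deterministic one (zero entropy), so the bonus cannot dominate the reward at either extreme; this is one plausible mechanism behind the self-regulating behavior the summary describes.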

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

This paper demonstrates that narrow finetuning leaves distinct, interpretable traces in LLM activations that can be extracted via model diffing to reconstruct characteristics of the training data and aid interpretability. The authors warn that such narrowly finetuned models may not represent broader finetuning scenarios, and suggest that mixing in pretraining data mitigates these overfitting traces.

Julian Minder, Clément Dumas, Stewart Slocum + 4 more · 2026-03-06 · cs
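
A minimal sketch of one common model-diffing recipe consistent with the summary: compare hidden states of the base and finetuned models on the same neutral prompts and take the per-layer mean difference as a "diff direction". The data layout and helper name are assumptions; the paper's actual pipeline and tooling are richer than this.

```python
import numpy as np

def activation_diff_directions(base_acts: dict, tuned_acts: dict) -> dict:
    """Per-layer mean activation difference between base and finetuned model.

    base_acts / tuned_acts: layer_name -> array of shape (n_prompts, d_model),
    hidden states collected on the same (finetuning-unrelated) prompts for
    both models. Returns a unit direction per layer; in narrowly finetuned
    models this direction tends to encode the finetuning domain.
    """
    directions = {}
    for layer, base in base_acts.items():
        delta = tuned_acts[layer].mean(axis=0) - base.mean(axis=0)
        directions[layer] = delta / (np.linalg.norm(delta) + 1e-12)
    return directions
```

Downstream, such a direction can be inspected (e.g., by projecting it through the unembedding to see which tokens it promotes) to read off what the narrow finetuning was about.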

SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

This paper introduces SceneCOT, a novel framework for grounded question answering in 3D scenes that decouples complex reasoning into manageable steps tied to visual clues. Supported by the newly created SCENECOT-185K dataset, the framework achieves state-of-the-art performance and represents the first successful application of Chain-of-Thought reasoning to 3D scene understanding.

Xiongkun Linghu, Jiangyong Huang, Ziyu Zhu + 2 more · 2026-03-06 · cs
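
To make the "decoupled steps with visual clues" idea concrete, here is a heavily simplified sketch of a staged grounded-QA loop: classify the task, ground candidate objects as clues, then reason over the grounded evidence. All callables and attributes (`reasoner`, `grounder`, `clue.id`) are hypothetical stand-ins, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class GroundedStep:
    thought: str          # natural-language reasoning for this step
    object_ids: list      # 3D object instances cited as visual clues

def grounded_scene_qa(question, scene, reasoner, grounder):
    """Hypothetical staged chain-of-thought over a 3D scene."""
    task = reasoner.classify_task(question)                   # step 1: task type
    clues = grounder.locate(question, scene, task=task)       # step 2: ground visual clues
    steps = [GroundedStep(reasoner.explain(question, c), [c.id]) for c in clues]
    return reasoner.answer(question, steps)                   # step 3: answer from grounded steps
```

The point of the staging is that each reasoning step cites concrete grounded objects rather than free-floating text, which is what distinguishes grounded chain-of-thought from ordinary CoT prompting.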