Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

This paper empirically evaluates the robustness of 13 Large Language Models against five structured Chain-of-Thought perturbation types, revealing that while model scaling significantly mitigates math errors, it offers limited protection against unit conversion errors, and that vulnerability patterns differ markedly across corruption types.

Ashwath Vaithinathan Aravindan, Mayank Kejriwal · 2026-03-05 · cs.AI

Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification

This paper proposes a decision-safe framework for ranking large language models that utilizes a contextual Bradley-Terry-Luce model to construct statistically valid confidence sets for prompt-dependent rankings, thereby addressing the limitations of point estimates by quantifying uncertainty and distinguishing meaningful performance differences from noise.

Angel Rodrigo Avelar Menendez, Yufeng Liu, Xiaowu Dai · 2026-03-05 · cs.LG
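The Bradley-Terry-Luce model underlying this ranking framework is standard; a minimal sketch of a contextual variant, in which each model's skill is a function of prompt features, might look as follows. The linear parameterization, feature vectors, and all parameter values here are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

def btl_win_prob(skill_i, skill_j):
    """P(model i beats model j) under the Bradley-Terry-Luce model:
    a logistic function of the skill difference."""
    return 1.0 / (1.0 + np.exp(-(skill_i - skill_j)))

def contextual_skill(theta, x):
    """Prompt-dependent skill: here, a simple linear function of
    prompt features x (an illustrative choice)."""
    return float(theta @ x)

# Two hypothetical LLMs with prompt-dependent skill parameters.
theta_a = np.array([1.0, -0.5])
theta_b = np.array([0.2, 0.8])
x = np.array([1.0, 0.3])  # features of a single prompt

# Probability that model A beats model B on this prompt.
p = btl_win_prob(contextual_skill(theta_a, x), contextual_skill(theta_b, x))
```

The paper's contribution is not this point estimate itself but confidence sets over the induced ranking, so that ties within statistical noise are not reported as strict orderings.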

GreenPhase: A Green Learning Approach for Earthquake Phase Picking

GreenPhase is an efficient, interpretable, and sustainable model based on the Green Learning framework that achieves state-of-the-art earthquake detection and phase picking performance on the STEAD dataset while reducing computational costs by approximately 83% through its feed-forward, multi-resolution architecture that eliminates backpropagation.

Yixing Wu, Shiou-Ya Wang, Dingyi Nie + 5 more · 2026-03-05 · cs.AI

Scalable Contrastive Causal Discovery under Unknown Soft Interventions

This paper proposes a scalable, contrastive causal discovery model that leverages paired observational and single-regime soft interventional data to construct globally consistent causal structures, theoretically proving its ability to recover identifiable edges and outperform non-contrastive methods in both in-distribution and out-of-distribution scenarios.

Mingxuan Zhang, Khushi Desai, Sopho Kevlishvili + 1 more · 2026-03-05 · cs.LG

[Re] FairDICE: A Gap Between Theory And Practice

This replication study of FairDICE, a multi-objective offline reinforcement learning algorithm, finds that while its theoretical claims hold, a critical code error initially reduced the method to standard behavior cloning, and underspecified hyperparameters hindered reproducibility; corrected experiments nonetheless demonstrate its potential to scale to complex environments, despite a reliance on online tuning.

Peter Adema, Karim Galliamov, Aleksey Evstratovskiy + 1 more · 2026-03-05 · cs.LG

Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget

This paper demonstrates that a significant portion of transformer MLP nonlinearity is redundant and context-dependent, showing that a lightweight gating mechanism can dynamically replace these computations with linear surrogates to reduce computational waste or, when applied strategically with full retraining, actively improve model performance by eliminating harmful nonlinearities.

Peter Balogh · 2026-03-05 · cs.LG
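The gating idea in this last abstract can be sketched in a few lines: a scalar gate interpolates between the full nonlinear MLP path and a cheap linear surrogate. The gate form, the choice of surrogate, and the shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gated_mlp(x, W1, W2, W_lin, g):
    """Blend the nonlinear MLP output with a linear surrogate via gate g
    in [0, 1]; g = 1 keeps the full nonlinearity, g = 0 bypasses it."""
    nonlinear = gelu(x @ W1) @ W2  # standard two-layer MLP path
    linear = x @ W_lin             # linear surrogate path
    return g * nonlinear + (1.0 - g) * linear

rng = np.random.default_rng(0)
d, h = 4, 8
x = rng.standard_normal(d)
W1 = rng.standard_normal((d, h))
W2 = rng.standard_normal((h, d))
# One simple (assumed) surrogate: the same weights with the activation dropped.
W_lin = W1 @ W2

out = gated_mlp(x, W1, W2, W_lin, g=0.0)  # fully linear routing
```

In the paper the gate is learned and context-dependent; here it is just a fixed scalar to show the interpolation itself.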