MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

This paper introduces MMTU, a large-scale benchmark comprising over 28,000 questions across 25 real-world expert-level table tasks, designed to comprehensively evaluate and reveal the significant limitations of current frontier models in understanding, reasoning, and manipulating structured tabular data.

Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish · 2026-03-10 · cs.LG

EROICA: Online Performance Troubleshooting for Large-scale Model Training

This paper presents EROICA, the first online troubleshooting system deployed on production-scale GPU clusters (~100,000 GPUs) that effectively diagnoses complex hardware and software performance issues in large-scale model training through fine-grained profiling and differential observability with minimal impact.

Yu Guan, Zhiyu Yin, Haoyu Chen, Sheng Cheng, Chaojie Yang, Kun Qian, Tianyin Xu, Pengcheng Zhang, Yang Zhang, Hanyu Zhao, Yong Li, Wei Lin, Dennis Cai, Ennan Zhai · 2026-03-10 · cs.LG

BemaGANv2: Discriminator Combination Strategies for GAN-based Vocoders in Long-Term Audio Generation

BemaGANv2 is an advanced GAN-based vocoder that enhances long-term audio generation for Text-to-Music and Text-to-Audio applications by integrating Anti-aliased Multi-Periodicity composition modules in the generator and systematically evaluating novel discriminator combination strategies, including the Multi-Envelope Discriminator, to achieve high-fidelity and temporally coherent results.

Taesoo Park, Mungwi Jeong, Mingyu Park, Narae Kim, Junyoung Kim, Mujung Kim, Jisang Yoo, Hoyun Lee, Sanghoon Kim, Soonchul Kwon · 2026-03-10 · cs.LG

Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback

This paper introduces two efficient algorithms, Slate-GLM-OFU and Slate-GLM-TS, for the Logistic Contextual Slate Bandit problem that achieve $\tilde{O}(\sqrt{T})$ regret and $N^{O(1)}$ per-round computational complexity by combining local planning with global learning, demonstrating superior performance in both synthetic benchmarks and practical language model applications.

Tanmay Goyal, Gaurav Sinha · 2026-03-10 · cs.LG

Sharpness-Aware Machine Unlearning

This paper characterizes how Sharpness-Aware Minimization (SAM) behaves during machine unlearning, showing that it abandons its denoising properties when fitting forget signals, which alters generalization. It then proposes "Sharp MinMax", a method that splits the model to simultaneously learn retain signals via SAM and unlearn forget signals via sharpness maximization, achieving superior unlearning performance, reduced feature entanglement, and enhanced privacy.

Haoran Tang, Rajiv Khanna · 2026-03-10 · cs.LG

DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy

DemoDiffusion is a one-shot imitation learning method that enables robots to perform diverse manipulation tasks by leveraging kinematic retargeting to derive a rough trajectory from a single human demonstration and refining it with a pre-trained diffusion policy to ensure alignment with plausible robot actions, achieving significantly higher success rates than baseline approaches without requiring task-specific training or paired data.

Sungjae Park, Homanga Bharadhwaj, Shubham Tulsiani · 2026-03-10 · cs.LG

Towards Practical Benchmarking of Data Cleaning Techniques: On Generating Authentic Errors via Large Language Models

This paper introduces TableEG, a framework that leverages fine-tuned large language models to generate authentic, distribution-aligned synthetic errors in tabular data, thereby addressing the scarcity of real-world error datasets and establishing a robust benchmark for evaluating data cleaning techniques.

Xinyuan Liu, Jiahui Chen, Bocheng Hu, Yu Sun, Xinyang Chen, Shaoxu Song, Yongxin Tong · 2026-03-10 · cs.LG

Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification

This paper identifies a pervasive "agreement bias" in Multimodal LLM verifiers that causes them to over-validate agent behavior, and proposes a lightweight Self-Grounded Verification (SGV) method that significantly improves failure detection and task completion across web navigation, computer use, and robotics by decoupling prior generation from trajectory evaluation.

Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, Zsolt Kira · 2026-03-10 · cs.LG

Weak-to-Strong Generalization with Failure Trajectories: A Tree-based Approach to Elicit Optimal Policy in Strong Models

This paper proposes a tree-based Weak-to-Strong Generalization framework that leverages Monte Carlo Tree Search to organize both successful and failure trajectories from weak models, thereby significantly enhancing the reasoning and decision-making capabilities of strong models in complex interactive environments.

Ruimeng Ye, Zihan Wang, Yang Xiao, Zinan Ling, Manling Li, Bo Hui · 2026-03-10 · cs.LG

Exposing the Illusion of Fairness: Auditing Vulnerabilities to Distributional Manipulation Attacks

This paper investigates how malicious auditees can construct fairness-compliant yet representative-looking samples from non-compliant distributions to deceive auditors, formalizes these manipulation strategies using optimal transport and entropic projections, and proposes statistical tests to detect such distributional manipulation attacks.

Valentin Lafargue, Adriana Laurindo Monteiro, Emmanuelle Claeys, Laurent Risser, Jean-Michel Loubes · 2026-03-10 · cs.LG

Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models

This paper introduces a Dynamic, Automatic, and Systematic (DAS) red-teaming framework that exposes a critical "Benchmarking Gap" in medical large language models, revealing that despite high static benchmark scores, most models exhibit profound brittleness, privacy leaks, bias, and hallucinations when subjected to continuous, adversarial stress-testing.

Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hongwei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, Fenglin Liu, Yuyuan Liu, Niklas Bubeck, Christian Wachinger, Chen (Cherise) Chen, Zhenyu Gong, Cheng Ouyang, Georgios Kaissis, Benedikt Wiestler, Daniel Rueckert · 2026-03-10 · cs.LG