FinSheet-Bench: From Simple Lookups to Complex Reasoning, Where LLMs Break on Financial Spreadsheets

FinSheet-Bench introduces a synthetic benchmark modeled on real private equity fund structures to evaluate LLMs on financial spreadsheet tasks, revealing that even the best-performing models currently lack the accuracy required for unsupervised professional use, particularly on complex, large-scale documents, and suggesting that reliable extraction will require separating document understanding from deterministic computation.

Jan Ravnik, Matjaž Ličen, Felix Bührmann, Bithiah Yuan, Felix Stinson, Tanvi Singh · 2026-03-10 · 💻 cs

Norm-Hierarchy Transitions in Representation Learning: When and Why Neural Networks Abandon Shortcuts

This paper introduces the Norm-Hierarchy Transition (NHT) framework, which explains that neural networks delay learning structured representations in favor of spurious shortcuts because weight decay slowly drives the model from high-norm shortcut solutions to lower-norm structured ones, with the transition delay scaling logarithmically with the ratio between these norms.

Truong Xuan Khanh, Truong Quynh Hoa · 2026-03-10 · 🤖 cs.LG

VisualScratchpad: Inference-time Visual Concepts Analysis in Vision Language Models

This paper introduces VisualScratchpad, an interactive inference-time analysis tool that leverages sparse autoencoders and attention mechanisms to visualize and debug vision language models by linking visual concepts to text tokens, thereby revealing previously underexplored failure modes such as limited cross-modal alignment and misleading visual concepts.

Hyesu Lim, Jinho Choi, Taekyung Kim, Byeongho Heo, Jaegul Choo, Dongyoon Han · 2026-03-10 · 💻 cs

Agora: Teaching the Skill of Consensus-Finding with AI Personas Grounded in Human Voice

The paper introduces Agora, an AI-powered platform that leverages LLMs to simulate diverse human perspectives on policy issues, enabling users to practice consensus-building and demonstrating through a preliminary study that access to authentic voice explanations significantly enhances problem-solving skills and the quality of collective decisions compared to viewing aggregate data alone.

Suyash Fulay, Prerna Ravi, Emily Kubin, Shrestha Mohanty, Michiel Bakker, Deb Roy · 2026-03-10 · 💻 cs

AgrI Challenge: A Data-Centric AI Competition for Cross-Team Validation in Agricultural Vision

The AgrI Challenge introduces a data-centric competition framework featuring Cross-Team Validation to demonstrate that while single-source training suffers from significant generalization gaps in agricultural vision, collaborative multi-source training on independently collected, heterogeneous datasets dramatically improves model robustness and real-world performance.

Mohammed Brahimi, Karim Laabassi, Mohamed Seghir Hadj Ameur, Aicha Boutorh, Badia Siab-Farsi, Amin Khouani, Omar Farouk Zouak, Seif Eddine Bouziane, Kheira Lakhdari, Abdelkader Nabil Benghanem · 2026-03-10 · 🤖 cs.LG

Latent Generative Models with Tunable Complexity for Compressed Sensing and other Inverse Problems

This paper introduces tunable-complexity priors for generative models like diffusion models, normalizing flows, and VAEs by leveraging nested dropout, demonstrating that adaptively adjusting model dimensionality significantly improves reconstruction performance across various inverse problems compared to fixed-complexity baselines.
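The nested-dropout idea the summary refers to is a well-known trick for imposing an ordered importance on latent dimensions: during training, a random truncation point is sampled and all latent units past it are zeroed, so earlier units survive more often and carry coarser information. The sketch below is a minimal, generic illustration of that mechanism (not the paper's implementation); the geometric truncation distribution and the helper name `nested_dropout_mask` are assumptions for illustration.

```python
import numpy as np

def nested_dropout_mask(dim, rng, p=0.1):
    """Sample a truncation index k (geometric with parameter p, capped at
    dim) and keep only the first k latent dimensions, zeroing the rest.
    Earlier dimensions are kept more often, so they learn to carry the
    coarsest, most important structure."""
    k = min(rng.geometric(p), dim)      # random truncation point, k >= 1
    mask = np.zeros(dim)
    mask[:k] = 1.0                      # contiguous prefix of ones
    return mask

rng = np.random.default_rng(0)
z = rng.standard_normal(8)              # a latent code
z_truncated = z * nested_dropout_mask(8, rng)  # decode from truncated code
```

At inference time on an inverse problem, the same prefix structure lets one pick the truncation point (i.e., the model complexity) to match the measurement budget, rather than being locked to a fixed latent dimensionality.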

Sean Gunn, Jorio Cocola, Oliver De Candido, Vaggos Chatziafratis, Paul Hand · 2026-03-10 · 🤖 cs.LG

The Yerkes-Dodson Curve for AI Agents: Emergent Cooperation Under Environmental Pressure in Multi-Agent LLM Simulations

This paper demonstrates that environmental pressure in multi-agent LLM simulations follows a Yerkes-Dodson inverted-U relationship, where medium stress optimizes emergent cooperative trade while extreme pressure causes behavioral collapse, and suggests that calibrating such pressure serves as an effective curriculum design strategy for agent development.

Ivan Pasichnyk · 2026-03-10 · 💻 cs

Scaling Laws in the Tiny Regime: How Small Models Change Their Mistakes

This paper reveals that in the sub-20M parameter "tiny" regime, models follow steeper but non-uniform scaling laws where increasing size not only reduces overall error but fundamentally alters the structure of mistakes, shifts capacity from easy to hard classes, and paradoxically degrades calibration, necessitating validation at the specific target model size for edge AI deployment.

Mohammed Alnemari, Rizwan Qureshi, Nader Begrazadah · 2026-03-10 · 🤖 cs.LG

Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios

This paper addresses the challenge of domain-specific machine translation quality estimation in low-resource scenarios by demonstrating that while prompt-only methods are fragile for open-weight models, adapting intermediate Transformer layers via Low-Rank Adaptation (ALOPE) and Low-Rank Multiplicative Adaptation (LoRMA) significantly improves robustness and performance across English-to-Indic language pairs.

Namrata Patil Gurav, Akashdeep Ranu, Archchana Sindhujan, Diptesh Kanojia · 2026-03-10 · 🤖 cs.LG

SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions

This Systematization of Knowledge (SoK) paper establishes the first unified framework for Agentic Retrieval-Augmented Generation (RAG) by formalizing autonomous loops as decision-making processes, proposing a comprehensive taxonomy and architectural decomposition, critiquing current evaluation limitations and systemic risks, and outlining critical research directions for building reliable and scalable agentic systems.

Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali, Shiva Gaire · 2026-03-10 · 💬 cs.CL