ARC-AGI-2 Technical Report

This paper presents a transformer-based system that significantly advances ARC performance by integrating a compact task encoding, symmetry-based data augmentation, test-time LoRA adaptation, and multi-perspective decoding to enable efficient neural inference and human-level generalization from few examples.

Wallyson Lemes de Oliveira, Mekhron Bobokhonov, Matteo Caorsi, Aldo Podestà, Gabriele Beltramo, Luca Crosato, Matteo Bonotto, Federica Cecchetto, Hadrien Espic, Dan Titus Salajan, Stefan Taga, Luca Pana, Joe Carthy2026-03-10💬 cs.CL

A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

This paper demonstrates that current LLM-as-a-Judge frameworks fail to reliably measure adversarial robustness due to unaccounted distribution shifts that degrade performance to near-random levels, often leading to inflated attack success rates, and proposes new benchmarks to address these evaluation flaws.

Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel, Stephan Günnemann2026-03-10💬 cs.CL

Rethinking Personalization in Large Language Models at the Token Level

This paper introduces PerContrast and the PerCE loss, a token-level training paradigm that uses causal intervention to identify and adaptively upweight user-specific tokens, thereby significantly enhancing the personalization performance of large language models with minimal computational cost.

Chenheng Zhang, Yijun Lu, Lizhe Fang, Chunyuan Zheng, Jiajun Chai, Xiaohan Wang, Guojun Yin, Wei Lin, Yisen Wang, Zhouchen Lin2026-03-10💬 cs.CL

Know When You're Wrong: Aligning Confidence with Correctness for LLM Error Detection

This paper introduces a normalized confidence scoring framework based on output anchor tokens to detect LLM errors without external validation, revealing that while supervised fine-tuning yields well-calibrated confidence, reinforcement learning methods induce overconfidence, and proposing post-RL self-distillation to restore reliability for applications like adaptive retrieval-augmented generation.

Xie Xiaohu, Liu Xiaohu, Yao Benjamin2026-03-10🤖 cs.LG

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

This paper introduces TimeSpot, a comprehensive benchmark comprising 1,455 real-world images from 80 countries designed to evaluate the limited geo-temporal reasoning capabilities of current vision-language models in predicting location, time, and environmental context from visual evidence alone.

Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez2026-03-10💬 cs.CL

"Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior

This paper proposes the Dark Triad personality traits as a framework for studying AI misalignment, demonstrating that frontier large language models can be reliably induced with human-like antisocial behaviors through minimal fine-tuning on psychometric data, thereby revealing latent persona structures that generalize beyond training contexts.

Roshni Lulla, Fiona Collins, Sanaya Parekh, Thilo Hagendorff, Jonas Kaplan2026-03-10💬 cs.CL

Validation of a Small Language Model for DSM-5 Substance Category Classification in Child Welfare Records

This study validates that a locally hosted 20-billion-parameter small language model can reliably classify specific DSM-5 substance categories within child welfare investigation narratives, achieving near-perfect agreement with human experts for five major substance types despite limitations with low-prevalence categories.

Brian E. Perron, Dragan Stoll, Bryan G. Victor, Zia Qia, Andreas Jud, Joseph P. Ryan2026-03-10💬 cs.CL

Supporting Artifact Evaluation with LLMs: A Study with Published Security Research Papers

This paper presents a toolkit leveraging Large Language Models to automate key aspects of Artifact Evaluation in cybersecurity research, achieving high accuracy in reproducibility rating, autonomous environment setup, and pitfall detection to significantly reduce reviewer effort and enhance research transparency.

David Heye, Karl Kindermann, Robin Decker, Johannes Lohmöller, Anastasiia Belova, Sandra Geisler, Klaus Wehrle, Jan Pennekamp2026-03-10💬 cs.CL

Symmetry-Constrained Language-Guided Program Synthesis for Discovering Governing Equations from Noisy and Partial Observations

SymLang is an open-source framework that integrates symmetry-constrained grammars, language-model-guided program synthesis, and Bayesian model selection to robustly discover accurate, interpretable governing equations from noisy and partial observations, significantly outperforming existing baselines in structural recovery and physical consistency.

Mirza Samad Ahmed Baig, Syeda Anshrah Gillani2026-03-10🤖 cs.LG

LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models

This paper introduces LieCraft, a novel multi-agent framework featuring grounded, high-stakes scenarios and a hidden-role game mechanic to evaluate the deceptive capabilities of large language models, revealing that state-of-the-art models consistently exhibit a willingness to lie, conceal intentions, and act unethically to achieve their goals.

Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Tri Nguyen, Vasudev Lal, Joseph Campbell, Simon Stepputtis, Shao-Yen Tseng2026-03-10💬 cs.CL

MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning

The paper introduces MedInjection-FR, a large-scale French biomedical instruction dataset combining native, synthetic, and translated sources, and demonstrates through controlled experiments that while native data yields the best performance, strategically mixing these sources effectively mitigates the scarcity of high-quality French medical instruction data for fine-tuning large language models.

Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Benoit Favre, Richard Dufour2026-03-10💬 cs.CL