PRECEPT: Planning Resilience via Experience, Context Engineering & Probing Trajectories - A Unified Framework for Test-Time Adaptation with Compositional Rule Learning and Pareto-Guided Prompt Evolution

PRECEPT is a unified test-time adaptation framework that enhances LLM agent resilience by integrating deterministic exact-match rule retrieval, conflict-aware memory with Bayesian reliability, and the Pareto-guided COMPASS prompt-evolution loop to achieve superior compositional generalization, continuous learning, and robustness against knowledge drift and adversarial inputs.

Arash Shahmansoori · 2026-03-11 · cs.AI
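The summary mentions conflict-aware memory with Bayesian reliability but does not specify PRECEPT's actual update rule; a standard Beta-Bernoulli posterior, which reliability tracking of this kind commonly builds on, can be sketched as follows (the class name and prior counts are illustrative assumptions, not the paper's method):

```python
from dataclasses import dataclass

@dataclass
class RuleReliability:
    """Beta-Bernoulli reliability tracker (illustrative, not PRECEPT's actual rule):
    alpha counts successes, beta counts failures, starting from a uniform prior."""
    alpha: float = 1.0  # prior pseudo-count of successes
    beta: float = 1.0   # prior pseudo-count of failures

    def update(self, success: bool) -> None:
        # Each observed rule application updates the posterior counts.
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        # Posterior mean reliability of the rule.
        return self.alpha / (self.alpha + self.beta)

r = RuleReliability()
for outcome in [True, True, False, True]:
    r.update(outcome)
print(round(r.mean, 3))  # (1+3)/(1+3+1+1) = 4/6 ≈ 0.667
```

The posterior mean shrinks noisy early observations toward the prior, which is one common way a memory can down-weight rules that conflict with recent evidence.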

MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

This paper introduces MiniAppBench, a comprehensive benchmark derived from real-world data to evaluate LLMs' ability to generate principle-driven interactive HTML applications, alongside MiniAppEval, an agentic framework that uses browser automation to assess these applications across intention, static, and dynamic dimensions.

Zuhao Zhang, Chengyue Yu, Yuante Li, Chenyi Zhuang, Linjian Mo, Shuai Li · 2026-03-11 · cs.AI

When to Lock Attention: Training-Free KV Control in Video Diffusion

KV-Lock is a training-free framework for DiT-based video diffusion models that dynamically adjusts background key-value locking and classifier-free guidance scales based on hallucination detection to simultaneously enhance foreground quality and maintain background consistency.

Tianyi Zeng, Jincheng Gao, Tianyi Wang, Zijie Meng, Miao Zhang, Jun Yin, Haoyuan Sun, Junfeng Jiao, Christian Claudel, Junbo Tan, Xueqian Wang · 2026-03-11 · cs.AI
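How KV-Lock schedules its guidance scales is not detailed in this summary; classifier-free guidance itself, which the method adjusts, combines unconditional and conditional denoiser outputs in the standard way below (a minimal sketch with a hypothetical scale `w`, not the paper's dynamic schedule):

```python
import numpy as np

def cfg_combine(eps_uncond: np.ndarray, eps_cond: np.ndarray, w: float) -> np.ndarray:
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one. w = 1 recovers the conditional prediction;
    w > 1 amplifies the conditioning signal."""
    return eps_uncond + w * (eps_cond - eps_uncond)

uncond = np.zeros(4)
cond = np.ones(4)
print(cfg_combine(uncond, cond, 2.0))  # each entry is 0 + 2*(1 - 0) = 2
```

A training-free controller like the one described would presumably vary `w` per step or per region in response to its hallucination signal, rather than using a single fixed value.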

GNNs for Time Series Anomaly Detection: An Open-Source Framework and a Critical Evaluation

This paper introduces an open-source framework for Graph Neural Network-based Time Series Anomaly Detection to enable reproducible experimentation and critical evaluation, demonstrating that GNNs enhance both detection performance and interpretability while highlighting the need for standardized metrics and thresholding strategies.

Federico Bello, Gonzalo Chiarlone, Marcelo Fiori, Gastón García González, Federico Larroca · 2026-03-11 · cs.AI
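The thresholding strategies the paper calls to standardize are not specified in this summary; one common baseline flags points whose anomaly score exceeds the mean by a multiple of the standard deviation (a sketch, with k = 3 as an assumed choice, not the framework's API):

```python
import numpy as np

def threshold_anomalies(scores: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Flag time steps whose anomaly score exceeds mean + k * std of all scores."""
    mu, sigma = scores.mean(), scores.std()
    return scores > mu + k * sigma

rng = np.random.default_rng(0)
scores = rng.normal(0.0, 1.0, 1000)  # stand-in for per-step GNN anomaly scores
scores[10] = 25.0                    # inject one obvious anomaly
flags = threshold_anomalies(scores)
print(bool(flags[10]))  # the injected point is flagged
```

The sensitivity of results to `k` (or to quantile-based alternatives) is exactly why the paper argues for standardized thresholding in evaluations.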

Logics-Parsing-Omni Technical Report

This paper introduces the Omni Parsing framework and the Logics-Parsing-Omni model, which unify document, image, and audio-visual parsing through a three-level hierarchical paradigm of holistic detection, fine-grained recognition, and multi-level interpreting to transform unstructured multimodal signals into traceable, evidence-based structured knowledge.

Xin An, Jingyi Cai, Xiangyang Chen, Huayao Liu, Peiting Liu, Peng Wang, Bei Yang, Xiuwen Zhu, Yongfan Chen, Baoyu Hou, Shuzhao Li, Weidong Ren, Fan Yang, Jiangtao Zhang, Xiaoxiao Xu, Lin Qu · 2026-03-11 · cs.AI

EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

The paper introduces EsoLang-Bench, a benchmark that uses esoteric programming languages to probe genuine reasoning in large language models: models that score highly on standard benchmarks drop to near-zero accuracy on tasks that require acquiring a new language through documentation and experimentation rather than through memorization.

Aman Sharma, Paras Chopra · 2026-03-11 · cs.AI

Automatic Cardiac Risk Management Classification using Large-Context Electronic Patient Health Records

This study demonstrates that a custom Transformer architecture outperforms both traditional machine learning models and zero-shot generative LLMs in automatically classifying cardiac risk from large-context, unstructured Dutch electronic health records, offering a robust alternative to manual administrative coding for geriatric cardiovascular risk management.

Jacopo Vitale, David Della Morte, Luca Bacco, Mario Merone, Mark de Groot, Saskia Haitjema, Leandro Pecchia, Bram van Es · 2026-03-11 · cs.AI

AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering

This paper introduces AutoViVQA, a large-scale automatically constructed dataset for Vietnamese Visual Question Answering, and evaluates transformer-based multimodal models alongside various automatic metrics to assess their performance and alignment with human judgment in the Vietnamese context.

Nguyen Anh Tuong, Phan Ba Duc, Nguyen Trung Quoc, Tran Dac Thinh, Dang Duy Lan, Nguyen Quoc Thinh, Tung Le · 2026-03-11 · cs.AI

ESAinsTOD: A Unified End-to-End Schema-Aware Instruction-Tuning Framework for Task-Oriented Dialog Modeling

The paper proposes ESAinsTOD, a unified end-to-end schema-aware instruction-tuning framework that leverages full-parameter LLM fine-tuning with instruction and schema alignment mechanisms to achieve superior performance, generalization in low-resource settings, and robustness against noise across diverse task-oriented dialog benchmarks.

Dechuan Teng, Chunlin Lu, Libo Qin, Wanxiang Che · 2026-03-11 · cs.AI

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

The paper introduces ActiveUltraFeedback, an efficient active learning pipeline that leverages uncertainty estimates and novel selection strategies like Double Reverse Thompson Sampling to generate high-quality preference data, enabling Large Language Models to achieve superior alignment performance with as little as one-sixth of the annotated data required by static baselines.

Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause · 2026-03-11 · cs.AI
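The paper's Double Reverse Thompson Sampling is novel and not described in this summary; standard Bernoulli Thompson sampling, on which it presumably builds, selects the candidate with the highest draw from its Beta posterior (a generic sketch, not the paper's selection strategy):

```python
import random

def thompson_select(stats: list[tuple[int, int]], rng: random.Random) -> int:
    """Bernoulli Thompson sampling: draw from each arm's Beta posterior and
    pick the arm with the highest sample. stats[i] = (successes, failures),
    with a Beta(1, 1) uniform prior added to each count."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in stats]
    return max(range(len(draws)), key=draws.__getitem__)

rng = random.Random(42)
# Arm 1 has a much better empirical success rate, so it wins most draws,
# while arm 0 is still occasionally explored.
stats = [(2, 8), (9, 1)]
picks = [thompson_select(stats, rng) for _ in range(100)]
print(picks.count(1) > picks.count(0))  # arm 1 is chosen far more often
```

In a preference-data setting, the "arms" would be candidate prompts or response pairs, and the posterior draws encode the uncertainty estimates the pipeline uses to decide what is worth annotating.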

OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences

The paper introduces OOD-MMSafe, a benchmark revealing significant causal blindness in current Multimodal Large Language Models regarding hidden consequences, and proposes the Consequence-Aware Safety Policy Optimization (CASPO) framework to effectively mitigate these risks by shifting safety alignment from intent detection to consequence projection.

Ming Wen, Kun Yang, Jingyu Zhang, Yuxuan Liu, Shiwen Cui, Shouling Ji, Xingjun Ma · 2026-03-11 · cs.AI

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

This paper introduces MUGEN, a comprehensive benchmark revealing that Large Audio-Language Models struggle with multi-audio understanding as input scaling increases, and demonstrates that combining training-free strategies like Audio-Permutational Self-Consistency with Chain-of-Thought can significantly improve performance.

Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo, Ping-Le Tsai, Yen-Ting Piao, Hung-Wei Chen, Ting-Lin Hsiao, Yun-Man Hsu, Ke-Han Lu, Hung-yi Lee · 2026-03-11 · cs.AI

EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

This paper introduces EXPLORE-Bench, a benchmark derived from real first-person videos to evaluate the ability of multimodal large language models to perform long-horizon egocentric scene prediction, revealing significant performance gaps compared to humans and demonstrating that stepwise reasoning offers partial improvements at a computational cost.

Chengjun Yu, Xuhan Zhu, Chaoqun Du, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha · 2026-03-11 · cs.AI

Ego: Embedding-Guided Personalization of Vision-Language Models

The paper proposes "Ego," an efficient personalization method for vision-language models that extracts visual tokens representing target concepts via internal attention mechanisms to serve as memory, enabling strong performance across single-concept, multi-concept, and video personalization tasks without requiring additional training stages or external modules.

Soroush Seifi, Simon Gardier, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi · 2026-03-11 · cs.AI