Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

This paper presents a practical blueprint for building and optimizing production-scale conversational shopping assistants: it introduces a structured evaluation rubric paired with an LLM-as-judge pipeline, and demonstrates two complementary prompt-optimization strategies, Sub-agent and MAMuT GEPA, that improve multi-agent system performance.

Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu +5 more · 2026-03-05 · cs.AI

ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

ByteFlow introduces a tokenizer-free hierarchical architecture that dynamically learns adaptive byte-level segmentation through compression-driven coding rates and Top-K selection, outperforming traditional subword tokenization by letting models self-organize semantically meaningful units directly from raw byte streams.

Chunyuan Deng, Sanket Lokegaonkar, Colin Lockard +3 more · 2026-03-05 · cs.LG

Order Is Not Layout: Order-to-Space Bias in Image Generation

This paper identifies and quantifies "Order-to-Space Bias" (OTS), a systematic flaw in modern image generation models where the textual order of entities incorrectly dictates their spatial layout, and demonstrates that this data-driven issue can be effectively mitigated through targeted fine-tuning and early-stage interventions without compromising generation quality.

Yongkang Zhang, Zonglin Zhao, Yuechen Zhang +3 more · 2026-03-05 · cs.AI

MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier

MOOSE-Star is a unified framework that overcomes the mathematical intractability of directly training scientific discovery models by decomposing the generative reasoning process into tractable subtasks and employing motivation-guided hierarchical search, thereby enabling scalable training and continuous test-time scaling while reducing complexity from exponential to logarithmic.

Zonglin Yang, Lidong Bing · 2026-03-05 · cs.LG

T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

This paper introduces Structure-of-Thought (SoT), a prompting technique that enhances model performance by guiding the construction of intermediate text structures, and presents T2S-Bench, the first comprehensive benchmark for evaluating and improving text-to-structure reasoning capabilities across diverse scientific domains and tasks.

Qinsi Wang, Hancheng Ye, Jinhee Kim +12 more · 2026-03-05 · cs.AI