Dynamic Precision Math Engine for Linear Algebra and Trigonometry Acceleration on Xtensa LX6 Microcontrollers

This paper presents a Dynamic Precision Math Engine for ESP32 microcontrollers that combines Q16.16 fixed-point arithmetic, a CORDIC trigonometric module, and a cache-aware matrix kernel in a runtime-switchable architecture, trading off integer efficiency against floating-point precision to achieve significant speedups in linear algebra and trigonometry.

Elian Alfonso Lopez Preciado · Wed, 11 Mar · cs
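
The fixed-point CORDIC idea mentioned in the summary can be sketched briefly. The following is an illustrative Python sketch, not the paper's implementation (the engine targets C on the LX6, and the iteration count and table sizes here are my assumptions): sine and cosine reduce to shifts, adds, and a small arctangent table, entirely in Q16.16 integers.

```python
import math

FRAC_BITS = 16
ONE = 1 << FRAC_BITS                       # 1.0 in Q16.16

def to_q16(x: float) -> int:
    return int(round(x * ONE))

N_ITERS = 16
ATAN_Q16 = [to_q16(math.atan(2.0 ** -i)) for i in range(N_ITERS)]
# CORDIC gain K = prod sqrt(1 + 2^-2i); starting x at 1/K avoids a final rescale.
INV_GAIN_Q16 = to_q16(1.0 / math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i))
                                      for i in range(N_ITERS)))

def cordic_sin_cos(angle_rad: float) -> tuple[float, float]:
    """Rotation-mode CORDIC; valid for angles in roughly [-pi/2, pi/2]."""
    z = to_q16(angle_rad)
    x, y = INV_GAIN_Q16, 0
    for i in range(N_ITERS):               # each step rotates by +/- atan(2^-i)
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_Q16[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_Q16[i]
    return y / ONE, x / ONE                # (sin, cos)
```

Note that Python's `>>` on negative integers is an arithmetic (sign-preserving) shift, matching the usual C behavior this kernel would rely on.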

Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores

This paper presents the first direct programming of FP64 tensor cores on NVIDIA GPUs to accelerate high-order finite element simulations within the MFEM library, achieving up to a 2× speedup and 83% energy-efficiency gains while demonstrating near-perfect weak scaling across nearly 10,000 GPUs on the Alps exascale system.

Jiqun Tu, Ian Karlin, John Camier, Veselin Dobrev, Tzanio Kolev, Stefan Henneking, Omar Ghattas · Wed, 11 Mar · cs

The qsqs Inequality: Quantifying the Double Penalty of Mixture-of-Experts at Inference

This paper introduces the qsqs inequality to demonstrate that Mixture-of-Experts (MoE) models suffer from a structural "double penalty" of routing fragmentation and memory constraints during inference, often rendering them significantly less efficient than quality-matched dense models for long-context serving despite their training-time FLOP advantages.

Vignesh Adhinarayanan, Nuwan Jayasena · Wed, 11 Mar · cs.LG
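
The routing-fragmentation half of the "double penalty" is easy to illustrate with arithmetic. The sketch below is not the paper's qsqs inequality (whose exact form is not reproduced here); it just shows, under an assumed uniform-routing model, why MoE GEMMs shrink at inference while the full expert pool still has to stream from memory.

```python
def expected_tokens_per_expert(batch_tokens: int, top_k: int, num_experts: int) -> float:
    """Under roughly uniform routing, each expert sees ~batch * k / E tokens."""
    return batch_tokens * top_k / num_experts

def weight_bytes_touched(params_per_expert: int, num_experts: int,
                         bytes_per_param: int = 2) -> int:
    """Memory-side penalty: at modest batch sizes every expert is likely hit,
    so all E experts' weights stream from memory even though each token
    activates only k of them."""
    return params_per_expert * num_experts * bytes_per_param

# A dense FFN processes all 1024 tokens in one large GEMM; a 64-expert,
# top-2 MoE layer sees ~32 tokens per expert GEMM: far lower arithmetic
# intensity for the same total token count.
dense_gemm_tokens = expected_tokens_per_expert(1024, 1, 1)
moe_gemm_tokens = expected_tokens_per_expert(1024, 2, 64)
```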

ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

ARKV is a lightweight, adaptive framework that dynamically allocates precision levels to KV cache tokens based on per-layer attention dynamics and token importance, achieving a 4× reduction in memory usage while preserving ~97% of baseline accuracy for long-context LLM inference without requiring retraining or architectural modifications.

Jianlong Lei, Shashikant Ilager · Wed, 11 Mar · cs.AI
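
The general shape of importance-driven mixed-precision KV caching can be sketched as follows. This is a hedged toy, not ARKV's actual scoring or allocation policy: it holds the highest-importance tokens at 8 bits and quantizes the rest to 4, with the split fraction and bit widths chosen here purely for illustration.

```python
def quantize_uniform(xs: list[float], bits: int) -> list[float]:
    """Symmetric uniform quantization to `bits` bits, dequantized for comparison."""
    levels = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in xs) or 1.0) / levels
    return [round(v / scale) * scale for v in xs]

def compress_kv(kv: list[list[float]], importance: list[float],
                hi_frac: float = 0.25) -> list[list[float]]:
    """Keep the top `hi_frac` tokens (by importance score) at 8 bits, rest at 4."""
    order = sorted(range(len(kv)), key=importance.__getitem__, reverse=True)
    hi = set(order[:max(1, int(len(kv) * hi_frac))])
    return [quantize_uniform(tok, 8 if i in hi else 4) for i, tok in enumerate(kv)]
```

The 4× memory figure in the abstract suggests most tokens end up near 4 bits; the mechanism above is just one simple way such an allocation could be expressed.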

Unveiling the Potential of Quantization with MXFP4: Strategies for Quantization Error Reduction

This paper introduces two software-only techniques, Overflow-Aware Scaling (OAS) and Macro Block Scaling (MBS), that significantly reduce the accuracy gap between the hardware-efficient MXFP4 format and NVIDIA's NVFP4 standard in Large Language Models, achieving near-parity performance with minimal computational overhead.

Jatin Chhugani, Geonhwa Jeong, Bor-Yiing Su, Yunjie Pan, Hanmei Yang, Aayush Ankit, Jiecao Yu, Summer Deng, Yunqing Chen, Nadathur Satish, Changkyu Kim · Wed, 11 Mar · cs.AI
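
The failure mode these techniques target can be shown with a toy MXFP4 quantizer. This is a hedged sketch, not the paper's OAS or MBS algorithms: MXFP4 blocks share a power-of-two scale, and the standard scale choice can clip the block maximum at FP4's top value of 6; the `overflow_aware` flag below bumps the scale by one step when that would happen, which is my simplified reading of the idea.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # E2M1 magnitudes

def fp4_round(x: float) -> float:
    s = -1.0 if x < 0 else 1.0
    return s * min(FP4_GRID, key=lambda g: abs(abs(x) - g))

def quantize_block(block: list[float], overflow_aware: bool = False) -> list[float]:
    amax = max(abs(v) for v in block) or 1.0
    e = math.floor(math.log2(amax)) - 2            # align amax with FP4 max (6 ~ 2^2)
    if overflow_aware and amax / 2.0 ** e > 6.0:   # block max would clip at 6
        e += 1                                     # bump the shared scale one step
    scale = 2.0 ** e
    return [fp4_round(v / scale) * scale for v in block]
```

With a block max of 7.9, the naive scale clips it to 6.0 (error 1.9), while the bumped scale represents it as 8.0 (error 0.1) at the cost of a coarser grid for the small values.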

RAGPerf: An End-to-End Benchmarking Framework for Retrieval-Augmented Generation Systems

RAGPerf is an open-source, end-to-end benchmarking framework that decouples Retrieval-Augmented Generation (RAG) pipelines into modular components to enable flexible configuration, comprehensive performance and accuracy profiling, and realistic workload simulation with negligible overhead.

Shaobo Li, Yirui Zhou, Yuan Xu, Kevin Chen, Daniel Waddington, Swaminathan Sundararaman, Hubertus Franke, Jian Huang · Thu, 12 Mar · cs
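
The decoupling idea can be sketched in a few lines. This is not RAGPerf's actual API, just a hedged illustration of the architecture the abstract describes: each pipeline stage is a swappable callable, and the harness times every stage separately so retrieval, reranking, and generation can be profiled and reconfigured independently.

```python
import time

def run_pipeline(stages: dict, query: str) -> dict:
    """stages: ordered mapping of stage name -> callable(payload) -> payload."""
    timings, payload = {}, query
    for name, fn in stages.items():
        t0 = time.perf_counter()
        payload = fn(payload)
        timings[name] = time.perf_counter() - t0
    return {"output": payload, "timings_s": timings}

# Toy stand-ins for retriever / reranker / generator components.
pipeline = {
    "retrieve": lambda q: [f"doc about {q}"],
    "rerank":   lambda docs: docs[:1],
    "generate": lambda docs: f"answer from {docs[0]}",
}
result = run_pipeline(pipeline, "KV caches")
```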

Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using F₂

This paper introduces "Linear Layouts," a novel framework that models tensor layouts as linear algebra operations over F₂ to enable generic, efficient, and bug-free layout definitions and conversions for deep learning workloads, successfully integrating with the Triton compiler to overcome the limitations of existing case-by-case approaches.

Keren Zhou, Mario Lezcano, Adam Goucher, Akhmed Rakhmati, Jeff Niu, Justin Lebar, Pawel Szczerbuk, Peter Bell, Phil Tillet, Thomas Raoux, Zahi Moudallal · Mon, 09 Mar · cs
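
The core idea is compact enough to sketch. The Triton implementation is far more general, but in this hedged toy a layout is an F₂-linear map from input index bits to output index bits: store one basis image per input bit and apply the map with XORs, and layout conversion becomes matrix multiplication over F₂.

```python
def apply_layout(bases: list[int], idx: int) -> int:
    """Apply the F2-linear map whose i-th column (basis image) is bases[i]."""
    out = 0
    for bit, basis in enumerate(bases):
        if (idx >> bit) & 1:
            out ^= basis       # XOR is addition over F2
    return out

def compose(outer: list[int], inner: list[int]) -> list[int]:
    """Composing two layouts is just an F2 matrix product of their bases."""
    return [apply_layout(outer, b) for b in inner]

# Example: a 4-bit bank-conflict-style swizzle on a 4x4 tile
# (bits: c0, c1, r0, r1) that XORs column bits with row bits.
swizzle = [0b0001, 0b0010, 0b0101, 0b1010]
```

Because XOR-ing twice cancels, this particular swizzle is its own inverse, and composing it with itself recovers the identity layout `[1, 2, 4, 8]`, a property that falls out of the algebra rather than case-by-case reasoning.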

Parallelization Strategies for Dense LLM Deployment: Navigating Through Application-Specific Tradeoffs and Bottlenecks

This paper investigates parallelization strategies for deploying dense LLMs, demonstrating that while Tensor Parallelism optimizes latency and Pipeline Parallelism enhances throughput, a hybrid approach allows for effective control over the inherent latency-throughput tradeoff to meet specific application requirements.

Burak Topcu, Musa Oguzhan Cim, Poovaiah Palangappa, Meena Arunachalam, Mahmut Taylan Kandemir · Mon, 09 Mar · cs.LG
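
The latency-throughput tradeoff the abstract describes can be captured in a back-of-the-envelope model. All constants and cost formulas below are illustrative assumptions, not measurements from the paper: tensor parallelism divides each layer's compute but pays for all-reduces, while pipeline parallelism leaves per-request latency roughly intact but overlaps micro-batches across stages.

```python
def tensor_parallel_latency(base_ms: float, gpus: int, comm_ms: float) -> float:
    """TP splits each layer's GEMMs, cutting compute latency but adding
    all-reduce cost that grows with the number of participants (toy model)."""
    return base_ms / gpus + comm_ms * gpus

def pipeline_parallel_throughput(base_ms: float, stages: int, batches: int) -> float:
    """PP overlaps micro-batches; total time follows the classic pipeline
    fill-and-drain formula. Returns requests per millisecond."""
    total_ms = (stages + batches - 1) * (base_ms / stages)
    return batches / total_ms
```

Even in this crude model the paper's framing is visible: TP lowers the latency of a single request, PP raises sustained throughput, and a hybrid picks a point between the two.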

Unlocking Python's Cores: Hardware Usage and Energy Implications of Removing the GIL

This study evaluates Python 3.14.2's experimental free-threaded build, revealing that while it significantly improves execution time and energy efficiency for parallelizable workloads, it incurs higher memory usage and increased energy consumption for sequential or highly contended tasks, indicating that its adoption depends on specific workload characteristics rather than offering a universal performance boost.

José Daniel Montoya Salazar · 2026-03-06 · cs
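
The workload split the study measures can be reproduced in miniature. The sketch below is a hedged illustration, not the study's benchmark suite: it detects a free-threaded build at runtime (`sys._is_gil_enabled` exists on 3.13+) and spreads a CPU-bound prime count across threads, which serializes on a GIL build but can run in parallel on a free-threaded one.

```python
import sys
import threading

def gil_enabled() -> bool:
    # sys._is_gil_enabled() exists on 3.13+; assume the GIL on older versions.
    return getattr(sys, "_is_gil_enabled", lambda: True)()

def count_primes(lo: int, hi: int) -> int:
    """CPU-bound trial division over [lo, hi)."""
    return sum(all(n % d for d in range(2, int(n ** 0.5) + 1))
               for n in range(max(lo, 2), hi))

def parallel_count(hi: int, workers: int = 4) -> int:
    step = hi // workers
    results = [0] * workers
    def work(i: int) -> None:
        end = hi if i == workers - 1 else (i + 1) * step
        results[i] = count_primes(i * step, end)
    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Timing this function under `python3.14` versus a free-threaded `python3.14t` build is exactly the kind of comparison the study generalizes, including the memory and energy costs it reports for the sequential case.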