Dynamic Precision Math Engine for Linear Algebra and Trigonometry Acceleration on Xtensa LX6 Microcontrollers

This paper presents a Dynamic Precision Math Engine for ESP32 microcontrollers that combines Q16.16 fixed-point arithmetic, a CORDIC trigonometric module, and a cache-aware matrix kernel in a runtime-switchable architecture, trading off integer efficiency against floating-point precision to achieve significant speedups in linear algebra and trigonometry.

Elian Alfonso Lopez Preciado · Wed, 11 Mar · cs
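
The fixed-point CORDIC idea mentioned in the summary can be sketched briefly. The following is an illustrative Python sketch, not the paper's implementation (the engine targets C on the LX6, and the iteration count and table sizes here are my assumptions): sine and cosine reduce to shifts, adds, and a small arctangent table, entirely in Q16.16 integers.

```python
import math

FRAC_BITS = 16
ONE = 1 << FRAC_BITS                       # 1.0 in Q16.16

def to_q16(x: float) -> int:
    return int(round(x * ONE))

N_ITERS = 16
ATAN_Q16 = [to_q16(math.atan(2.0 ** -i)) for i in range(N_ITERS)]
# CORDIC gain K = prod sqrt(1 + 2^-2i); starting x at 1/K avoids a final rescale.
INV_GAIN_Q16 = to_q16(1.0 / math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i))
                                      for i in range(N_ITERS)))

def cordic_sin_cos(angle_rad: float) -> tuple[float, float]:
    """Rotation-mode CORDIC; valid for angles in roughly [-pi/2, pi/2]."""
    z = to_q16(angle_rad)
    x, y = INV_GAIN_Q16, 0
    for i in range(N_ITERS):               # each step rotates by +/- atan(2^-i)
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_Q16[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_Q16[i]
    return y / ONE, x / ONE                # (sin, cos)
```

Note that Python's `>>` on negative integers is an arithmetic (sign-preserving) shift, matching the usual C behavior this kernel would rely on.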

Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores

This paper presents the first direct programming of FP64 tensor cores on NVIDIA GPUs to accelerate high-order finite element simulations within the MFEM library, achieving up to a 2× speedup and 83% energy-efficiency gains while demonstrating near-perfect weak scaling across nearly 10,000 GPUs on the Alps exascale system.

Jiqun Tu, Ian Karlin, John Camier, Veselin Dobrev, Tzanio Kolev, Stefan Henneking, Omar Ghattas · Wed, 11 Mar · cs

The qsqs Inequality: Quantifying the Double Penalty of Mixture-of-Experts at Inference

This paper introduces the qsqs inequality to demonstrate that Mixture-of-Experts (MoE) models suffer from a structural "double penalty" of routing fragmentation and memory constraints during inference, often rendering them significantly less efficient than quality-matched dense models for long-context serving despite their training-time FLOP advantages.

Vignesh Adhinarayanan, Nuwan Jayasena · Wed, 11 Mar · cs.LG
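
The routing-fragmentation half of the "double penalty" is easy to illustrate with arithmetic. The sketch below is not the paper's qsqs inequality (whose exact form is not reproduced here); it just shows, under an assumed uniform-routing model, why MoE GEMMs shrink at inference while the full expert pool still has to stream from memory.

```python
def expected_tokens_per_expert(batch_tokens: int, top_k: int, num_experts: int) -> float:
    """Under roughly uniform routing, each expert sees ~batch * k / E tokens."""
    return batch_tokens * top_k / num_experts

def weight_bytes_touched(params_per_expert: int, num_experts: int,
                         bytes_per_param: int = 2) -> int:
    """Memory-side penalty: at modest batch sizes every expert is likely hit,
    so all E experts' weights stream from memory even though each token
    activates only k of them."""
    return params_per_expert * num_experts * bytes_per_param

# A dense FFN processes all 1024 tokens in one large GEMM; a 64-expert,
# top-2 MoE layer sees ~32 tokens per expert GEMM: far lower arithmetic
# intensity for the same total token count.
dense_gemm_tokens = expected_tokens_per_expert(1024, 1, 1)
moe_gemm_tokens = expected_tokens_per_expert(1024, 2, 64)
```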

ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

ARKV is a lightweight, adaptive framework that dynamically allocates precision levels to KV cache tokens based on per-layer attention dynamics and token importance, achieving a 4× reduction in memory usage while preserving ~97% of baseline accuracy for long-context LLM inference without requiring retraining or architectural modifications.

Jianlong Lei, Shashikant Ilager · Wed, 11 Mar · cs.AI
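
The general shape of importance-driven mixed-precision KV caching can be sketched as follows. This is a hedged toy, not ARKV's actual scoring or allocation policy: it holds the highest-importance tokens at 8 bits and quantizes the rest to 4, with the split fraction and bit widths chosen here purely for illustration.

```python
def quantize_uniform(xs: list[float], bits: int) -> list[float]:
    """Symmetric uniform quantization to `bits` bits, dequantized for comparison."""
    levels = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in xs) or 1.0) / levels
    return [round(v / scale) * scale for v in xs]

def compress_kv(kv: list[list[float]], importance: list[float],
                hi_frac: float = 0.25) -> list[list[float]]:
    """Keep the top `hi_frac` tokens (by importance score) at 8 bits, rest at 4."""
    order = sorted(range(len(kv)), key=importance.__getitem__, reverse=True)
    hi = set(order[:max(1, int(len(kv) * hi_frac))])
    return [quantize_uniform(tok, 8 if i in hi else 4) for i, tok in enumerate(kv)]
```

The 4× memory figure in the abstract suggests most tokens end up near 4 bits; the mechanism above is just one simple way such an allocation could be expressed.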

Unveiling the Potential of Quantization with MXFP4: Strategies for Quantization Error Reduction

This paper introduces two software-only techniques, Overflow-Aware Scaling (OAS) and Macro Block Scaling (MBS), that significantly reduce the accuracy gap between the hardware-efficient MXFP4 format and NVIDIA's NVFP4 standard in Large Language Models, achieving near-parity performance with minimal computational overhead.

Jatin Chhugani, Geonhwa Jeong, Bor-Yiing Su, Yunjie Pan, Hanmei Yang, Aayush Ankit, Jiecao Yu, Summer Deng, Yunqing Chen, Nadathur Satish, Changkyu Kim · Wed, 11 Mar · cs.AI
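
The failure mode these techniques target can be shown with a toy MXFP4 quantizer. This is a hedged sketch, not the paper's OAS or MBS algorithms: MXFP4 blocks share a power-of-two scale, and the standard scale choice can clip the block maximum at FP4's top value of 6; the `overflow_aware` flag below bumps the scale by one step when that would happen, which is my simplified reading of the idea.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # E2M1 magnitudes

def fp4_round(x: float) -> float:
    s = -1.0 if x < 0 else 1.0
    return s * min(FP4_GRID, key=lambda g: abs(abs(x) - g))

def quantize_block(block: list[float], overflow_aware: bool = False) -> list[float]:
    amax = max(abs(v) for v in block) or 1.0
    e = math.floor(math.log2(amax)) - 2            # align amax with FP4 max (6 ~ 2^2)
    if overflow_aware and amax / 2.0 ** e > 6.0:   # block max would clip at 6
        e += 1                                     # bump the shared scale one step
    scale = 2.0 ** e
    return [fp4_round(v / scale) * scale for v in block]
```

With a block max of 7.9, the naive scale clips it to 6.0 (error 1.9), while the bumped scale represents it as 8.0 (error 0.1) at the cost of a coarser grid for the small values.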

RAGPerf: An End-to-End Benchmarking Framework for Retrieval-Augmented Generation Systems

RAGPerf is an open-source, end-to-end benchmarking framework that decouples Retrieval-Augmented Generation (RAG) pipelines into modular components to enable flexible configuration, comprehensive performance and accuracy profiling, and realistic workload simulation with negligible overhead.

Shaobo Li, Yirui Zhou, Yuan Xu, Kevin Chen, Daniel Waddington, Swaminathan Sundararaman, Hubertus Franke, Jian Huang · Thu, 12 Mar · cs
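
The decoupling idea can be sketched in a few lines. This is not RAGPerf's actual API, just a hedged illustration of the architecture the abstract describes: each pipeline stage is a swappable callable, and the harness times every stage separately so retrieval, reranking, and generation can be profiled and reconfigured independently.

```python
import time

def run_pipeline(stages: dict, query: str) -> dict:
    """stages: ordered mapping of stage name -> callable(payload) -> payload."""
    timings, payload = {}, query
    for name, fn in stages.items():
        t0 = time.perf_counter()
        payload = fn(payload)
        timings[name] = time.perf_counter() - t0
    return {"output": payload, "timings_s": timings}

# Toy stand-ins for retriever / reranker / generator components.
pipeline = {
    "retrieve": lambda q: [f"doc about {q}"],
    "rerank":   lambda docs: docs[:1],
    "generate": lambda docs: f"answer from {docs[0]}",
}
result = run_pipeline(pipeline, "KV caches")
```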

Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using F₂

This paper introduces "Linear Layouts," a novel framework that models tensor layouts as linear algebra operations over F₂ to enable generic, efficient, and bug-free layout definitions and conversions for deep learning workloads, successfully integrating with the Triton compiler to overcome the limitations of existing case-by-case approaches.

Keren Zhou, Mario Lezcano, Adam Goucher, Akhmed Rakhmati, Jeff Niu, Justin Lebar, Pawel Szczerbuk, Peter Bell, Phil Tillet, Thomas Raoux, Zahi Moudallal · Mon, 09 Mar · cs
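
The core idea is compact enough to sketch. The Triton implementation is far more general, but in this hedged toy a layout is an F₂-linear map from input index bits to output index bits: store one basis image per input bit and apply the map with XORs, and layout conversion becomes matrix multiplication over F₂.

```python
def apply_layout(bases: list[int], idx: int) -> int:
    """Apply the F2-linear map whose i-th column (basis image) is bases[i]."""
    out = 0
    for bit, basis in enumerate(bases):
        if (idx >> bit) & 1:
            out ^= basis       # XOR is addition over F2
    return out

def compose(outer: list[int], inner: list[int]) -> list[int]:
    """Composing two layouts is just an F2 matrix product of their bases."""
    return [apply_layout(outer, b) for b in inner]

# Example: a 4-bit bank-conflict-style swizzle on a 4x4 tile
# (bits: c0, c1, r0, r1) that XORs column bits with row bits.
swizzle = [0b0001, 0b0010, 0b0101, 0b1010]
```

Because XOR-ing twice cancels, this particular swizzle is its own inverse, and composing it with itself recovers the identity layout `[1, 2, 4, 8]`, a property that falls out of the algebra rather than case-by-case reasoning.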

Parallelization Strategies for Dense LLM Deployment: Navigating Through Application-Specific Tradeoffs and Bottlenecks

This paper investigates parallelization strategies for deploying dense LLMs, demonstrating that while Tensor Parallelism optimizes latency and Pipeline Parallelism enhances throughput, a hybrid approach allows for effective control over the inherent latency-throughput tradeoff to meet specific application requirements.

Burak Topcu, Musa Oguzhan Cim, Poovaiah Palangappa, Meena Arunachalam, Mahmut Taylan Kandemir · Mon, 09 Mar · cs.LG
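
The latency-throughput tradeoff the abstract describes can be captured in a back-of-the-envelope model. All constants and cost formulas below are illustrative assumptions, not measurements from the paper: tensor parallelism divides each layer's compute but pays for all-reduces, while pipeline parallelism leaves per-request latency roughly intact but overlaps micro-batches across stages.

```python
def tensor_parallel_latency(base_ms: float, gpus: int, comm_ms: float) -> float:
    """TP splits each layer's GEMMs, cutting compute latency but adding
    all-reduce cost that grows with the number of participants (toy model)."""
    return base_ms / gpus + comm_ms * gpus

def pipeline_parallel_throughput(base_ms: float, stages: int, batches: int) -> float:
    """PP overlaps micro-batches; total time follows the classic pipeline
    fill-and-drain formula. Returns requests per millisecond."""
    total_ms = (stages + batches - 1) * (base_ms / stages)
    return batches / total_ms
```

Even in this crude model the paper's framing is visible: TP lowers the latency of a single request, PP raises sustained throughput, and a hybrid picks a point between the two.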

Unlocking Python's Cores: Hardware Usage and Energy Implications of Removing the GIL

This study evaluates Python 3.14.2's experimental free-threaded build, revealing that while it significantly improves execution time and energy efficiency for parallelizable workloads, it incurs higher memory usage and increased energy consumption for sequential or highly contended tasks, indicating that its adoption depends on specific workload characteristics rather than offering a universal performance boost.

José Daniel Montoya Salazar · 2026-03-06 · cs
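
The workload split the study measures can be reproduced in miniature. The sketch below is a hedged illustration, not the study's benchmark suite: it detects a free-threaded build at runtime (`sys._is_gil_enabled` exists on 3.13+) and spreads a CPU-bound prime count across threads, which serializes on a GIL build but can run in parallel on a free-threaded one.

```python
import sys
import threading

def gil_enabled() -> bool:
    # sys._is_gil_enabled() exists on 3.13+; assume the GIL on older versions.
    return getattr(sys, "_is_gil_enabled", lambda: True)()

def count_primes(lo: int, hi: int) -> int:
    """CPU-bound trial division over [lo, hi)."""
    return sum(all(n % d for d in range(2, int(n ** 0.5) + 1))
               for n in range(max(lo, 2), hi))

def parallel_count(hi: int, workers: int = 4) -> int:
    step = hi // workers
    results = [0] * workers
    def work(i: int) -> None:
        end = hi if i == workers - 1 else (i + 1) * step
        results[i] = count_primes(i * step, end)
    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Timing this function under `python3.14` versus a free-threaded `python3.14t` build is exactly the kind of comparison the study generalizes, including the memory and energy costs it reports for the sequential case.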