Scalable and Performant Data Loading

This paper introduces SPDL, an open-source, framework-agnostic library that significantly accelerates GPU data loading by leveraging concurrent thread pool execution with GIL release, achieving up to 74% faster iteration and reduced resource usage compared to PyTorch DataLoader while demonstrating further performance gains with Free-Threaded Python.

Moto Hira, Christian Puhrsch, Valentin Andrei, Roman Malinovskyy, Gael Le Lan, Abhinandan Krishnan, Joseph Cummings, Victor Bourgin, Olga Gerasimova, Miguel Martin, Gokul Gunasekaran, Yuta Inoue, Alex J Turner, Raghuraman Krishnamoorthi
Wed, 11 Ma (cs)
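SPDL's own API is not shown in the summary; as a minimal sketch of the underlying idea (overlapping data loads in a thread pool, relying on the GIL being released during blocking I/O or native decode kernels), assuming only the standard library:

```python
from concurrent.futures import ThreadPoolExecutor

def load(path):
    # Blocking file I/O releases the GIL, so threads overlap usefully even
    # without multiprocessing; GIL-releasing decode kernels (or Free-Threaded
    # Python) extend this benefit to CPU-bound preprocessing stages.
    with open(path, "rb") as f:
        return f.read()

def iter_samples(paths, workers=8):
    # map() preserves input order while up to `workers` loads run concurrently
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(load, paths)
```

Compared with process-based loading, this avoids serialization of samples across process boundaries, which is one source of the resource savings the summary describes.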

The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation

This paper proposes a Capability Coherence System (CCS) that maps memory consistency models to identity management, demonstrating through simulation that a Release Consistency-directed revocation strategy (RCC) achieves a constant bound on unauthorized operations independent of agent velocity, thereby outperforming traditional time-bounded approaches by orders of magnitude in high-speed agentic environments.

Vladyslav Parakhin
Wed, 11 Ma (cs)

Ensuring Data Freshness in Multi-Rate Task Chains Scheduling

This paper proposes a task-based scheduling framework that ensures end-to-end data freshness in safety-critical multi-rate systems by introducing a Consensus Offset Search algorithm to align task releases with data lifespan constraints, thereby eliminating the artificial latency of Logical Execution Time and the inefficiency of redundant oversampling while preserving Global EDF schedulability.

José Luis Conradi Hoffmann, Antônio Augusto Fröhlich
Wed, 11 Ma (cs)
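The paper's Consensus Offset Search is not detailed in the summary; as a toy illustration of the general idea (choosing task release offsets to bound the age of data flowing between periodic tasks), a brute-force sketch for a two-task producer/consumer chain, with all names and parameters assumed:

```python
from math import gcd

def worst_case_age(p_prod, p_cons, offset, horizon):
    # Worst age of the producer's most recent sample, as observed at each
    # consumer release in [offset, horizon)
    worst, t = 0, offset
    while t < horizon:
        last_write = (t // p_prod) * p_prod   # latest producer release <= t
        worst = max(worst, t - last_write)
        t += p_cons
    return worst

def best_offset(p_prod, p_cons):
    # Brute-force over one producer period, evaluated over one hyperperiod
    hyper = p_prod * p_cons // gcd(p_prod, p_cons)
    return min(range(p_prod),
               key=lambda o: worst_case_age(p_prod, p_cons, o, o + hyper))
```

For periods 10 and 15, offset 0 bounds the observed data age at 5 time units, whereas offset 1 worsens it to 6; aligning releases this way avoids both the fixed full-period delay of Logical Execution Time and the need to oversample.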

Flash-KMeans: Fast and Memory-Efficient Exact K-Means

This paper introduces Flash-KMeans, an IO-aware and contention-free GPU implementation that eliminates memory bottlenecks in the assignment stage and resolves atomic write contention in the update stage through novel kernel-level innovations, achieving up to 17.9× speedup over existing baselines and enabling k-means as a high-performance online primitive.

Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Xiaoze Fan, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Kurt Keutzer, Song Han, Chenfeng Xu, Ion Stoica
Wed, 11 Ma (cs)
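The kernel-level details are not in the summary; as a NumPy reference for what one exact Lloyd iteration computes, chunked to loosely mirror blockwise GPU processing (per-chunk partial sums reduced serially stand in for the paper's contention-free accumulation):

```python
import numpy as np

def kmeans_step(X, C, chunk=1024):
    # One exact k-means (Lloyd) iteration: assign, then update centers.
    k = C.shape[0]
    sums = np.zeros_like(C)
    counts = np.zeros(k, dtype=np.int64)
    for s in range(0, len(X), chunk):
        xb = X[s:s + chunk]
        # assignment stage: argmin_c ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2
        d = (xb * xb).sum(1, keepdims=True) - 2.0 * xb @ C.T + (C * C).sum(1)
        a = d.argmin(1)
        # update stage: accumulate per-chunk partial sums and counts
        np.add.at(sums, a, xb)
        counts += np.bincount(a, minlength=k)
    nz = counts > 0
    C_new = C.copy()                       # keep empty clusters unchanged
    C_new[nz] = sums[nz] / counts[nz, None]
    return C_new
```

The GPU bottlenecks the paper targets live in exactly these two stages: the distance matrix `d` dominates memory traffic, and naive per-point atomic adds into `sums` cause the write contention that Flash-KMeans is said to eliminate.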

PIM-SHERPA: Software Method for On-device LLM Inference by Resolving PIM Memory Attribute and Layout Inconsistencies

This paper introduces PIM-SHERPA, a software-only method that resolves memory attribute and layout inconsistencies in product-level PIM-enabled systems to enable efficient on-device LLM inference, achieving significant memory capacity savings while maintaining near-theoretical performance.

Sunjung Lee, Sanghoon Cha, Hyeonsu Kim, Seungwoo Seo, Yuhwan Ro, Sukhan Lee, Byeongho Kim, Yongjun Park, Kyomin Sohn, Seungwon Lee, Jaehoon Yu
Wed, 11 Ma (cs)

Rate-Distortion Bounds for Heterogeneous Random Fields on Finite Lattices

This paper establishes a finite-blocklength rate-distortion framework for heterogeneous random fields on finite lattices that explicitly incorporates tile-based processing constraints, providing non-asymptotic bounds and a second-order expansion to quantify the effects of spatial correlation, heterogeneity, and tile size on compression performance.

Sujata Sinha, Vishwas Rao, Robert Underwood, David Lenz, Sheng Di, Franck Cappello, Lingjia Liu
Wed, 11 Ma (math)

Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores

This paper presents the first direct programming of FP64 tensor cores on NVIDIA GPUs to accelerate high-order finite element simulations within the MFEM library, achieving up to 2× performance and 83% energy efficiency gains while demonstrating near-perfect weak scaling across nearly 10,000 GPUs on the Alps exascale system.

Jiqun Tu, Ian Karlin, John Camier, Veselin Dobrev, Tzanio Kolev, Stefan Henneking, Omar Ghattas
Wed, 11 Ma (cs)

Randomized Distributed Function Computation (RDFC): Ultra-Efficient Semantic Communication Applications to Privacy

This paper introduces the Randomized Distributed Function Computation (RDFC) framework, a semantic communication approach that achieves local differential privacy and significantly reduces transmission rates compared to lossless methods, even in scenarios without shared randomness, by leveraging strong coordination metrics and randomized function generation.

Onur Günlü
Wed, 11 Ma (eess)
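RDFC itself is not specified in the summary; the classic one-bit randomized response mechanism is the simplest example of the local differential privacy guarantee it targets, shown here as a sketch (all function names assumed):

```python
import math
import random

def randomized_response(bit, eps, rng):
    # Report the true bit with probability e^eps / (e^eps + 1), flip it
    # otherwise; this satisfies eps-local differential privacy.
    p = math.exp(eps) / (math.exp(eps) + 1.0)
    return bit if rng.random() < p else 1 - bit

def debias_mean(reported_mean, eps):
    # E[report] = (2p - 1) * b + (1 - p), so invert the known flip rate
    # to recover an unbiased estimate of the true mean of the bits.
    p = math.exp(eps) / (math.exp(eps) + 1.0)
    return (reported_mean + p - 1.0) / (2.0 * p - 1.0)
```

The rate-reduction claim in the summary goes beyond this: rather than transmitting (privatized) data and computing afterward, RDFC transmits only what is needed to coordinate on the randomized function value.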

Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service

This paper introduces Kareto, an adaptive multi-objective optimizer that efficiently navigates the complex configuration space of tiered KV cache storage to dynamically balance cost, throughput, and latency, significantly outperforming static strategies in LLM inference services.

Xianzhe Zheng, Zhengheng Wang, Ruiyan Ma, Rui Wang, Xiyu Wang, Rui Chen, Peng Zhang, Sicheng Pan, Zhangheng Huang, Chenxin Wu, Yi Zhang, Bo Cai, Kan Liu, Teng Ma, Yin Du, Dong Deng, Sai Wu, Guoyun Zhu, Wei Zhang, Feifei Li
Wed, 11 Ma (cs)
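Kareto's optimizer is not described beyond "multi-objective"; the basic building block of any such search is a Pareto filter over candidate configurations, sketched here with an assumed objective tuple (cost, latency, throughput):

```python
def pareto_front(configs):
    # configs: name -> (cost, latency, throughput); lower cost and latency
    # are better, higher throughput is better.
    objs = {n: (c, l, -t) for n, (c, l, t) in configs.items()}

    def dominates(a, b):
        # a dominates b if a is no worse on every objective and differs
        return all(x <= y for x, y in zip(a, b)) and a != b

    return {n for n in objs
            if not any(dominates(objs[m], objs[n]) for m in objs if m != n)}
```

A static strategy picks one point from this set once; the adaptivity the summary claims amounts to re-selecting among non-dominated tiered-storage configurations as the serving workload shifts.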

Enhancing Computational Efficiency in Multiscale Systems Using Deep Learning of Coordinates and Flow Maps

This paper proposes a deep learning framework that jointly discovers optimal coordinates and flow maps to enable precise, computationally efficient time-stepping for multiscale systems, achieving state-of-the-art predictive accuracy with reduced costs on complex models like the Fitzhugh-Nagumo neuron and Kuramoto-Sivashinsky equations.

Asif Hamid, Danish Rafiq, Shahkar Ahmad Nahvi, Mohammad Abid Bazaz
Wed, 11 Ma (cs.LG)
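The paper uses deep networks for both pieces; a fully linear stand-in (SVD coordinates plus a DMD-style least-squares flow map) makes the joint "coordinates + flow map" structure concrete in a few lines, with all names assumed:

```python
import numpy as np

def fit_coords_and_flow(X, r):
    # X: (T, n) matrix of state snapshots. Discover r linear coordinates
    # via SVD (a linear stand-in for a learned encoder) and fit a linear
    # one-step flow map z_{t+1} = z_t @ M in those coordinates.
    U = np.linalg.svd(X, full_matrices=False)[2][:r].T   # (n, r) basis
    Z = X @ U                                            # latent trajectory
    M = np.linalg.lstsq(Z[:-1], Z[1:], rcond=None)[0]    # flow map
    def step(x):
        return ((x @ U) @ M) @ U.T                       # encode, advance, decode
    return step
```

For genuinely multiscale, nonlinear systems like Fitzhugh-Nagumo or Kuramoto-Sivashinsky, both maps must be nonlinear, which is what the deep learning framework supplies; the computational win is the same, though: time-stepping happens in the cheap discovered coordinates.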

A Survey on Decentralized Federated Learning

This survey systematically reviews decentralized federated learning methods from 2018 to early 2026, categorizing them into traditional distributed and blockchain-based architectures, proposing a unified challenge-driven taxonomy, and outlining future research directions to address security, privacy, and system-level trade-offs in coordinator-free settings.

Edoardo Gabrielli, Anthony Di Pietro, Dario Fenoglio, Giovanni Pica, Gabriele Tolomei
Wed, 11 Ma (cs.LG)