cs.DC papers | Gist.Science

Scalable and Performant Data Loading

This paper introduces SPDL, an open-source, framework-agnostic library that significantly accelerates GPU data loading by leveraging concurrent thread pool execution with GIL release, achieving up to 74% faster iteration and reduced resource usage compared to PyTorch DataLoader while demonstrating further performance gains with Free-Threaded Python.

Moto Hira, Christian Puhrsch, Valentin Andrei, Roman Malinovskyy, Gael Le Lan, Abhinandan Krishnan, Joseph Cummings, Victor Bourgin, Olga Gerasimova, Miguel Martin, Gokul Gunasekaran, Yuta Inoue, Alex J Turner, Raghuraman Krishnamoorthi | Wed, 11 Ma | 💻 cs

The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation

This paper proposes a Capability Coherence System (CCS) that maps memory consistency models to identity management, demonstrating through simulation that a Release Consistency-directed revocation strategy (RCC) achieves a constant bound on unauthorized operations independent of agent velocity, thereby outperforming traditional time-bounded approaches by orders of magnitude in high-speed agentic environments.

Vladyslav Parakhin | Wed, 11 Ma | 💻 cs

Ensuring Data Freshness in Multi-Rate Task Chains Scheduling

This paper proposes a task-based scheduling framework that ensures end-to-end data freshness in safety-critical multi-rate systems by introducing a Consensus Offset Search algorithm to align task releases with data lifespan constraints, thereby eliminating the artificial latency of Logical Execution Time and the inefficiency of redundant oversampling while preserving Global EDF schedulability.

José Luis Conradi Hoffmann, Antônio Augusto Fröhlich | Wed, 11 Ma | 💻 cs

Case Study: Performance Analysis of a Virtualized XRootD Frontend in Large-Scale WAN Transfers

This paper presents a case study demonstrating that a virtualized XRootD frontend architecture, utilizing a heterogeneous VM cluster with BBR congestion control and TCP extensions, sustained an aggregate throughput of 51.3 Gb/s, while peaking at 41.5 Gb/s on transfers to Fermilab, under high-intensity WAN transfer conditions.

J M da Silva, M A Costa, R L Iope | Wed, 11 Ma | 💻 cs

Flash-KMeans: Fast and Memory-Efficient Exact K-Means

This paper introduces Flash-KMeans, an IO-aware and contention-free GPU implementation that eliminates memory bottlenecks in the assignment stage and resolves atomic write contention in the update stage through novel kernel-level innovations, achieving up to 17.9$\times$ speedup over existing baselines and enabling $k$-means as a high-performance online primitive.

Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Xiaoze Fan, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Kurt Keutzer, Song Han, Chenfeng Xu, Ion Stoica | Wed, 11 Ma | 💻 cs

PIM-SHERPA: Software Method for On-device LLM Inference by Resolving PIM Memory Attribute and Layout Inconsistencies

This paper introduces PIM-SHERPA, a software-only method that resolves memory attribute and layout inconsistencies in product-level PIM-enabled systems to enable efficient on-device LLM inference, achieving significant memory capacity savings while maintaining near-theoretical performance.

Sunjung Lee, Sanghoon Cha, Hyeonsu Kim, Seungwoo Seo, Yuhwan Ro, Sukhan Lee, Byeongho Kim, Yongjun Park, Kyomin Sohn, Seungwon Lee, Jaehoon Yu | Wed, 11 Ma | 💻 cs

Hierarchical Observe-Orient-Decide-Act Enabled UAV Swarms in Uncertain Environments: Frameworks, Potentials, and Challenges

This paper proposes a hierarchical Observe-Orient-Decide-Act (H-OODA) framework that integrates cloud-edge-terminal layers and network function virtualization to enhance the adaptability, scalability, and decision-making efficiency of UAV swarms operating in uncertain environments.

Ziye Jia, Yao Wu, Qihui Wu, Lijun He, Qiuming Zhu, Fuhui Zhou, Zhu Han | Wed, 11 Ma | 💻 cs

Nezha: A Key-Value Separated Distributed Store with Optimized Raft Integration

Nezha is a distributed key-value store that resolves the I/O overhead caused by overlapping persistence operations in Raft-based systems by integrating key-value separation with an optimized persistence strategy and leveled garbage collection, thereby achieving significant throughput improvements while maintaining strong consistency.

Yangyang Wang, Yucong Dong, Ziqian Cheng, Zichen Xu | Wed, 11 Ma | 💻 cs

Rate-Distortion Bounds for Heterogeneous Random Fields on Finite Lattices

This paper establishes a finite-blocklength rate-distortion framework for heterogeneous random fields on finite lattices that explicitly incorporates tile-based processing constraints, providing non-asymptotic bounds and a second-order expansion to quantify the effects of spatial correlation, heterogeneity, and tile size on compression performance.

Sujata Sinha, Vishwas Rao, Robert Underwood, David Lenz, Sheng Di, Franck Cappello, Lingjia Liu | Wed, 11 Ma | 🔢 math

Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores

This paper presents the first direct programming of FP64 tensor cores on NVIDIA GPUs to accelerate high-order finite element simulations within the MFEM library, achieving up to 2× performance and 83% energy efficiency gains while demonstrating near-perfect weak scaling across nearly 10,000 GPUs on the Alps exascale system.

Jiqun Tu, Ian Karlin, John Camier, Veselin Dobrev, Tzanio Kolev, Stefan Henneking, Omar Ghattas | Wed, 11 Ma | 💻 cs

Lockbox -- A Zero Trust Architecture for Secure Processing of Sensitive Cloud Workloads

This paper presents Lockbox, a Zero Trust architecture that ensures the secure processing of sensitive cloud workloads by enforcing strict isolation, least-privilege access, and end-to-end encryption, thereby enabling enterprises to safely leverage advanced capabilities like AI without compromising their security posture.

Vamshi Krishna Thotempudi, Mahima Agarwal, Raghav Batta, Anjali Mangal | Wed, 11 Ma | 💻 cs

DeZent: Decentralized z-Anonymity with Privacy-Preserving Coordination

This paper introduces deZent, a decentralized implementation of z-anonymity that utilizes stochastic counting structures and secure sums to coordinate privacy-preserving data anonymization across sensor networks, achieving performance comparable to centralized approaches while significantly reducing communication overhead and minimizing trust in a central entity.

Carolin Brunn, Florian Tschorsch | Wed, 11 Ma | 💻 cs

Randomized Distributed Function Computation (RDFC): Ultra-Efficient Semantic Communication Applications to Privacy

This paper introduces the Randomized Distributed Function Computation (RDFC) framework, a semantic communication approach that achieves local differential privacy and significantly reduces transmission rates compared to lossless methods, even in scenarios without shared randomness, by leveraging strong coordination metrics and randomized function generation.

Onur Günlü | Wed, 11 Ma | ⚡ eess

Serving Compound Inference Systems on Datacenter GPUs

JigsawServe is a novel serving framework that optimizes latency, accuracy, and GPU resource costs for compound inference systems by jointly selecting model variants and spatially partitioning GPUs, achieving up to 11.3x higher throughput and significantly improved resource efficiency compared to prior work.

Sriram Devata, Rahul Singh, Sarita Adve | Wed, 11 Ma | 💻 cs

Extension of ACETONE C code generator for multi-core architectures

This paper proposes extending the ACETONE C code generator, originally limited to sequential code, to support multi-core architectures by formally defining a processor assignment problem and outlining a future implementation of parallel code generation, scheduling heuristics, and synchronization mechanisms.

Yanis Aït-Aïssa (IRIT-TRACES), Thomas Carle (IRIT-TRACES), Sergei Chichin, Benjamin Lesage, Claire Pagetti | Wed, 11 Ma | 💻 cs

Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service

This paper introduces Kareto, an adaptive multi-objective optimizer that efficiently navigates the complex configuration space of tiered KV cache storage to dynamically balance cost, throughput, and latency, significantly outperforming static strategies in LLM inference services.

Xianzhe Zheng, Zhengheng Wang, Ruiyan Ma, Rui Wang, Xiyu Wang, Rui Chen, Peng Zhang, Sicheng Pan, Zhangheng Huang, Chenxin Wu, Yi Zhang, Bo Cai, Kan Liu, Teng Ma, Yin Du, Dong Deng, Sai Wu, Guoyun Zhu, Wei Zhang, Feifei Li | Wed, 11 Ma | 💻 cs

RSH-SpMM: A Row-Structured Hybrid Kernel for Sparse Matrix-Matrix Multiplication on GPUs

The paper presents RSH-SpMM, a fine-grained row-structured hybrid kernel for GPU-based Sparse Matrix-Matrix Multiplication that utilizes adaptive row partitioning, RS-Tile representation, and load-balanced reordering to achieve 1.27x to 6.13x speedups over state-of-the-art methods by effectively handling extreme sparsity irregularity.

Aiying Li, Jingwei Sun, Han Li, Wence Ji, Guangzhong Sun | Wed, 11 Ma | 💻 cs

Enhancing Computational Efficiency in Multiscale Systems Using Deep Learning of Coordinates and Flow Maps

This paper proposes a deep learning framework that jointly discovers optimal coordinates and flow maps to enable precise, computationally efficient time-stepping for multiscale systems, achieving state-of-the-art predictive accuracy with reduced costs on complex models like the FitzHugh-Nagumo neuron and Kuramoto-Sivashinsky equations.

Asif Hamid, Danish Rafiq, Shahkar Ahmad Nahvi, Mohammad Abid Bazaz | Wed, 11 Ma | 🤖 cs.LG

A Survey on Decentralized Federated Learning

This survey systematically reviews decentralized federated learning methods from 2018 to early 2026, categorizing them into traditional distributed and blockchain-based architectures, proposing a unified challenge-driven taxonomy, and outlining future research directions to address security, privacy, and system-level trade-offs in coordinator-free settings.

Edoardo Gabrielli, Anthony Di Pietro, Dario Fenoglio, Giovanni Pica, Gabriele Tolomei | Wed, 11 Ma | 🤖 cs.LG

Multi-DNN Inference of Sparse Models on Edge SoCs

This paper introduces SparseLoom, a system that employs model stitching to recombine subgraphs from sparse models without re-training, thereby significantly improving throughput, reducing memory overhead, and lowering Service Level Objective violation rates for multi-DNN inference on edge SoCs compared to state-of-the-art systems.

Jiawei Luo, Di Wu, Simon Dobson, Blesson Varghese | Wed, 11 Ma | 🤖 cs.LG