Data Augmentation and Convolutional Network Architecture Influence on Distributed Learning

This paper investigates how convolutional neural network architectures and data augmentation strategies impact model accuracy and computational efficiency within distributed learning environments, aiming to provide insights for optimizing CNN deployment in resource-intensive scenarios.

Victor Forattini Jansen, Emanuel Teixeira Martins, Yasmin Souza Lima, Flavio de Oliveira Silva, Rodrigo Moreira, Larissa Ferreira Rodrigues Moreira · Thu, 12 Ma · cs

Aceso: Carbon-Aware and Cost-Effective Microservice Placement for Small and Medium-sized Enterprises

Aceso is an adaptive placement system for small and medium-sized enterprises that dynamically schedules microservices across geographically constrained regions, reducing carbon emissions and operational costs while meeting latency requirements. It specifically addresses the limitations of existing solutions that assume access to global-scale infrastructure.

Georgia Christofidi, Francisco Álvarez-Terribas, Ioannis Roumpos, Nicolas Kourtellis, Jesus Omaña Iglesias, Thaleia Dimitra Doudali · Thu, 12 Ma · cs
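The core trade-off Aceso navigates can be illustrated with a toy region-selection rule: among candidate regions that satisfy a latency bound, pick the one minimizing a weighted combination of carbon intensity and price. This is a hypothetical sketch, not the paper's algorithm; all region names, numbers, and the weighting scheme are invented (a real system would normalize carbon and cost onto a common scale before weighting).

```python
# Hypothetical sketch of carbon/cost-aware placement (not Aceso's code).
# regions: list of (name, carbon_g_per_kwh, price_per_hour, latency_ms).

def place(regions, latency_bound_ms, alpha=0.5):
    """Pick the feasible region minimizing alpha*carbon + (1-alpha)*price.
    Returns None when no region meets the latency bound."""
    feasible = [r for r in regions if r[3] <= latency_bound_ms]
    if not feasible:
        return None
    return min(feasible, key=lambda r: alpha * r[1] + (1 - alpha) * r[2])

regions = [
    ("eu-a", 120, 0.09, 40),  # moderate carbon, mid price, ok latency
    ("eu-b", 300, 0.05, 25),  # cheap but carbon-heavy
    ("eu-c", 80, 0.12, 90),   # greenest, but too far away
]
print(place(regions, 60))  # ("eu-a", 120, 0.09, 40)
```

With a 60 ms bound, "eu-c" is excluded despite the lowest carbon intensity, and the weighted score favors "eu-a" over the carbon-heavy "eu-b".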

COHORT: Hybrid RL for Collaborative Large DNN Inference on Multi-Robot Systems Under Real-Time Constraints

This paper presents COHORT, a ROS-based collaborative framework for multi-robot systems that leverages a hybrid offline-online reinforcement learning strategy to dynamically distribute large DNN inference tasks, achieving significant improvements in battery efficiency, GPU utilization, and deadline compliance under real-time constraints.

Mohammad Saeid Anwar, Anuradha Ravi, Indrajeet Ghosh, Gaurav Shinde, Carl Busart, Nirmalya Roy · Thu, 12 Ma · cs

S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance

This paper introduces S-HPLB, a novel attention deployment strategy that leverages the heterogeneous yet stable sparsity elasticities of LLM attention heads to dynamically balance sparsity budgets across GPUs, thereby eliminating cross-GPU resource bubbles and achieving a 2.88x improvement in attention computation latency without compromising inference quality.

Di Liu, Yifei Liu, Chen Chen, Zhibin Yu, Xiaoyi Fan, Quan Chen, Minyi Guo · Thu, 12 Ma · cs
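The load-balancing idea behind head parallelism can be sketched as a scheduling problem: when attention heads have heterogeneous but stable sparsity, assigning heads to GPUs by their effective (sparse) cost rather than head count equalizes per-GPU work. The sketch below uses a greedy longest-processing-time heuristic; it is an illustration of the general principle, not S-HPLB itself, and the head densities are invented.

```python
# Illustrative sketch (not the paper's method): balance attention heads
# across GPUs by observed density (fraction of KV entries attended,
# 1.0 = fully dense), so each GPU carries similar effective compute.

def balance_heads(head_density, num_gpus):
    """Greedy LPT assignment: place each head, heaviest first, on the
    currently least-loaded GPU. Returns (assignment, per-GPU loads)."""
    loads = [0.0] * num_gpus
    assignment = {g: [] for g in range(num_gpus)}
    for head, density in sorted(head_density.items(), key=lambda kv: -kv[1]):
        g = min(range(num_gpus), key=lambda i: loads[i])
        loads[g] += density
        assignment[g].append(head)
    return assignment, loads

heads = {0: 0.9, 1: 0.2, 2: 0.7, 3: 0.4, 4: 0.6, 5: 0.3, 6: 0.8, 7: 0.1}
assignment, loads = balance_heads(heads, 2)
print(loads)  # roughly equal per-GPU load
```

A naive split by head count (four heads per GPU, in index order) could leave one GPU with most of the dense heads; balancing by density removes that cross-GPU bubble.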

Communication-Efficient Multimodal Federated Learning: Joint Modality and Client Selection

This paper proposes MFedMC, a communication-efficient multimodal federated learning framework that employs a decoupled architecture and a joint modality-client selection strategy to address data heterogeneity and bandwidth constraints, achieving comparable accuracy to baselines while reducing communication overhead by over 20 times.

Liangqi Yuan, Dong-Jun Han, Su Wang, Devesh Upadhyay, Christopher G. Brinton · Thu, 12 Ma · cs.LG
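Joint modality-client selection can be viewed as picking which (client, modality) uploads to schedule under a communication budget. The sketch below uses a greedy utility-per-byte heuristic as a stand-in; it is a hypothetical illustration, not MFedMC's selection strategy, and the candidate utilities and costs are invented.

```python
# Hypothetical sketch of joint modality-client selection under a
# communication budget (not the MFedMC algorithm).

def select_uploads(candidates, budget_mb):
    """candidates: list of (client, modality, utility, cost_mb).
    Greedily choose uploads with the highest utility per MB until the
    budget is exhausted. Returns (chosen pairs, MB spent)."""
    chosen, spent = [], 0.0
    for client, mod, util, cost in sorted(
            candidates, key=lambda c: c[2] / c[3], reverse=True):
        if spent + cost <= budget_mb:
            chosen.append((client, mod))
            spent += cost
    return chosen, spent

candidates = [
    ("c1", "audio", 8.0, 4.0),    # cheap, informative
    ("c1", "video", 10.0, 20.0),  # informative but bandwidth-heavy
    ("c2", "imu", 3.0, 1.0),      # tiny payload
    ("c3", "video", 9.0, 18.0),
]
chosen, spent = select_uploads(candidates, budget_mb=10)
print(chosen)  # [('c2', 'imu'), ('c1', 'audio')]
```

Under a 10 MB budget the heavy video modalities are dropped entirely, which is the intuition behind selecting modalities and clients jointly rather than always uploading every client's full multimodal payload.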

CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems

CacheSolidarity is a lightweight system that secures multi-tenant LLM serving against Automatic Prefix Caching side-channel attacks by selectively isolating suspicious cache reuse, thereby achieving significantly higher cache efficiency and lower latency compared to existing all-or-nothing isolation defenses.

Panagiotis Georgios Pennas, Konstantinos Papaioannou, Marco Guarnieri, Thaleia Dimitra Doudali · Thu, 12 Ma · cs.LG
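The "selective isolation" idea contrasts with all-or-nothing defenses: rather than disabling cross-tenant prefix reuse globally, only prefixes flagged as suspicious get per-tenant cache keys. The toy class below illustrates that keying scheme; it is a hypothetical sketch, not CacheSolidarity's implementation, and the detection step that flags a prefix is left out.

```python
# Toy sketch of selective prefix-cache isolation (not the paper's code).
# Benign prefixes share one global cache entry; prefixes flagged as
# suspicious are keyed per-tenant, blocking cross-tenant reuse.

import hashlib

class SelectiveCache:
    def __init__(self):
        self.cache = {}          # key -> cached KV state
        self.suspicious = set()  # prefix hashes placed under isolation

    def _key(self, tenant, prefix):
        h = hashlib.sha256(prefix.encode()).hexdigest()
        return (tenant, h) if h in self.suspicious else ("shared", h)

    def lookup(self, tenant, prefix):
        return self._key(tenant, prefix) in self.cache

    def insert(self, tenant, prefix, kv_state):
        self.cache[self._key(tenant, prefix)] = kv_state

    def flag(self, prefix):
        # Called when probing behaviour is detected on this prefix;
        # subsequent inserts/lookups become tenant-scoped.
        self.suspicious.add(hashlib.sha256(prefix.encode()).hexdigest())
```

Benign traffic keeps the full cache-hit benefit across tenants, while a flagged prefix can no longer act as a timing oracle for other tenants' prompts, which is why this beats full isolation on both hit rate and latency.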

Multi-GPU Quantum Circuit Simulation and the Impact of Network Performance

This paper introduces MPI into the QED-C benchmarks to evaluate multi-GPU quantum circuit simulations, demonstrating that while GPU architecture improvements yield significant speedups, advancements in interconnect technology provide even greater performance gains, with the new NVIDIA Grace Blackwell NVL72 architecture delivering over 16x faster time-to-solution.

W. Michael Brown, Anurag Ramesh, Thomas Lubinski, Thien Nguyen, David E. Bernal Neira · Thu, 12 Ma · quant-ph
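Why the interconnect dominates is easy to see from how distributed statevector simulation partitions work: with the 2^n amplitudes split evenly across devices, a gate on a "global" qubit pairs amplitudes that live on different devices, forcing a bulk statevector exchange over the network. The helper below captures just that index arithmetic; it is a generic illustration of distributed statevector layouts, not the benchmark's code.

```python
# Sketch: in a distributed statevector simulation of n qubits over
# 2^k devices, the top k qubit indices are "global" -- a single-qubit
# gate on them pairs amplitudes held by different devices, so applying
# it requires exchanging half of each device's local statevector.

import math

def needs_exchange(target_qubit, num_qubits, num_devices):
    """True when a gate on target_qubit crosses device boundaries.
    Assumes num_devices is a power of two and amplitudes are laid out
    contiguously by index."""
    local_qubits = num_qubits - int(math.log2(num_devices))
    return target_qubit >= local_qubits

# 30 qubits over 4 GPUs: the top 2 qubits (indices 28, 29) are global.
print(needs_exchange(29, 30, 4))  # True  -> interconnect-bound
print(needs_exchange(10, 30, 4))  # False -> purely local work
```

Each such exchange moves gigabytes for circuits of interesting size, so faster NVLink-class interconnects translate directly into time-to-solution, consistent with the paper's observation that interconnect gains outpace raw GPU gains.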

Pooling Engram Conditional Memory in Large Language Models using CXL

This paper proposes a scalable and cost-efficient solution for Large Language Models by integrating Compute Express Link (CXL) memory pools into SGLang to store Engram conditional memory, achieving near-DRAM end-to-end performance while overcoming the latency limitations of traditional RDMA approaches.

Ruiyang Ma, Teng Ma, Zhiyuan Su, Hantian Zha, Xinpeng Zhao, Xuchun Shang, Xingrui Yi, Zheng Liu, Zhu Cao, An Wu, Zhichong Dou, Ziqian Liu, Daikang Kuang, Guojie Luo · Thu, 12 Ma · cs

Reference Architecture of a Quantum-Centric Supercomputer

This paper presents a reference architecture and roadmap for Quantum-Centric Supercomputing (QCSC) systems that integrate quantum, GPU, and CPU resources to overcome current isolation challenges and enable seamless, high-performance hybrid workflows across three evolutionary phases.

Seetharami Seelam, Jerry M. Chow, Antonio Córcoles, Sarah Sheldon, Tushar Mittal, Abhinav Kandala, Sean Dague, Ian Hincks, Hiroshi Horii, Blake Johnson, Michael Le, Hani Jamjoom, Jay M. Gambetta · Thu, 12 Ma · eess

Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study

This paper presents a comprehensive benchmark of production LLM inference on AMD Instinct MI325X GPUs, demonstrating that architecture-aware optimizations—specifically the selective use of the AITER runtime and specific KV cache configurations—are critical for maximizing throughput across diverse model families while maintaining high reliability under heavy concurrency.

Athos Georgiou · Thu, 12 Ma · cs.AI