Data Augmentation and Convolutional Network Architecture Influence on Distributed Learning

This paper investigates how convolutional neural network architectures and data augmentation strategies impact model accuracy and computational efficiency within distributed learning environments, aiming to provide insights for optimizing CNN deployment in resource-intensive scenarios.

Victor Forattini Jansen, Emanuel Teixeira Martins, Yasmin Souza Lima, Flavio de Oliveira Silva, Rodrigo Moreira, Larissa Ferreira Rodrigues Moreira · Thu, 12 Ma · cs

Aceso: Carbon-Aware and Cost-Effective Microservice Placement for Small and Medium-sized Enterprises

Aceso is an adaptive placement system for small and medium-sized enterprises that dynamically schedules microservices across geographically constrained regions, reducing carbon emissions and operational costs while meeting latency requirements. It specifically addresses the limitations of existing solutions that assume access to global-scale infrastructure.

Georgia Christofidi, Francisco Álvarez-Terribas, Ioannis Roumpos, Nicolas Kourtellis, Jesus Omaña Iglesias, Thaleia Dimitra Doudali · Thu, 12 Ma · cs
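The core trade-off Aceso navigates can be illustrated with a toy region-selection rule: among candidate regions that satisfy a latency bound, pick the one minimizing a weighted combination of carbon intensity and price. This is a hypothetical sketch, not the paper's algorithm; all region names, numbers, and the weighting scheme are invented (a real system would normalize carbon and cost onto a common scale before weighting).

```python
# Hypothetical sketch of carbon/cost-aware placement (not Aceso's code).
# regions: list of (name, carbon_g_per_kwh, price_per_hour, latency_ms).

def place(regions, latency_bound_ms, alpha=0.5):
    """Pick the feasible region minimizing alpha*carbon + (1-alpha)*price.
    Returns None when no region meets the latency bound."""
    feasible = [r for r in regions if r[3] <= latency_bound_ms]
    if not feasible:
        return None
    return min(feasible, key=lambda r: alpha * r[1] + (1 - alpha) * r[2])

regions = [
    ("eu-a", 120, 0.09, 40),  # moderate carbon, mid price, ok latency
    ("eu-b", 300, 0.05, 25),  # cheap but carbon-heavy
    ("eu-c", 80, 0.12, 90),   # greenest, but too far away
]
print(place(regions, 60))  # ("eu-a", 120, 0.09, 40)
```

With a 60 ms bound, "eu-c" is excluded despite the lowest carbon intensity, and the weighted score favors "eu-a" over the carbon-heavy "eu-b".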

COHORT: Hybrid RL for Collaborative Large DNN Inference on Multi-Robot Systems Under Real-Time Constraints

This paper presents COHORT, a ROS-based collaborative framework for multi-robot systems that leverages a hybrid offline-online reinforcement learning strategy to dynamically distribute large DNN inference tasks, achieving significant improvements in battery efficiency, GPU utilization, and deadline compliance under real-time constraints.

Mohammad Saeid Anwar, Anuradha Ravi, Indrajeet Ghosh, Gaurav Shinde, Carl Busart, Nirmalya Roy · Thu, 12 Ma · cs

S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance

This paper introduces S-HPLB, a novel attention deployment strategy that leverages the heterogeneous yet stable sparsity elasticities of LLM attention heads to dynamically balance sparsity budgets across GPUs, thereby eliminating cross-GPU resource bubbles and achieving a 2.88x improvement in attention computation latency without compromising inference quality.

Di Liu, Yifei Liu, Chen Chen, Zhibin Yu, Xiaoyi Fan, Quan Chen, Minyi Guo · Thu, 12 Ma · cs
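The load-balancing idea behind head parallelism can be sketched as a scheduling problem: when attention heads have heterogeneous but stable sparsity, assigning heads to GPUs by their effective (sparse) cost rather than head count equalizes per-GPU work. The sketch below uses a greedy longest-processing-time heuristic; it is an illustration of the general principle, not S-HPLB itself, and the head densities are invented.

```python
# Illustrative sketch (not the paper's method): balance attention heads
# across GPUs by observed density (fraction of KV entries attended,
# 1.0 = fully dense), so each GPU carries similar effective compute.

def balance_heads(head_density, num_gpus):
    """Greedy LPT assignment: place each head, heaviest first, on the
    currently least-loaded GPU. Returns (assignment, per-GPU loads)."""
    loads = [0.0] * num_gpus
    assignment = {g: [] for g in range(num_gpus)}
    for head, density in sorted(head_density.items(), key=lambda kv: -kv[1]):
        g = min(range(num_gpus), key=lambda i: loads[i])
        loads[g] += density
        assignment[g].append(head)
    return assignment, loads

heads = {0: 0.9, 1: 0.2, 2: 0.7, 3: 0.4, 4: 0.6, 5: 0.3, 6: 0.8, 7: 0.1}
assignment, loads = balance_heads(heads, 2)
print(loads)  # roughly equal per-GPU load
```

A naive split by head count (four heads per GPU, in index order) could leave one GPU with most of the dense heads; balancing by density removes that cross-GPU bubble.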

Communication-Efficient Multimodal Federated Learning: Joint Modality and Client Selection

This paper proposes MFedMC, a communication-efficient multimodal federated learning framework that employs a decoupled architecture and a joint modality-client selection strategy to address data heterogeneity and bandwidth constraints, achieving comparable accuracy to baselines while reducing communication overhead by over 20 times.

Liangqi Yuan, Dong-Jun Han, Su Wang, Devesh Upadhyay, Christopher G. Brinton · Thu, 12 Ma · cs.LG
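Joint modality-client selection can be viewed as picking which (client, modality) uploads to schedule under a communication budget. The sketch below uses a greedy utility-per-byte heuristic as a stand-in; it is a hypothetical illustration, not MFedMC's selection strategy, and the candidate utilities and costs are invented.

```python
# Hypothetical sketch of joint modality-client selection under a
# communication budget (not the MFedMC algorithm).

def select_uploads(candidates, budget_mb):
    """candidates: list of (client, modality, utility, cost_mb).
    Greedily choose uploads with the highest utility per MB until the
    budget is exhausted. Returns (chosen pairs, MB spent)."""
    chosen, spent = [], 0.0
    for client, mod, util, cost in sorted(
            candidates, key=lambda c: c[2] / c[3], reverse=True):
        if spent + cost <= budget_mb:
            chosen.append((client, mod))
            spent += cost
    return chosen, spent

candidates = [
    ("c1", "audio", 8.0, 4.0),    # cheap, informative
    ("c1", "video", 10.0, 20.0),  # informative but bandwidth-heavy
    ("c2", "imu", 3.0, 1.0),      # tiny payload
    ("c3", "video", 9.0, 18.0),
]
chosen, spent = select_uploads(candidates, budget_mb=10)
print(chosen)  # [('c2', 'imu'), ('c1', 'audio')]
```

Under a 10 MB budget the heavy video modalities are dropped entirely, which is the intuition behind selecting modalities and clients jointly rather than always uploading every client's full multimodal payload.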

CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems

CacheSolidarity is a lightweight system that secures multi-tenant LLM serving against Automatic Prefix Caching side-channel attacks by selectively isolating suspicious cache reuse, thereby achieving significantly higher cache efficiency and lower latency compared to existing all-or-nothing isolation defenses.

Panagiotis Georgios Pennas, Konstantinos Papaioannou, Marco Guarnieri, Thaleia Dimitra Doudali · Thu, 12 Ma · cs.LG
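The "selective isolation" idea contrasts with all-or-nothing defenses: rather than disabling cross-tenant prefix reuse globally, only prefixes flagged as suspicious get per-tenant cache keys. The toy class below illustrates that keying scheme; it is a hypothetical sketch, not CacheSolidarity's implementation, and the detection step that flags a prefix is left out.

```python
# Toy sketch of selective prefix-cache isolation (not the paper's code).
# Benign prefixes share one global cache entry; prefixes flagged as
# suspicious are keyed per-tenant, blocking cross-tenant reuse.

import hashlib

class SelectiveCache:
    def __init__(self):
        self.cache = {}          # key -> cached KV state
        self.suspicious = set()  # prefix hashes placed under isolation

    def _key(self, tenant, prefix):
        h = hashlib.sha256(prefix.encode()).hexdigest()
        return (tenant, h) if h in self.suspicious else ("shared", h)

    def lookup(self, tenant, prefix):
        return self._key(tenant, prefix) in self.cache

    def insert(self, tenant, prefix, kv_state):
        self.cache[self._key(tenant, prefix)] = kv_state

    def flag(self, prefix):
        # Called when probing behaviour is detected on this prefix;
        # subsequent inserts/lookups become tenant-scoped.
        self.suspicious.add(hashlib.sha256(prefix.encode()).hexdigest())
```

Benign traffic keeps the full cache-hit benefit across tenants, while a flagged prefix can no longer act as a timing oracle for other tenants' prompts, which is why this beats full isolation on both hit rate and latency.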

Multi-GPU Quantum Circuit Simulation and the Impact of Network Performance

This paper introduces MPI into the QED-C benchmarks to evaluate multi-GPU quantum circuit simulations, demonstrating that while GPU architecture improvements yield significant speedups, advancements in interconnect technology provide even greater performance gains, with the new NVIDIA Grace Blackwell NVL72 architecture delivering over 16x faster time-to-solution.

W. Michael Brown, Anurag Ramesh, Thomas Lubinski, Thien Nguyen, David E. Bernal Neira · Thu, 12 Ma · quant-ph
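Why the interconnect dominates is easy to see from how distributed statevector simulation partitions work: with the 2^n amplitudes split evenly across devices, a gate on a "global" qubit pairs amplitudes that live on different devices, forcing a bulk statevector exchange over the network. The helper below captures just that index arithmetic; it is a generic illustration of distributed statevector layouts, not the benchmark's code.

```python
# Sketch: in a distributed statevector simulation of n qubits over
# 2^k devices, the top k qubit indices are "global" -- a single-qubit
# gate on them pairs amplitudes held by different devices, so applying
# it requires exchanging half of each device's local statevector.

import math

def needs_exchange(target_qubit, num_qubits, num_devices):
    """True when a gate on target_qubit crosses device boundaries.
    Assumes num_devices is a power of two and amplitudes are laid out
    contiguously by index."""
    local_qubits = num_qubits - int(math.log2(num_devices))
    return target_qubit >= local_qubits

# 30 qubits over 4 GPUs: the top 2 qubits (indices 28, 29) are global.
print(needs_exchange(29, 30, 4))  # True  -> interconnect-bound
print(needs_exchange(10, 30, 4))  # False -> purely local work
```

Each such exchange moves gigabytes for circuits of interesting size, so faster NVLink-class interconnects translate directly into time-to-solution, consistent with the paper's observation that interconnect gains outpace raw GPU gains.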

Pooling Engram Conditional Memory in Large Language Models using CXL

This paper proposes a scalable and cost-efficient solution for Large Language Models by integrating Compute Express Link (CXL) memory pools into SGLang to store Engram conditional memory, achieving near-DRAM end-to-end performance while overcoming the latency limitations of traditional RDMA approaches.

Ruiyang Ma, Teng Ma, Zhiyuan Su, Hantian Zha, Xinpeng Zhao, Xuchun Shang, Xingrui Yi, Zheng Liu, Zhu Cao, An Wu, Zhichong Dou, Ziqian Liu, Daikang Kuang, Guojie Luo · Thu, 12 Ma · cs

Reference Architecture of a Quantum-Centric Supercomputer

This paper presents a reference architecture and roadmap for Quantum-Centric Supercomputing (QCSC) systems that integrate quantum, GPU, and CPU resources to overcome current isolation challenges and enable seamless, high-performance hybrid workflows across three evolutionary phases.

Seetharami Seelam, Jerry M. Chow, Antonio Córcoles, Sarah Sheldon, Tushar Mittal, Abhinav Kandala, Sean Dague, Ian Hincks, Hiroshi Horii, Blake Johnson, Michael Le, Hani Jamjoom, Jay M. Gambetta · Thu, 12 Ma · eess

Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study

This paper presents a comprehensive benchmark of production LLM inference on AMD Instinct MI325X GPUs, demonstrating that architecture-aware optimizations—specifically the selective use of the AITER runtime and specific KV cache configurations—are critical for maximizing throughput across diverse model families while maintaining high reliability under heavy concurrency.

Athos Georgiou · Thu, 12 Ma · cs.AI