cs.DC papers | Gist.Science

Structured Gossip: A Partition-Resilient DNS for Internet-Scale Dynamic Networks

This paper introduces Structured Gossip DNS, a partition-resilient name resolution system for large-scale dynamic networks that leverages DHT finger tables and passive stabilization to achieve eventual consistency with reduced message complexity and without requiring global coordination.

Priyanka Sinha, Dilys Thomas | Tue, 10 Ma | 💻 cs

Performance Evaluation of Automated Multi-Service Deployment in Edge-Cloud Environments with the CODECO Toolkit

This paper evaluates the open-source CODECO toolkit, demonstrating that it significantly reduces manual intervention and maintains competitive performance compared to baseline Kubernetes workflows for automating multi-service deployments across heterogeneous Edge-Cloud environments.

Georgios Koukis, Ioannis Dermentzis, Vassilis Tsaoussidis, Jan Lenke, Fabian Wolk, Daniel Uceda, Guillermo Sanchez, Miguel A. Puentes, Javier Serrano, Panagiotis Karamolegkos, Rute C. Sofia | Tue, 10 Ma | 💻 cs

Agentic AI-Driven UAV Network Deployment: A LLM-Enhanced Exact Potential Game Approach

This paper proposes a dual spatial-scale UAV network optimization framework that combines exact potential game algorithms for link configuration and deployment with a large language model to dynamically generate utility weights, thereby enhancing adaptability and performance in terms of energy efficiency, latency, and throughput.

Xin Tang, Qian Chen, Binhan Liao, Yaqi Zhang, Jianxin Chen, Changyuan Zhao, Junchuan Fan, Junxi Tian, Xiaohuan Li | Tue, 10 Ma | 💻 cs

Link Wars: The Semantic Crisis. Is the debate over or is it just beginning?

This paper argues that the current fragmentation in high-performance interconnects stems from a fundamental "semantic crisis" caused by implicit, vendor-specific assumptions about time and ordering, and proposes that adopting explicit, testable link semantics through the Open Atomic Ethernet (OAE) standard is essential to restore correctness and enable convergence.

Paul Borrill | Tue, 10 Ma | 💻 cs

Uber's Failover Architecture: Reconciling Reliability and Efficiency in Hyperscale Microservice Infrastructure

Uber's Failover Architecture (UFA) replaces its costly uniform 2x capacity model with a differentiated, criticality-based approach that opportunistically shares resources and preempts non-critical services during peak failovers, thereby reducing steady-state provisioning from 2x to 1.3x and eliminating over one million CPU cores while maintaining 99.97% availability.

Mayank Bansal, Milind Chabbi, Kenneth Bogh, Srikanth Prodduturi, Kevin Xu, Amit Kumar, David Bell, Ranjib Dey, Yufei Ren, Sachin Sharma, Juan Marcano, Shriniket Kale, Subhav Pradhan, Ivan Beschastnikh, Miguel Covarrubias, Chien-Chih Liao, Sandeep Koushik Sheshadri, Wen Luo, Kai Song, Ashish Samant, Sahil Rihan, Nimish Sheth, Uday Kiran Medisetty | Tue, 10 Ma | 💻 cs

AIReSim: A Discrete Event Simulator for Large-scale AI Cluster Reliability Modeling

The paper introduces AIReSim, a discrete event simulator designed to help system designers evaluate and tune reliability mechanisms, prioritize improvements, and plan capacity for large-scale AI clusters by modeling the complex tradeoffs involved in failure, recovery, scheduling, and repair processes.

Karthik Pattabiraman, Mihir Patel, Fred Lin | Tue, 10 Ma | 💻 cs

Configurable Runtime Orchestration for Dynamic Data Retrieval in Distributed Systems

This paper introduces a configuration-driven runtime orchestration framework that dynamically generates execution graphs from configuration at request time, enabling low-latency, dependency-aware parallel data retrieval across distributed microservices and APIs without requiring workflow code redeployment.

Abhiram Kandiraju | Tue, 10 Ma | 💻 cs

A Lock-Free, Fully GPU-Resident Architecture for the Verification of Goldbach's Conjecture

This paper presents a fully device-resident, multi-GPU architecture that achieves near-zero host-device communication and 99.7% parallel efficiency through lock-free work-stealing and optimized shared-memory tiling, enabling the verification of Goldbach's conjecture up to $10^{13}$ in just 133.5 seconds on a four-GPU system with a 45.6× speedup over previous methods.

Isaac Llorente-Saguer | Tue, 10 Ma | 🔢 math

Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

The paper presents NANOMIND, a hardware-software co-design framework that decomposes Large Multimodal Models into modular components and dynamically schedules them across heterogeneous accelerators on unified-memory SoCs, enabling a battery-powered device to run LMMs entirely on-device with significantly improved energy efficiency and throughput.

Yilong Li, Shuai Zhang, Yijing Zeng, Hao Zhang, Xinmiao Xiong, Jingyu Liu, Pan Hu, Suman Banerjee | Tue, 10 Ma | 💬 cs.CL

ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs

ArcLight is a lightweight LLM inference architecture designed specifically for many-core CPUs that overcomes cross-NUMA memory access bottlenecks through efficient memory management, thread scheduling, and controlled tensor parallelism, achieving up to 46% higher throughput than mainstream frameworks while maintaining broad device compatibility.

Yuzhuang Xu, Xu Han, Yuxuan Li, Wanxiang Che | Tue, 10 Ma | 💬 cs.CL

EROICA: Online Performance Troubleshooting for Large-scale Model Training

This paper presents EROICA, the first online troubleshooting system deployed on production-scale GPU clusters (~100,000 GPUs) that effectively diagnoses complex hardware and software performance issues in large-scale model training through fine-grained profiling and differential observability with minimal impact.

Yu Guan, Zhiyu Yin, Haoyu Chen, Sheng Cheng, Chaojie Yang, Kun Qian, Tianyin Xu, Pengcheng Zhang, Yang Zhang, Hanyu Zhao, Yong Li, Wei Lin, Dennis Cai, Ennan Zhai | Tue, 10 Ma | 🤖 cs.LG

General Coded Computing in a Probabilistic Straggler Regime

This paper theoretically demonstrates that in distributed computing systems with probabilistic stragglers, the approximation errors of Berrut Approximate Coded Computing (BACC) and Learning Theoretic Coded Computing (LeTCC) schemes converge to zero at specific rates despite the average number of stragglers scaling with the total server count, a finding validated through experiments on various functions including deep neural networks.

Parsa Moradi, Mohammad Ali Maddah-Ali | Tue, 10 Ma | 🤖 cs.LG

Co-LoRA: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients

This paper introduces Co-LoRA, a collaborative personalization framework that addresses both data and model heterogeneity through a task-relevance-aware aggregation strategy and a dimension-invariant module, validated by a new multi-modal benchmark and superior performance over state-of-the-art methods.

Minhyuk Seo, Taeheon Kim, Hankook Lee, Jonghyun Choi, Tinne Tuytelaars | Tue, 10 Ma | 🤖 cs.LG

Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet

The paper introduces Covenant-72B, a 72-billion-parameter language model successfully pre-trained on 1.1 trillion tokens through the largest permissionless, globally distributed collaboration to date, demonstrating that open, blockchain-supported participation can achieve performance competitive with centralized training at unprecedented scale.

Joel Lidin, Amir Sarfi, Erfan Miahi, Quentin Anthony, Shivam Chauhan, Evangelos Pappas, Benjamin Thérien, Eugene Belilovsky, Samuel Dare | Tue, 10 Ma | 🤖 cs.LG

Scalable Training of Mixture-of-Experts Models with Megatron Core

This paper presents Megatron Core, a scalable and production-ready open-source framework that addresses the coupled memory, communication, and computation challenges of Mixture-of-Experts (MoE) training through integrated system-level optimizations, enabling high-performance training of models ranging from billions to trillions of parameters on large-scale GPU clusters.

Zijie Yan (NVIDIA), Hongxiao Bai (NVIDIA), Xin Yao (NVIDIA), Dennis Liu (NVIDIA), Tong Liu (NVIDIA), Hongbin Liu (NVIDIA), Pingtian Li (NVIDIA), Evan Wu (NVIDIA), Shiqing Fan (NVIDIA), Li Tao (NVIDIA), Robin Zhang (NVIDIA), Yuzhong Wang (NVIDIA), Shifang Xu (NVIDIA), Jack Chang (NVIDIA), Xuwen Chen (NVIDIA), Kunlun Li (NVIDIA), Yan Bai (NVIDIA), Gao Deng (NVIDIA), Nan Zheng (NVIDIA), Vijay Anand Korthikanti (NVIDIA), Abhinav Khattar (NVIDIA), Ethan He (NVIDIA), Soham Govande (NVIDIA), Sangkug Lym (NVIDIA), Zhongbo Zhu (NVIDIA), Qi Zhang (NVIDIA), Haochen Yuan (NVIDIA), Xiaowei Ren (NVIDIA), Deyu Fu (NVIDIA), Tailai Ma (NVIDIA), Shunkang Zhang (NVIDIA), Jiang Shao (NVIDIA), Ray Wang (NVIDIA), Santosh Bhavani (NVIDIA), Xipeng Li (NVIDIA), Chandler Zhou (NVIDIA), David Wu (NVIDIA), Yingcan Wei (NVIDIA), Ashwath Aithal (NVIDIA), Michael Andersch (NVIDIA), Mohammad Shoeybi (NVIDIA), Jiajie Yao (NVIDIA), June Yang (NVIDIA) | Tue, 10 Ma | 🤖 cs.LG

Mitigating the Memory Bottleneck with Machine Learning-Driven and Data-Aware Microarchitectural Techniques

This dissertation addresses the memory bottleneck in modern computing by advocating a shift from data-agnostic to data-informed microarchitectural designs, proposing four machine learning-driven and data-aware mechanisms that significantly enhance performance and energy efficiency.

Rahul Bera | Tue, 10 Ma | 🤖 cs.LG

MAS-H2: A Hierarchical Multi-Agent System for Holistic Cloud-Native Autoscaling

This paper introduces MAS-H2, a hierarchical multi-agent system for Kubernetes that bridges the gap between business policies and resource provisioning through strategic, planning, and execution agents, demonstrating significant reductions in CPU stress and peak load while enabling zero-downtime infrastructure migrations compared to native autoscalers.

Hamed Hamzeh, Parisa Vahdatian | Tue, 10 Ma | 🤖 cs.LG

TA-RNN-Medical-Hybrid: A Time-Aware and Interpretable Framework for Mortality Risk Prediction

The paper proposes TA-RNN-Medical-Hybrid, a time-aware and interpretable deep learning framework that integrates continuous-time encoding, SNOMED-based disease representations, and a hierarchical dual-level attention mechanism to accurately predict ICU mortality risk while providing clinically meaningful explanations.

Zahra Jafari, Azadeh Zamanifar, Amirfarhad Farhadi | Tue, 10 Ma | 🤖 cs.LG

NEST: Network- and Memory-Aware Device Placement For Distributed Deep Learning

NEST is a novel device placement framework that unifies network, compute, and memory awareness through structured dynamic programming to jointly optimize hybrid parallelism strategies, achieving up to 2.43 times higher throughput and improved scalability compared to state-of-the-art baselines.

Irene Wang, Vishnu Varma Venkata, Arvind Krishnamurthy, Divya Mahajan | Tue, 10 Ma | 🤖 cs.LG

The Need for Quantitative Resilience Models and Metrics in Classical-Quantum Computing Systems

This paper argues that resilience must be established as a fundamental design constraint in integrated classical-quantum computing systems by developing new quantitative models and metrics, drawing inspiration from civil engineering to accurately assess the cost-benefit ratios of system improvements and their cascading impacts on end-user value.

Santiago Núñez-Corrales | Tue, 10 Ma | ⚛️ quant-ph
