Performance Evaluation of Automated Multi-Service Deployment in Edge-Cloud Environments with the CODECO Toolkit

This paper evaluates the open-source CODECO toolkit for automating multi-service deployments across heterogeneous Edge-Cloud environments, demonstrating that it significantly reduces manual intervention while maintaining performance competitive with baseline Kubernetes workflows.

Georgios Koukis, Ioannis Dermentzis, Vassilis Tsaoussidis, Jan Lenke, Fabian Wolk, Daniel Uceda, Guillermo Sanchez, Miguel A. Puentes, Javier Serrano, Panagiotis Karamolegkos, Rute C. Sofia
Tue, 10 Ma · cs
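For context, a minimal sketch of the kind of baseline manual workflow CODECO is evaluated against: deploying a single service of a multi-service application with the official Kubernetes Python client. The service name, image, and namespace are illustrative, and CODECO's own interfaces are not shown; the point is the per-service boilerplate an automation toolkit can eliminate.

```python
# Baseline manual deployment of ONE service via the official Kubernetes
# Python client (assumes the `kubernetes` package and a reachable cluster).
# CODECO's API is not shown here; this is the workflow it automates.
from kubernetes import client, config

def deploy_service(name: str, image: str, replicas: int = 1) -> None:
    """Create a Deployment for a single service in the 'default' namespace."""
    config.load_kube_config()  # reads ~/.kube/config
    container = client.V1Container(name=name, image=image)
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": name}),
        spec=client.V1PodSpec(containers=[container]),
    )
    spec = client.V1DeploymentSpec(
        replicas=replicas,
        selector=client.V1LabelSelector(match_labels={"app": name}),
        template=template,
    )
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name=name), spec=spec
    )
    client.AppsV1Api().create_namespaced_deployment(
        namespace="default", body=deployment
    )

# Every service, placement constraint, and cross-node dependency needs
# hand-written steps like this; the evaluation measures how much of that
# manual intervention CODECO removes.
```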

Agentic AI-Driven UAV Network Deployment: An LLM-Enhanced Exact Potential Game Approach

This paper proposes a dual spatial-scale UAV network optimization framework that combines exact potential game algorithms for link configuration and deployment with a large language model to dynamically generate utility weights, thereby enhancing adaptability and performance in terms of energy efficiency, latency, and throughput.

Xin Tang, Qian Chen, Binhan Liao, Yaqi Zhang, Jianxin Chen, Changyuan Zhao, Junchuan Fan, Junxi Tian, Xiaohuan Li
Tue, 10 Ma · cs
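A minimal sketch of the game-theoretic core, under simplifying assumptions that are ours, not the paper's: all UAVs share one global utility, which makes the game an exact potential game (the shared utility is the potential), so unilateral best responses converge to a Nash equilibrium. The grid, cost terms, and weight values are hypothetical; in the paper the weights would be generated dynamically by the LLM.

```python
# Best-response dynamics in an (identical-interest) exact potential game.
# The utility weights for energy, latency, and throughput are treated as
# inputs that, per the paper, an LLM would generate. All numbers are toy.
import itertools
import random

POSITIONS = [(x, y) for x in range(4) for y in range(4)]  # candidate UAV sites

def global_utility(placement, w_energy, w_latency, w_throughput):
    """Toy weighted objective standing in for the paper's utility terms."""
    energy = sum(x + y for x, y in placement)               # farther = costlier
    # pairwise spread as a crude throughput proxy (less mutual interference)
    spread = sum(abs(a[0] - b[0]) + abs(a[1] - b[1])
                 for a, b in itertools.combinations(placement, 2))
    latency = sum(max(abs(x - 2), abs(y - 2)) for x, y in placement)
    return w_throughput * spread - w_energy * energy - w_latency * latency

def best_response_dynamics(n_uavs=3, weights=(1.0, 0.5, 2.0), max_rounds=50):
    w_e, w_l, w_t = weights  # in the paper, supplied by the LLM
    placement = [random.choice(POSITIONS) for _ in range(n_uavs)]
    for _ in range(max_rounds):
        improved = False
        for i in range(n_uavs):
            def value(pos):
                trial = placement[:i] + [pos] + placement[i + 1:]
                return global_utility(trial, w_e, w_l, w_t)
            best = max(POSITIONS, key=value)
            if value(best) > value(placement[i]):
                placement[i] = best
                improved = True
        if not improved:  # no unilateral improvement => Nash equilibrium
            break
    return placement

print(best_response_dynamics())
```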

Uber's Failover Architecture: Reconciling Reliability and Efficiency in Hyperscale Microservice Infrastructure

Uber's Failover Architecture (UFA) replaces its costly uniform 2x capacity model with a differentiated, criticality-based approach that opportunistically shares resources and preempts non-critical services during peak failovers, thereby reducing steady-state provisioning from 2x to 1.3x and eliminating over one million CPU cores while maintaining 99.97% availability.

Mayank Bansal, Milind Chabbi, Kenneth Bogh, Srikanth Prodduturi, Kevin Xu, Amit Kumar, David Bell, Ranjib Dey, Yufei Ren, Sachin Sharma, Juan Marcano, Shriniket Kale, Subhav Pradhan, Ivan Beschastnikh, Miguel Covarrubias, Chien-Chih Liao, Sandeep Koushik Sheshadri, Wen Luo, Kai Song, Ashish Samant, Sahil Rihan, Nimish Sheth, Uday Kiran Medisetty
Tue, 10 Ma · cs
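The headline numbers imply simple capacity arithmetic; a sketch, assuming a 30% critical share (our assumption, chosen to reproduce the paper's 1.3x figure):

```python
# Back-of-the-envelope capacity math behind differentiated failover. Under
# uniform failover every region holds a full spare copy of peer load (2x);
# under UFA-style differentiation only critical load needs guaranteed
# headroom, because non-critical services are preempted during the failover.

def provisioning_factor(critical_share: float) -> float:
    """Capacity per unit of steady-state load when only the critical share
    of a failed peer region must be absorbed."""
    return 1.0 + critical_share

uniform = 1.0 + 1.0                          # full failover copy: 2.0x
differentiated = provisioning_factor(0.30)   # assumed 30% critical => 1.3x
print(f"uniform: {uniform:.1f}x, differentiated: {differentiated:.1f}x")
print(f"capacity saved: {1 - differentiated / uniform:.0%}")
```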

Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

The paper presents NANOMIND, a software-hardware co-design framework that decomposes Large Multimodal Models (LMMs) into modular components and dynamically schedules them across heterogeneous accelerators on unified-memory SoCs, enabling a battery-powered device to run LMMs entirely on-device with significantly improved energy efficiency and throughput.

Yilong Li, Shuai Zhang, Yijing Zeng, Hao Zhang, Xinmiao Xiong, Jingyu Liu, Pan Hu, Suman Banerjee
Tue, 10 Ma · cs.CL
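A sketch of the scheduling idea under assumed per-module costs (all numbers hypothetical, not NANOMIND's measured profile): each decomposed module is greedily placed on the lowest-energy accelerator that keeps end-to-end latency within budget, and the unified memory means no copy cost is modeled between placements.

```python
# Toy greedy scheduler for decomposed LMM modules on a heterogeneous,
# unified-memory SoC. Costs are illustrative, not NANOMIND's numbers.

MODULES = ["vision_encoder", "projector", "llm_prefill", "llm_decode"]

# (energy J, latency ms) per module per accelerator -- all hypothetical
COST = {
    "cpu": {"vision_encoder": (9.0, 80), "projector": (0.4, 4),
            "llm_prefill": (14.0, 300), "llm_decode": (6.0, 120)},
    "gpu": {"vision_encoder": (3.0, 25), "projector": (0.3, 2),
            "llm_prefill": (7.0, 90),  "llm_decode": (4.0, 60)},
    "npu": {"vision_encoder": (1.2, 30), "projector": (0.2, 3),
            "llm_prefill": (2.5, 260), "llm_decode": (1.5, 70)},
}

def schedule(latency_budget_ms: float):
    """Greedy: per module, pick the lowest-energy accelerator that keeps the
    running end-to-end latency under budget (else fall back to the fastest)."""
    plan, total_ms = {}, 0.0
    for m in MODULES:
        by_energy = sorted(COST, key=lambda acc: COST[acc][m][0])
        pick = next((acc for acc in by_energy
                     if total_ms + COST[acc][m][1] <= latency_budget_ms),
                    min(COST, key=lambda acc: COST[acc][m][1]))  # fastest
        plan[m] = pick
        total_ms += COST[pick][m][1]
    return plan, total_ms

# Prefill is cheap but slow on the NPU here, so it spills to the GPU:
print(schedule(latency_budget_ms=250))
```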

EROICA: Online Performance Troubleshooting for Large-scale Model Training

This paper presents EROICA, the first online troubleshooting system deployed on production-scale GPU clusters (~100,000 GPUs), which effectively diagnoses complex hardware and software performance issues in large-scale model training through fine-grained profiling and differential observability while imposing minimal overhead on running jobs.

Yu Guan, Zhiyu Yin, Haoyu Chen, Sheng Cheng, Chaojie Yang, Kun Qian, Tianyin Xu, Pengcheng Zhang, Yang Zhang, Hanyu Zhao, Yong Li, Wei Lin, Dennis Cai, Ennan Zhai
Tue, 10 Ma · cs.LG
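A toy illustration of the differential-observability idea, not EROICA's actual detector: in synchronous data-parallel training, nominally identical ranks should show near-identical step times, so a robust cross-rank comparison isolates the outlier that stalls every step.

```python
# Flag straggler ranks by comparing per-rank step-time medians against the
# cross-rank median, using the median absolute deviation (MAD) as the scale.
import statistics

def flag_stragglers(step_ms_by_rank: dict[int, list[float]], k: float = 4.0):
    """Flag ranks whose median step time deviates from the cross-rank median
    by more than k MADs."""
    medians = {r: statistics.median(t) for r, t in step_ms_by_rank.items()}
    center = statistics.median(medians.values())
    mad = statistics.median(abs(m - center) for m in medians.values()) or 1e-9
    return [r for r, m in medians.items() if abs(m - center) > k * mad]

# Rank 3 is ~15% slower per step -- enough to stall every synchronous step.
timings = {r: [100.0, 101.0, 99.5] for r in range(8)}
timings[3] = [115.0, 116.5, 114.8]
print(flag_stragglers(timings))  # -> [3]
```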

General Coded Computing in a Probabilistic Straggler Regime

This paper theoretically demonstrates that in distributed computing systems with probabilistic stragglers, the approximation errors of Berrut Approximate Coded Computing (BACC) and Learning Theoretic Coded Computing (LeTCC) schemes converge to zero at specific rates despite the average number of stragglers scaling with the total server count, a finding validated through experiments on various functions including deep neural networks.

Parsa Moradi, Mohammad Ali Maddah-Ali
Tue, 10 Ma · cs.LG
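For intuition, a sketch of the Berrut barycentric rational interpolant that BACC-style decoding builds on: it reconstructs the target function from whichever workers respond, so stragglers degrade accuracy gracefully rather than blocking exact recovery. The test function and straggler rate below are illustrative; the convergence-rate analysis itself is the paper's contribution.

```python
# Berrut's rational interpolant in barycentric form, with weights (-1)^i over
# the (ordered) surviving nodes -- the numerically stable decoder used by
# approximate coded computing schemes such as BACC.
import numpy as np

def berrut(x_nodes, f_nodes, x_eval):
    """Evaluate Berrut's barycentric rational interpolant at x_eval."""
    w = (-1.0) ** np.arange(len(x_nodes))          # alternating weights
    diff = x_eval[:, None] - x_nodes[None, :]      # (m, n)
    diff[diff == 0] = 1e-300                       # guard exact node hits
    terms = w / diff
    return (terms @ f_nodes) / terms.sum(axis=1)

rng = np.random.default_rng(0)
n = 64
nodes = np.cos((2 * np.arange(n) + 1) * np.pi / (2 * n))  # Chebyshev points
f = lambda x: np.tanh(3 * x)                               # stand-in for a DNN
alive = np.sort(rng.choice(n, size=int(0.7 * n), replace=False))
x_eval = np.linspace(-0.9, 0.9, 200)
approx = berrut(nodes[alive], f(nodes[alive]), x_eval)     # 30% stragglers
print("max error with 30% stragglers:", np.max(np.abs(approx - f(x_eval))))
```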

Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet

The paper introduces Covenant-72B, a 72-billion-parameter language model successfully pre-trained on 1.1 trillion tokens through the largest permissionless, globally distributed collaboration to date, demonstrating that open, blockchain-supported participation can achieve performance competitive with centralized training at unprecedented scale.

Joel Lidin, Amir Sarfi, Erfan Miahi, Quentin Anthony, Shivam Chauhan, Evangelos Pappas, Benjamin Thérien, Eugene Belilovsky, Samuel Dare
Tue, 10 Ma · cs.LG
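The summary does not specify how untrusted peer contributions are handled; one generic ingredient of trustless distributed training is robust aggregation. A purely illustrative sketch using a coordinate-wise median, which is a standard choice in this setting and not Covenant-72B's actual protocol:

```python
# Coordinate-wise median aggregation: a single malicious or corrupted peer
# cannot drag any coordinate arbitrarily far, unlike a plain mean.
# Illustrative only -- not the paper's mechanism.
import numpy as np

def robust_aggregate(peer_updates: list[np.ndarray]) -> np.ndarray:
    """Aggregate peer updates by taking the median of each coordinate."""
    return np.median(np.stack(peer_updates), axis=0)

rng = np.random.default_rng(0)
honest = [rng.normal(0.1, 0.01, size=4) for _ in range(9)]
poisoned = honest + [np.full(4, 1e6)]                 # one adversarial update
print("mean  :", np.mean(np.stack(poisoned), axis=0))  # blown up
print("median:", robust_aggregate(poisoned))           # near honest value
```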

Scalable Training of Mixture-of-Experts Models with Megatron Core

This paper presents Megatron Core, a scalable and production-ready open-source framework that addresses the coupled memory, communication, and computation challenges of Mixture-of-Experts (MoE) training through integrated system-level optimizations, enabling high-performance training of models ranging from billions to trillions of parameters on large-scale GPU clusters.

Zijie Yan, Hongxiao Bai, Xin Yao, Dennis Liu, Tong Liu, Hongbin Liu, Pingtian Li, Evan Wu, Shiqing Fan, Li Tao, Robin Zhang, Yuzhong Wang, Shifang Xu, Jack Chang, Xuwen Chen, Kunlun Li, Yan Bai, Gao Deng, Nan Zheng, Vijay Anand Korthikanti, Abhinav Khattar, Ethan He, Soham Govande, Sangkug Lym, Zhongbo Zhu, Qi Zhang, Haochen Yuan, Xiaowei Ren, Deyu Fu, Tailai Ma, Shunkang Zhang, Jiang Shao, Ray Wang, Santosh Bhavani, Xipeng Li, Chandler Zhou, David Wu, Yingcan Wei, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, June Yang (all NVIDIA)
Tue, 10 Ma · cs.LG
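A single-process sketch of the routing step whose distributed form (top-k gating followed by an all-to-all token exchange across expert-parallel ranks) is what MoE training systems like Megatron Core optimize; the shapes and dispatch rule below are illustrative.

```python
# Top-k MoE gating and per-expert token dispatch. In expert parallelism each
# expert lives on a different rank, so the per-expert buckets built here
# become the payload of an all-to-all collective.
import numpy as np

def route(tokens: np.ndarray, gate_w: np.ndarray, top_k: int = 2):
    """Return, per expert, (token index, combine weight) pairs."""
    logits = tokens @ gate_w                              # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)             # softmax gating
    topk = np.argsort(-probs, axis=1)[:, :top_k]          # chosen experts
    dispatch = {e: [] for e in range(gate_w.shape[1])}
    for t in range(tokens.shape[0]):
        for e in topk[t]:
            dispatch[e].append((t, probs[t, e]))
    return dispatch

rng = np.random.default_rng(0)
buckets = route(rng.normal(size=(16, 32)), rng.normal(size=(32, 8)))
print({e: len(v) for e, v in buckets.items()})            # tokens per expert
```

The load per expert printed above is exactly what couples memory, communication, and computation: imbalanced buckets mean idle experts, oversized all-to-all messages, and activation-memory spikes, which is why the framework's optimizations have to be integrated rather than applied per subsystem.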