cs.DC papers | Gist.Science

Nezha: A Key-Value Separated Distributed Store with Optimized Raft Integration

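This paper presents Nezha, a system that combines a key-value separated architecture with the Raft consensus protocol. By optimizing the persistence strategy and introducing a tiered garbage collection mechanism, it addresses the performance bottleneck caused by overlapping I/O operations in conventional consistent stores and significantly improves read and write throughput.

Yangyang Wang, Yucong Dong, Ziqian Cheng, Zichen Xu | Wed, 11 Ma | 💻 cs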

Hierarchical Observe-Orient-Decide-Act Enabled UAV Swarms in Uncertain Environments: Frameworks, Potentials, and Challenges

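This paper proposes a hierarchical Observe-Orient-Decide-Act (H-OODA) framework built on a cloud-edge-end layered architecture and network function virtualization. It aims to improve the adaptability, scalability, and decision-making efficiency of UAV swarms in uncertain environments by integrating autonomous decision-making with cooperative control.

Ziye Jia, Yao Wu, Qihui Wu, Lijun He, Qiuming Zhu, Fuhui Zhou, Zhu Han | Wed, 11 Ma | 💻 cs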

PIM-SHERPA: Software Method for On-device LLM Inference by Resolving PIM Memory Attribute and Layout Inconsistencies

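This paper presents PIM-SHERPA, a software-only method that resolves the memory attribute inconsistency and the weight layout inconsistency between the prefill and decode phases in processing-in-memory (PIM) systems. On Llama 3.2 models it runs at close to the theoretical maximum performance while saving roughly 47.8% to 49.7% of memory capacity.

Sunjung Lee, Sanghoon Cha, Hyeonsu Kim, Seungwoo Seo, Yuhwan Ro, Sukhan Lee, Byeongho Kim, Yongjun Park, Kyomin Sohn, Seungwon Lee, Jaehoon Yu | Wed, 11 Ma | 💻 cs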

Flash-KMeans: Fast and Memory-Efficient Exact K-Means

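This paper presents Flash-KMeans, an IO-aware, contention-free K-Means implementation designed for modern GPUs. Through kernel-level innovations such as FlashAssign and sort-inverse update, it turns K-Means from an offline processing step into an efficient online primitive, achieving substantial speedups on NVIDIA H200 over existing libraries such as cuML and FAISS.

Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Xiaoze Fan, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Kurt Keutzer, Song Han, Chenfeng Xu, Ion Stoica | Wed, 11 Ma | 💻 cs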

Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference

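This paper proposes an XLA-compiler-based implementation of the Mamba-2 state space model that uses only standard operators rather than custom CUDA kernels. It achieves portable $O(1)$ autoregressive cached inference on CPUs, NVIDIA GPUs, and Google TPUs, matching the accuracy of the PyTorch/CUDA reference implementation with strong performance.

Cosmo Santoni | Wed, 11 Ma | 🤖 cs.AI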

Case Study: Performance Analysis of a Virtualized XRootD Frontend in Large-Scale WAN Transfers

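Through a detailed case study, this paper shows that the T2_BR_SPRACE storage frontend architecture, composed of a heterogeneous cluster of XRootD virtual machines, the BBR congestion control algorithm, and TCP extensions, achieved an aggregate throughput of up to 51.3 Gb/s and a single-flow peak of 41.5 Gb/s under real production load.

J M da Silva, M A Costa, R L Iope | Wed, 11 Ma | 💻 cs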

Randomized Distributed Function Computation (RDFC): Ultra-Efficient Semantic Communication Applications to Privacy

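This paper proposes the randomized distributed function computation (RDFC) framework as a semantic communication paradigm. It proves that local differential privacy can be achieved without shared randomness, and shows that shared randomness can significantly reduce the communication rate, making RDFC an efficient strategy for privacy-aware distributed systems.

Onur Günlü | Wed, 11 Ma | ⚡ eess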

Multi-DNN Inference of Sparse Models on Edge SoCs

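This paper presents a demonstration system called SparseLoom, which generates variants from sparse models through retraining-free model stitching to enable multi-DNN inference on edge SoCs. It significantly reduces the service-level objective violation rate and improves throughput and memory efficiency.

Jiawei Luo, Di Wu, Simon Dobson, Blesson Varghese | Wed, 11 Ma | 🤖 cs.LG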

Ensuring Data Freshness in Multi-Rate Task Chains Scheduling

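This paper proposes a task scheduling framework based on data freshness constraints. It introduces task offsets to achieve just-in-time (JIT) synchronization of data production and, using dominant-path decomposition and a consensus offset search algorithm, guarantees end-to-end data freshness for multi-rate task chains while eliminating redundant sampling and artificial delay and preserving 100% schedulability under global EDF.

José Luis Conradi Hoffmann, Antônio Augusto Fröhlich | Wed, 11 Ma | 💻 cs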

Rate-Distortion Bounds for Heterogeneous Random Fields on Finite Lattices

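Targeting the block-based lossy compressors widely used in scientific computing, this paper establishes a finite-blocklength rate-distortion framework for heterogeneous random fields on finite lattices. It derives non-asymptotic bounds and quantifies the effects of spatial correlation, region geometry, heterogeneity, and block size on compression rate and dispersion.

Sujata Sinha, Vishwas Rao, Robert Underwood, David Lenz, Sheng Di, Franck Cappello, Lingjia Liu | Wed, 11 Ma | 🔢 math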

The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation

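This paper proposes a new framework called the Capability Coherence System (CCS), which maps memory consistency models (such as MESI) onto identity authorization. It shows that a revocation strategy based on release consistency (RCC) reduces the number of unauthorized operations in high-speed agent execution environments from time-dependent linear growth to a constant that is independent of agent speed, addressing the security failure of conventional time-window-based access control under large-scale concurrency.

Vladyslav Parakhin | Wed, 11 Ma | 💻 cs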

General Coded Computing in a Probabilistic Straggler Regime

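For general coded computing in which servers independently straggle with probability $p$, this paper proves theoretically that the average approximation error of two schemes, BACC and LeTCC, converges to zero at specific rates, and validates this result experimentally on a range of tasks including deep neural networks.

Parsa Moradi, Mohammad Ali Maddah-Ali | Tue, 10 Ma | 🤖 cs.LG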

EROICA: Online Performance Troubleshooting for Large-scale Model Training

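This paper introduces EROICA, the first online performance troubleshooting system for large-scale model training. Using online profiling and differential observability, it provides fine-grained, full-coverage diagnosis of mixed hardware and software issues across clusters of about 100,000 GPUs with almost no impact on production, and achieved a 97.5% success rate in real deployment.

Yu Guan, Zhiyu Yin, Haoyu Chen, Sheng Cheng, Chaojie Yang, Kun Qian, Tianyin Xu, Pengcheng Zhang, Yang Zhang, Hanyu Zhao, Yong Li, Wei Lin, Dennis Cai, Ennan Zhai | Tue, 10 Ma | 🤖 cs.LG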

Co-LoRA: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients

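To address the challenges of data and model heterogeneity in real-world settings, this paper proposes a task-relevance-aware aggregation strategy and a dimension-invariant module, Co-LoRA, and builds a multi-modal benchmark covering 40 tasks, significantly improving personalized federated learning in heterogeneous environments.

Minhyuk Seo, Taeheon Kim, Hankook Lee, Jonghyun Choi, Tinne Tuytelaars | Tue, 10 Ma | 🤖 cs.LG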

Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

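This paper proposes NANOMIND, a software-hardware co-design framework that modularizes large multimodal models and dynamically schedules the modules onto heterogeneous accelerators. It enables efficient, low-power local inference on battery-powered small devices without a network connection, significantly reducing energy consumption and GPU memory usage.

Yilong Li, Shuai Zhang, Yijing Zeng, Hao Zhang, Xinmiao Xiong, Jingyu Liu, Pan Hu, Suman Banerjee | Tue, 10 Ma | 💬 cs.CL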

Mayank Bansal, Milind Chabbi, Kenneth Bogh, Srikanth Prodduturi, Kevin Xu, Amit Kumar, David Bell, Ranjib Dey, Yufei Ren, Sachin Sharma, Juan Marcano, Shriniket Kale, Subhav Pradhan, Ivan Beschastnikh, Miguel Covarrubias, Chien-Chih Liao, Sandeep Koushik Sheshadri, Wen Luo, Kai Song, Ashish Samant, Sahil Rihan, Nimish Sheth, Uday Kiran Medisetty | Tue, 10 Ma | 💻 cs

cs.DC

Nezha: A Key-Value Separated Distributed Store with Optimized Raft Integration

Hierarchical Observe-Orient-Decide-Act Enabled UAV Swarms in Uncertain Environments: Frameworks, Potentials, and Challenges

PIM-SHERPA: Software Method for On-device LLM Inference by Resolving PIM Memory Attribute and Layout Inconsistencies

Flash-KMeans: Fast and Memory-Efficient Exact K-Means

Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference

Case Study: Performance Analysis of a Virtualized XRootD Frontend in Large-Scale WAN Transfers

Randomized Distributed Function Computation (RDFC): Ultra-Efficient Semantic Communication Applications to Privacy

Multi-DNN Inference of Sparse Models on Edge SoCs

Ensuring Data Freshness in Multi-Rate Task Chains Scheduling

Rate-Distortion Bounds for Heterogeneous Random Fields on Finite Lattices

The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation

General Coded Computing in a Probabilistic Straggler Regime

EROICA: Online Performance Troubleshooting for Large-scale Model Training

Co-LoRA: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients

Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

The Need for Quantitative Resilience Models and Metrics in Classical-Quantum Computing Systems

NEST: Network- and Memory-Aware Device Placement For Distributed Deep Learning

Configurable Runtime Orchestration for Dynamic Data Retrieval in Distributed Systems

AIReSim: A Discrete Event Simulator for Large-scale AI Cluster Reliability Modeling

Uber's Failover Architecture: Reconciling Reliability and Efficiency in Hyperscale Microservice Infrastructure
