A Systematic Evaluation of the Potential of Carbon-Aware Execution for Scientific Workflows

This paper systematically evaluates the potential of carbon-aware execution strategies for scientific workflows, demonstrating that leveraging their inherent flexibility through temporal shifting and dynamic resource scaling can reduce carbon emissions by over 80% and 67%, respectively.

Kathleen West, Youssef Moawad, Fabian Lehmann, Vasilis Bountris, Ulf Leser, Yehia Elkhatib, Lauritz Thamsen | Mon, 09 Ma | 💻 cs

Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using $\mathbb{F}_2$

This paper introduces "Linear Layouts," a novel framework that models tensor layouts as linear algebra operations over $\mathbb{F}_2$ to enable generic, efficient, and bug-free layout definitions and conversions for deep learning workloads, successfully integrating with the Triton compiler to overcome the limitations of existing case-by-case approaches.

Keren Zhou, Mario Lezcano, Adam Goucher, Akhmed Rakhmati, Jeff Niu, Justin Lebar, Pawel Szczerbuk, Peter Bell, Phil Tillet, Thomas Raoux, Zahi Moudallal | Mon, 09 Ma | 💻 cs

FAST: An Efficient Scheduler for All-to-All GPU Communication

FAST is an efficient scheduler designed to overcome the scalability and performance limitations of existing solutions for All-to-All(v) communication in dynamic Mixture-of-Experts workloads by addressing traffic skew and incast congestion while drastically reducing synthesis time on modern GPU clusters.

Yiran Lei, Dongjoo Lee, Liangyu Zhao, Daniar Kurniawan, Chanmyeong Kim, Heetaek Jeong, Changsu Kim, Hyeonseong Choi, Liangcheng Yu, Arvind Krishnamurthy, Justine Sherry, Eriko Nurvitadhi | Mon, 09 Ma | 💻 cs

λScale: Enabling Fast Scaling for Serverless Large Language Model Inference

λScale is an efficient serverless inference system that accelerates large language model scaling by leveraging high-speed RDMA networks for fast model multicast and enabling "execute-while-load" distributed inference, thereby significantly reducing tail latency and costs compared to state-of-the-art solutions.

Minchen Yu, Rui Yang, Chaobo Jia, Zhaoyuan Su, Sheng Yao, Tingfeng Lan, Yuchen Yang, Zirui Wang, Yue Cheng, Wei Wang, Ao Wang, Ruichuan Chen | Mon, 09 Ma | 💻 cs

Provuse: Platform-Side Function Fusion for Performance and Efficiency in FaaS Environments

This paper introduces Provuse, a transparent platform-side optimization for FaaS environments that automatically fuses independently deployed functions at runtime to eliminate redundant instances, thereby reducing end-to-end latency by an average of 26.33% and RAM usage by 53.57% without requiring any code changes from developers.

Niklas Kowallik, Natalie Carl, Leon Pöllinger, Wei Wang, Sharan Santhahanam, David Bermbach | Mon, 09 Ma | 💻 cs

Knowledge-driven Reasoning for Mobile Agentic AI: Concepts, Approaches, and Directions

This paper proposes a knowledge-driven reasoning framework for mobile agentic AI that extracts and synchronizes reusable decision structures to optimize on-device performance under resource and connectivity constraints, demonstrating that an optimal, non-monotonic level of knowledge injection significantly enhances mission reliability and efficiency compared to existing approaches.

Guangyuan Liu, Changyuan Zhao, Yinqiu Liu, Dusit Niyato, Biplab Sikdar | Mon, 09 Ma | 💻 cs

Gathering Autonomous Mobile Robots Under the Adversarial Defected View Model

This paper presents two distributed algorithms that guarantee deterministic finite-time gathering for N oblivious autonomous mobile robots in the Euclidean plane under the adversarial defected view model, achieving success in the fully synchronous setting with a (4, 2) fault constraint and in the asynchronous setting with a general (N, K) fault constraint, both under non-rigid motion.

Prakhar Shukla, Seshunadh Tanuj Peddinti, Subhash Bhagat | Mon, 09 Ma | 💻 cs

Why Ethereum Needs Fairness Mechanisms that Do Not Depend on Participant Altruism

This paper argues that Ethereum's decentralization and censorship resistance ideals cannot be restored by relying on altruistic block proposers, as empirical analysis reveals that less than 1.4% of proposers consistently act in accordance with these objectives, thereby necessitating the implementation of incentive- or penalty-based fairness mechanisms.

Patrick Spiesberger, Nils Henrik Beyer, Hannes Hartenstein | Mon, 09 Ma | 💻 cs

A Lock-Free Work-Stealing Algorithm for Bulk Operations

This paper presents a specialized lock-free work-stealing queue designed for a master-worker framework in mixed-integer programming solvers that leverages restricted concurrency assumptions to support native bulk operations and achieve constant-latency push performance, significantly outperforming general-purpose implementations like C++ Taskflow in batch processing scenarios.

Raja Sai Nandhan Yadav Kataru, Danial Davarnia, Ali Jannesari | Mon, 09 Ma | 🔢 math

Parallelization Strategies for Dense LLM Deployment: Navigating Through Application-Specific Tradeoffs and Bottlenecks

This paper investigates parallelization strategies for deploying dense LLMs, demonstrating that while Tensor Parallelism optimizes latency and Pipeline Parallelism enhances throughput, a hybrid approach allows for effective control over the inherent latency-throughput tradeoff to meet specific application requirements.

Burak Topcu, Musa Oguzhan Cim, Poovaiah Palangappa, Meena Arunachalam, Mahmut Taylan Kandemir | Mon, 09 Ma | 🤖 cs.LG

First-Order Softmax Weighted Switching Gradient Method for Distributed Stochastic Minimax Optimization with Stochastic Constraints

This paper proposes a first-order Softmax-Weighted Switching Gradient method for distributed stochastic minimax optimization under stochastic constraints, achieving optimal oracle complexity and high-probability convergence guarantees in both full and partial client participation settings while avoiding the instability of traditional primal-dual approaches.

Zhankun Luo, Antesh Upadhyay, Sang Bin Moon, Abolfazl Hashemi | Mon, 09 Ma | 🤖 cs.LG

StreamWise: Serving Multi-Modal Generation in Real-Time at Scale

StreamWise is an adaptive, modular serving system that leverages heterogeneous hardware and dynamic resource management to enable cost-effective, high-quality real-time multi-modal generation (such as podcast videos) with sub-second startup delays, overcoming the latency and complexity challenges of coordinating diverse models at scale.

Haoran Qiu, Gohar Irfan Chaudhry, Chaojie Zhang, Íñigo Goiri, Esha Choukse, Rodrigo Fonseca, Ricardo Bianchini | Mon, 09 Ma | 🤖 cs.AI

Radiation Hydrodynamics at Scale: Comparing MPI and Asynchronous Many-Task Runtimes with FleCSI

This paper benchmarks the FleCSI framework's MPI, Legion, and HPX backends using Poisson and radiation hydrodynamics applications on up to 1024 nodes, revealing that while the MPI backend offers superior scalability for communication-heavy tasks, the HPX backend delivers significant performance gains (up to 1.64x speedup) for computation-intensive hydrodynamics workloads on smaller node counts.

Alexander Strack, Hartmut Kaiser, Dirk Pflüger | 2026-03-06 | 💻 cs