SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training

The paper proposes SENTINEL, a lightweight, momentum-based verification mechanism for pipeline-parallel decentralized training across untrusted nodes. It detects corrupted inter-stage communications without duplicating computation, which makes it possible to train large models, such as 4B-parameter LLMs, securely.

Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe + 5 more · 2026-03-05 · cs.LG
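To give a rough feel for what "momentum-based" checking of inter-stage messages can mean, here is a minimal sketch. It is an assumption for illustration only and is not SENTINEL's actual algorithm: the class name `MomentumVerifier`, the choice of the L2 norm as the tracked statistic, and the `beta`, `threshold`, `warmup`, and `rel_floor` parameters are all hypothetical. The receiving stage keeps exponential moving averages (momentum) of the norm of incoming activations and rejects messages whose norm is an outlier, without recomputing anything.

```python
import math


class MomentumVerifier:
    """HYPOTHETICAL momentum-based check on activations from the previous stage.

    Not SENTINEL's method: it tracks exponential moving averages (EMAs) of the
    L2 norm of each incoming message and flags messages whose norm deviates
    from the running mean by more than `threshold` standard deviations.
    """

    def __init__(self, beta=0.9, threshold=4.0, warmup=5, rel_floor=0.05):
        self.beta = beta            # momentum coefficient for both EMAs
        self.threshold = threshold  # z-score above which a message is flagged
        self.warmup = warmup        # first messages are accepted unchecked
        self.rel_floor = rel_floor  # lower bound on std, relative to the mean
        self.mean = None            # EMA of the message norm
        self.var = 0.0              # EMA of squared deviation from the mean
        self.count = 0              # number of accepted messages so far

    def check(self, activations):
        """Return True if the message is accepted, False if it is flagged."""
        stat = math.sqrt(sum(a * a for a in activations))
        if self.count >= self.warmup:
            std = max(math.sqrt(self.var), self.rel_floor * abs(self.mean), 1e-8)
            if abs(stat - self.mean) / std > self.threshold:
                return False  # flagged: keep it out of the running statistics
        self._update(stat)
        return True

    def _update(self, stat):
        """Fold an accepted statistic into the momentum (EMA) estimates."""
        self.count += 1
        if self.mean is None:
            self.mean = stat
            return
        delta = stat - self.mean
        self.mean = self.beta * self.mean + (1 - self.beta) * stat
        self.var = self.beta * self.var + (1 - self.beta) * delta ** 2


if __name__ == "__main__":
    verifier = MomentumVerifier()
    clean = [[1 + 0.01 * math.sin(i)] * 64 for i in range(20)]
    print(all(verifier.check(x) for x in clean))  # True
    print(verifier.check([100.0] * 64))           # False (corrupted message)
```

Flagged messages are deliberately excluded from the EMA update so that a corrupted stage cannot gradually drag the running statistics toward its own outputs; a real system would need richer statistics than a single norm.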

Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?

This paper finds that LLM-based agents systematically fail at cloud root cause analysis because of architectural flaws, such as hallucinated data interpretation and incomplete exploration, rather than model limitations. It shows that prompt engineering alone is insufficient, whereas enhancing inter-agent communication protocols can significantly reduce specific failure modes.

Taeyoon Kim, Woohyeok Park, Hoyeong Yun + 1 more · 2026-03-05 · cs.AI