Non-Rectangular Average-Reward Robust MDPs: Optimal Policies and Their Transient Values
This paper establishes that history-dependent policies with sublinear expected regret are robust-optimal for average-reward robust MDPs without requiring rectangularity. It further introduces a transient-value framework with an epoch-based policy that, by combining worst-case optimality with online learning, achieves constant-order finite-time performance.