Motivated by practical applications where stable long-term performance is critical, such as robotics, operations research, and healthcare, we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.
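For context, a minimal sketch of the objective being estimated, assuming the standard $(s,a)$-rectangular formulation of DR average-reward MDPs (the symbols $P^{\circ}$, $D$, and $\delta$ below are illustrative notation, not taken from the paper): the robust average reward of a policy $\pi$ is its worst-case long-run reward over transition kernels lying in a divergence ball around the nominal kernel $P^{\circ}$,
\[
  \rho^{\pi}_{\mathrm{rob}}
  \;=\;
  \inf_{P \in \mathcal{P}}\;
  \liminf_{T \to \infty}
  \frac{1}{T}\,
  \mathbb{E}^{\pi}_{P}\!\left[\sum_{t=0}^{T-1} r(s_t, a_t)\right],
  \qquad
  \mathcal{P}
  \;=\;
  \bigotimes_{(s,a) \in \mathbf{S} \times \mathbf{A}}
  \Bigl\{ P_{s,a} \in \Delta(\mathbf{S}) \;:\; D\!\bigl(P_{s,a} \,\big\|\, P^{\circ}_{s,a}\bigr) \le \delta \Bigr\},
\]
where $D$ is the KL or $f_k$ divergence and $\delta$ is the uncertainty radius; the algorithms above estimate $\sup_{\pi} \rho^{\pi}_{\mathrm{rob}}$ and a near-optimal policy to accuracy $\varepsilon$ from samples of the nominal MDP.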