In this paper, we investigate the concentration properties of cumulative reward in Markov Decision Processes (MDPs), focusing on both asymptotic and non-asymptotic settings. We introduce a unified approach to characterizing reward concentration in MDPs, covering both infinite-horizon settings (i.e., the average-reward and discounted-reward frameworks) and the finite-horizon setting. Our asymptotic results include the law of large numbers, the central limit theorem, and the law of the iterated logarithm, while our non-asymptotic bounds include Azuma-Hoeffding-type inequalities and a non-asymptotic version of the law of the iterated logarithm. Additionally, we explore two key implications of our results. First, we analyze the sample-path behavior of the difference in rewards between any two stationary policies. Second, we show that two alternative definitions of regret for learning policies proposed in the literature are rate-equivalent. Our proof techniques rely on a martingale decomposition of the cumulative reward, properties of the solution to the policy evaluation fixed-point equation, and both asymptotic and non-asymptotic concentration results for martingale difference sequences.
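To make these proof ingredients concrete, the following is a minimal sketch for the average-reward case under a stationary policy $\pi$; the notation (gain $g^{\pi}$, bias $h^{\pi}$, reward function $r$, filtration $\mathcal{F}_t$) is illustrative and may differ from the paper's. The policy evaluation fixed-point (Poisson) equation and the resulting martingale decomposition of the cumulative reward take the form
\begin{align*}
  % Policy evaluation (Poisson) fixed-point equation for a stationary policy \pi:
  h^{\pi}(s) + g^{\pi} &= r(s,\pi(s)) + \mathbb{E}\!\left[h^{\pi}(S_{t+1}) \mid S_t = s\right],\\
  % Centered cumulative reward = bounded bias terms + a sum of martingale differences:
  \sum_{t=0}^{T-1} r(S_t, A_t) - T g^{\pi}
    &= h^{\pi}(S_0) - h^{\pi}(S_T) + \sum_{t=1}^{T} M_t,
  \qquad M_t := h^{\pi}(S_t) - \mathbb{E}\!\left[h^{\pi}(S_t) \mid \mathcal{F}_{t-1}\right].
\end{align*}
Since $(M_t)$ is a martingale difference sequence bounded by the span $\mathrm{sp}(h^{\pi})$, Azuma-Hoeffding gives $\bigl|\sum_{t=1}^{T} M_t\bigr| \le \mathrm{sp}(h^{\pi})\sqrt{2T\log(2/\delta)}$ with probability at least $1-\delta$, which is the flavor of non-asymptotic bound discussed above; the asymptotic statements follow from the corresponding martingale limit theorems.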