This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, all prior results suffer from a severe sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$. The current paper overcomes this barrier by certifying the minimax optimality of two algorithms -- a perturbed model-based algorithm and a conservative model-based algorithm -- as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). Moving beyond infinite-horizon MDPs, we further study time-inhomogeneous finite-horizon MDPs, and prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information theoretically infeasible).
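To make the "model-based with a generative model" paradigm concrete, the following is a minimal illustrative sketch of the generic pipeline the abstract refers to: draw a fixed number of next-state samples per state-action pair from the simulator, form an empirical transition model, and plan in that empirical MDP with value iteration. It deliberately omits the perturbation and conservatism modifications analyzed in the paper; all function names and parameters below are illustrative assumptions, not the paper's exact algorithm.

```python
# Hedged sketch of generative-model-based planning for a gamma-discounted MDP.
# This is NOT the paper's perturbed or conservative algorithm; it only shows
# the generic "estimate the model, then plan" pipeline. Names are illustrative.
import numpy as np

def empirical_model(sample_next_state, n_states, n_actions, n_samples, rng):
    """Estimate P_hat[s, a, s'] by drawing `n_samples` next-state samples
    from the generative model for every state-action pair (s, a)."""
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                P_hat[s, a, sample_next_state(s, a, rng)] += 1.0
    return P_hat / n_samples

def value_iteration(P_hat, R, gamma, tol=1e-8, max_iter=10_000):
    """Plan in the empirical MDP (P_hat, R) with standard value iteration;
    returns the greedy (deterministic) policy and its value estimate."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P_hat[s, a, s'] * V[s']
        Q = R + gamma * (P_hat @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return Q.argmax(axis=1), V

# Example usage on a tiny synthetic MDP (purely illustrative):
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # true transition kernel
R = rng.uniform(size=(S, A))                      # rewards in [0, 1]
sampler = lambda s, a, rng: rng.choice(S, p=P_true[s, a])
P_hat = empirical_model(sampler, S, A, n_samples=200, rng=rng)
policy, V_hat = value_iteration(P_hat, R, gamma)
```

In this sketch the total sample budget is `n_samples * S * A`; the paper's contribution concerns how small that budget can be (down to the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$, up to log factors) while the planned policy remains near-optimal, which is precisely where the plain estimator above must be modified (perturbed or made conservative) in the infinite-horizon case.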