We study the problem of reinforcement learning (RL) with low (policy) switching cost, a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be kept low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of only $O(HSA \log\log T)$. This is an exponential improvement over the best-known switching cost $O(H^2SA\log T)$ among existing methods with $\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$ regret. In the above, $S$ and $A$ denote the numbers of states and actions in an $H$-horizon episodic Markov Decision Process with unknown transitions, and $T$ is the number of steps. We also prove an information-theoretic lower bound showing that a switching cost of $\Omega(HSA)$ is necessary for any no-regret algorithm. As a byproduct, our new algorithmic techniques allow us to derive a \emph{reward-free} exploration algorithm with an optimal switching cost of $O(HSA)$.
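To illustrate where a $\log\log T$ dependence can arise (a sketch of the standard doubly-exponential batching schedule, not necessarily the exact schedule used by our algorithm): suppose the $T$ steps are partitioned into stages whose endpoints grow as $T_i = T^{1-2^{-i}}$, and policies are only updated at stage boundaries. Then
\[
\frac{T}{T_i} = T^{2^{-i}} \le 2 \quad \text{once } i \ge \log_2\log_2 T,
\]
so $O(\log\log T)$ stages suffice to cover all $T$ steps. If each stage boundary triggers at most $O(HSA)$ policy switches, the total switching cost is $O(HSA\log\log T)$.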