We consider the adversarial Markov Decision Process (MDP) problem, where the rewards of the MDP can be adversarially chosen and the transition function can be either known or unknown. In both settings, Follow-the-Perturbed-Leader (FPL) based algorithms have been proposed in prior literature. However, the established regret bounds for FPL-based algorithms are worse than those for algorithms based on mirror descent. We improve the analysis of FPL-based algorithms in both settings, matching the current best regret bounds with faster and simpler algorithms.
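To illustrate the core mechanism behind FPL, the following is a minimal sketch of the perturbed-leader rule in the simplest full-information expert setting (this toy setup, the exponential noise distribution, and the parameter `eta` are illustrative assumptions, not the paper's algorithm for adversarial MDPs):

```python
import numpy as np

def fpl_select(cum_rewards, eta, rng):
    # Follow-the-Perturbed-Leader: add i.i.d. exponential noise
    # (scale 1/eta) to the cumulative rewards, then play the
    # action whose perturbed total is largest.
    noise = rng.exponential(scale=1.0 / eta, size=cum_rewards.shape)
    return int(np.argmax(cum_rewards + noise))

# Toy run with 2 arms; arm 1 yields higher rewards on average.
rng = np.random.default_rng(0)
K, T, eta = 2, 500, 0.1
rewards = np.stack([rng.uniform(0.0, 0.4, T),   # arm 0
                    rng.uniform(0.6, 1.0, T)],  # arm 1 (better)
                   axis=1)
cum = np.zeros(K)
picks = []
for t in range(T):
    a = fpl_select(cum, eta, rng)
    picks.append(a)
    cum += rewards[t]  # full-information feedback: observe all rewards

frac = sum(picks) / T  # fraction of rounds the better arm was chosen
print(frac)
```

The randomization keeps the chosen action stable across rounds (low switching cost) while hedging against adversarial reward sequences; the only per-round work is a sampling step and an argmax, which is why FPL-style methods tend to be computationally lighter than mirror-descent updates.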