Monte-Carlo planning and Reinforcement Learning (RL) are essential to sequential decision making. The recent AlphaGo and AlphaZero algorithms have shown how to successfully combine these two paradigms to solve large-scale sequential decision problems. These methodologies exploit a variant of the well-known UCT algorithm to trade off exploitation of good actions and exploration of unvisited states, but their empirical success comes at the cost of poor sample efficiency and high computation time. In this paper, we overcome these limitations by considering convex regularization in Monte-Carlo Tree Search (MCTS), which has been used successfully in RL to efficiently drive exploration. First, we introduce a unifying theory on the use of generic convex regularizers in MCTS, deriving the regret analysis and providing guarantees of an exponential convergence rate. Second, we exploit our theoretical framework to introduce novel regularized backup operators for MCTS, based on the relative entropy of the policy update and on the Tsallis entropy of the policy. Finally, we empirically evaluate the proposed operators in AlphaGo and AlphaZero on problems of increasing dimensionality and branching factor, from a toy problem to several Atari games, showing their superiority w.r.t. representative baselines.
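As a brief illustration of the kind of backup the abstract refers to (a sketch using standard convex-duality notation; the symbols $\Omega$, $\tau$, and $\Delta_{\mathcal{A}}$ are ours and do not appear in the text above): given a convex regularizer $\Omega$ on the simplex of policies $\Delta_{\mathcal{A}}$, a regularized backup replaces the hard maximum over action values with the Legendre-Fenchel conjugate of $\Omega$,
\[
V_{\Omega}(s) \;=\; \Omega^{*}\!\big(Q(s,\cdot)\big) \;=\; \max_{\pi \in \Delta_{\mathcal{A}}} \Big( \sum_{a} \pi(a \mid s)\, Q(s,a) \;-\; \Omega(\pi) \Big),
\qquad
\pi_{\Omega}(\cdot \mid s) \;=\; \nabla \Omega^{*}\!\big(Q(s,\cdot)\big).
\]
For instance, choosing $\Omega(\pi) = \tau \sum_{a} \pi(a \mid s) \log \pi(a \mid s)$ (negative Shannon entropy with temperature $\tau$) makes $\Omega^{*}$ the log-sum-exp, i.e., a softmax backup, while the negative Tsallis entropy yields a sparsemax-style update that assigns zero probability to poor actions.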