Monte-Carlo Tree Search (MCTS) is one of the most-widely used methods for planning, and has powered many recent advances in artificial intelligence. In MCTS, one typically performs computations (i.e., simulations) to collect statistics about the possible future consequences of actions, and then chooses accordingly. Many popular MCTS methods such as UCT and its variants decide which computations to perform by trading-off exploration and exploitation. In this work, we take a more direct approach, and explicitly quantify the value of a computation based on its expected impact on the quality of the action eventually chosen. Our approach goes beyond the "myopic" limitations of existing computation-value-based methods in two senses: (I) we are able to account for the impact of non-immediate (ie, future) computations (II) on non-immediate actions. We show that policies that greedily optimize computation values are optimal under certain assumptions and obtain results that are competitive with the state-of-the-art.
翻译:Monte-Carlo树搜索(MCTS)是最广泛使用的规划方法之一,它使人造智能最近取得的许多进步成为动力。在MCTS中,人们通常进行计算(即模拟),收集行动今后可能产生的后果的统计数据,然后作出相应的选择。许多流行的 MCT方法,如UCT及其变体,决定通过交易勘探和开发进行哪些计算。在这项工作中,我们采取更直接的方法,根据对最终选择的行动的质量的预期影响,明确量化计算值。我们的方法超越了现有计算价值方法在两种意义上的“微小”限制:(I)我们有能力说明非中间(i,未来)计算(II)对非直接行动的影响。我们证明贪婪优化计算值的政策在某些假设下是最佳的,并获得与最新技术竞争的结果。