An abundance of recent impossibility results establish that regret minimization in Markov games with adversarial opponents is both statistically and computationally intractable. Nevertheless, none of these results preclude the possibility of regret minimization under the assumption that all parties adopt the same learning procedure. In this work, we present the first (to our knowledge) algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents. The bounds we obtain are for swap regret, and thus, along the way, imply convergence to a correlated equilibrium. Our algorithm is decentralized, computationally efficient, and does not require any communication between agents. Our key observation is that online learning via policy optimization in Markov games essentially reduces to a form of weighted regret minimization, with unknown weights determined by the path length of the agents' policy sequence. Consequently, controlling the path length leads to weighted regret objectives for which sufficiently adaptive algorithms provide sublinear regret guarantees.
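To make the reduction concrete, here is a minimal illustrative sketch in standard online-learning notation (the symbols $\ell_t$, $a_t$, $\phi$, $w_t$, and $\pi_t$ below are assumptions for exposition, not necessarily the paper's own notation): swap regret compares the realized losses against the best fixed action-swap applied in hindsight, while a weighted regret objective rescales each round's comparison by a weight $w_t$.
\[
\mathrm{SwapReg}_T \;=\; \max_{\phi:\mathcal{A}\to\mathcal{A}} \sum_{t=1}^{T} \bigl(\ell_t(a_t) - \ell_t(\phi(a_t))\bigr),
\qquad
\mathrm{Reg}_T^{w} \;=\; \max_{a\in\mathcal{A}} \sum_{t=1}^{T} w_t \bigl(\ell_t(a_t) - \ell_t(a)\bigr).
\]
In this reading, the unknown weights $w_t$ are governed by the path length of the agents' policy sequence, e.g. a quantity of the form $\sum_{t=1}^{T-1} \lVert \pi_{t+1} - \pi_t \rVert_1$; keeping that path length small yields a weighted regret objective for which sufficiently adaptive algorithms guarantee sublinear regret.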