We consider the problem of simultaneous learning in stochastic games with many players in the finite-horizon setting. While the typical target solution for a stochastic game is a Nash equilibrium, this is intractable with many players. We instead focus on variants of {\it correlated equilibria}, such as those studied for extensive-form games. We begin with a hardness result for the adversarial MDP problem: even for a horizon of 3, obtaining sublinear regret against the best non-stationary policy is \textsf{NP}-hard when both rewards and transitions are adversarial. This implies that convergence to even the weakest natural solution concept -- normal-form coarse correlated equilibrium -- is not possible via black-box reduction to a no-regret algorithm even in stochastic games with constant horizon (unless $\textsf{NP}\subseteq\textsf{BPP}$). Instead, we turn to a different target: algorithms which {\it generate} an equilibrium when they are used by all players. Our main result is an algorithm that generates an {\it extensive-form} correlated equilibrium, whose runtime is exponential in the horizon but polynomial in all other parameters. We give a similar algorithm which is polynomial in all parameters for ``fast-mixing'' stochastic games. We also show a method for efficiently reaching normal-form coarse correlated equilibria in ``single-controller'' stochastic games, which follows the traditional no-regret approach. When shared randomness is available, the two generative algorithms can be extended to give simultaneous regret bounds and converge in the traditional sense.
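For concreteness, the regret notion appearing in the hardness result can be written out as follows; this is the standard formulation for finite-horizon adversarial MDPs, and the notation ($r_t$, $P_t$, $\pi_t$, $V^{\pi}_{1}$) is illustrative rather than the paper's own:
\[
\mathrm{Reg}(T) \;=\; \max_{\pi\in\Pi}\sum_{t=1}^{T} V^{\pi}_{1}(r_t,P_t)\;-\;\sum_{t=1}^{T} V^{\pi_t}_{1}(r_t,P_t),
\]
where $\Pi$ is the class of non-stationary deterministic policies, $r_t$ and $P_t$ are the adversarially chosen rewards and transitions in episode $t$, $\pi_t$ is the learner's policy in episode $t$, and $V^{\pi}_{1}(r_t,P_t)$ is the value of $\pi$ from the initial state under that episode's rewards and transitions. Sublinear regret means $\mathrm{Reg}(T)=o(T)$.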