On the convergence of policy gradient methods to Nash equilibria in general stochastic games

Learning in stochastic games is a notoriously difficult problem because, in addition to each other's strategic decisions, the players must also contend with the fact that the game itself evolves over time, possibly in a very complicated manner. Because of this, the convergence properties of popular learning algorithms, such as policy gradient and its variants, are poorly understood, except in specific classes of games (such as potential or two-player, zero-sum games). In view of this, we examine the long-run behavior of policy gradient methods with respect to Nash equilibrium policies that are second-order stationary (SOS) in a sense similar to the type of sufficiency conditions used in optimization. Our first result is that SOS policies are locally attracting with high probability, and we show that policy gradient trajectories with gradient estimates provided by the REINFORCE algorithm achieve an $\mathcal{O}(1/\sqrt{n})$ convergence rate in squared distance if the method's step-size is chosen appropriately. Subsequently, specializing to the class of deterministic Nash policies, we show that this rate can be improved dramatically and, in fact, policy gradient methods converge within a finite number of iterations in that case.
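
To make the setting concrete, the following is a minimal sketch of multi-agent policy gradient driven by REINFORCE estimates. Everything specific in it is an assumption made for illustration and is not taken from the paper: the toy two-state, two-player payoff tables, the softmax parametrization, the fixed horizon, the action-independent state transitions, the batch averaging, and the step-size schedule `gamma / (n + 1) ** p`. The game is chosen so that "always play action 1" is a deterministic Nash policy; the sketch does not attempt to reproduce the rates stated above.

```python
import math
import random

# Hypothetical toy game (not from the paper): 2 players, 2 states, 2 actions.
# PAYOFF[s][a_self][a_other] is the reward to a player choosing a_self when the
# opponent chooses a_other. Action 1 is strictly dominant in both states, so
# "always play action 1" is a deterministic Nash policy of this toy game.
PAYOFF = [
    [[1.0, -2.0], [2.0, -1.0]],  # state 0
    [[2.0, -2.0], [3.0, -1.0]],  # state 1
]
N_STATES, N_ACTIONS, N_PLAYERS, HORIZON = 2, 2, 2, 3


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if a == action else 0.0) - probs[a] for a in range(len(logits))]


def zeros():
    return [[[0.0] * N_ACTIONS for _ in range(N_STATES)] for _ in range(N_PLAYERS)]


def sample_episode(theta, rng):
    """Play one episode; return each player's total reward and score sums."""
    returns = [0.0] * N_PLAYERS
    score = zeros()  # score[i][s][a]: sum over t of grad log pi_i(a_t | s_t)
    state = rng.randrange(N_STATES)
    for _ in range(HORIZON):
        actions = [
            rng.choices(range(N_ACTIONS), weights=softmax(theta[i][state]))[0]
            for i in range(N_PLAYERS)
        ]
        for i in range(N_PLAYERS):
            returns[i] += PAYOFF[state][actions[i]][actions[1 - i]]
            g = grad_log_softmax(theta[i][state], actions[i])
            for a in range(N_ACTIONS):
                score[i][state][a] += g[a]
        state = rng.randrange(N_STATES)  # assumed: transitions ignore actions
    return returns, score


def step_size(n, gamma=0.3, p=0.6):
    """Assumed decreasing schedule; gamma and p are illustrative choices."""
    return gamma / (n + 1) ** p


def train(n_iters=2000, batch=32, seed=0):
    rng = random.Random(seed)
    theta = zeros()  # softmax logits, one vector per player and state
    for n in range(n_iters):
        grad = zeros()
        for _ in range(batch):
            returns, score = sample_episode(theta, rng)
            for i in range(N_PLAYERS):
                for s in range(N_STATES):
                    for a in range(N_ACTIONS):
                        # REINFORCE: episode return times summed score function
                        grad[i][s][a] += returns[i] * score[i][s][a] / batch
        lr = step_size(n)
        # All players take a gradient ascent step simultaneously.
        for i in range(N_PLAYERS):
            for s in range(N_STATES):
                for a in range(N_ACTIONS):
                    theta[i][s][a] += lr * grad[i][s][a]
    return [[softmax(theta[i][s]) for s in range(N_STATES)] for i in range(N_PLAYERS)]


if __name__ == "__main__":
    for i, policy in enumerate(train()):
        print("player", i, [[round(p, 3) for p in dist] for dist in policy])
```

Under these assumptions, each player's learned policy is expected to concentrate on action 1 in both states. Because the softmax parametrization keeps every probability strictly positive, this sketch can only approach the deterministic policy and does not illustrate the finite-time convergence mentioned above.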
