Reinforcement-based learning dynamics may exhibit several limitations when applied in a distributed setup. In (repeatedly-played) multi-player/action strategic-form games, when each player applies an independent copy of the learning dynamics, convergence to (usually desirable) pure Nash equilibria cannot be guaranteed. Prior work has focused only on a small class of games, namely potential and coordination games. Furthermore, strong convergence guarantees (i.e., almost-sure convergence or weak convergence) are mostly restricted to two-player games. To address this main limitation of reinforcement-based learning in repeatedly-played strategic-form games, this paper introduces a novel payoff-based learning scheme for distributed optimization in multi-player/action strategic-form games. We present an extension of perturbed learning automata (PLA), namely aspiration-based perturbed learning automata (APLA), in which each player's probability distribution over actions is reinforced both by repeated selection and by an aspiration factor that captures the player's satisfaction level. We provide a stochastic stability analysis of APLA in multi-player positive-utility games in the presence of noisy observations. This paper is the second part of the study and analyzes stochastic stability in multi-player/action weakly-acyclic games in the presence of noisy observations. We provide conditions under which convergence (in a weak sense) to the set of pure Nash equilibria and payoff-dominant equilibria is attained. To the best of our knowledge, this is the first reinforcement-based learning scheme that addresses convergence in weakly-acyclic games. Finally, we specialize the results to the classical Stag-Hunt game and support the analysis with a simulation study.
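To make the aspiration-based reinforcement idea concrete, the following is a minimal sketch (not the paper's exact APLA update) of a perturbed learning automaton with an aspiration factor, applied to a two-player Stag-Hunt game. The payoff values, step size, perturbation rate, and aspiration adaptation rate are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stag-Hunt payoffs for the row player (illustrative values); action 0 = Stag, 1 = Hare.
PAYOFF = np.array([[4.0, 0.0],
                   [3.0, 3.0]])

N_PLAYERS, N_ACTIONS = 2, 2
epsilon = 0.01   # perturbation (experimentation) rate, assumed
step = 0.05      # reinforcement step size, assumed
lam = 0.1        # aspiration adaptation rate, assumed
aspiration = np.full(N_PLAYERS, 2.0)                     # hypothetical initial aspiration levels
strategy = np.full((N_PLAYERS, N_ACTIONS), 1.0 / N_ACTIONS)  # each player's mixed strategy

for t in range(5000):
    # Perturbed action selection: with probability epsilon a player experiments
    # uniformly at random, otherwise it samples from its current strategy.
    actions = [rng.integers(N_ACTIONS) if rng.random() < epsilon
               else rng.choice(N_ACTIONS, p=strategy[i])
               for i in range(N_PLAYERS)]

    # Symmetric game: each player's payoff depends on own and opponent's action.
    payoffs = [PAYOFF[actions[0], actions[1]], PAYOFF[actions[1], actions[0]]]

    for i in range(N_PLAYERS):
        # Aspiration factor: reinforce only payoffs above the current aspiration level.
        satisfaction = max(payoffs[i] - aspiration[i], 0.0)
        e = np.zeros(N_ACTIONS)
        e[actions[i]] = 1.0
        # Linear reward-inaction style update, scaled by the satisfaction level.
        strategy[i] += step * satisfaction * (e - strategy[i])
        strategy[i] = np.clip(strategy[i], 1e-6, None)
        strategy[i] /= strategy[i].sum()
        # Aspiration tracks a running average of realized payoffs.
        aspiration[i] += lam * (payoffs[i] - aspiration[i])

print("final strategies:\n", strategy)  # expected to concentrate near a pure equilibrium
```

Under this sketch, perturbation keeps every action persistently explored while the aspiration-scaled reinforcement rewards actions whose payoffs exceed the player's satisfaction level, which is the qualitative mechanism by which play concentrates on pure (and, ideally, payoff-dominant) equilibria.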