Off-Policy Action Anticipation in Multi-Agent Reinforcement Learning

Learning anticipation in Multi-Agent Reinforcement Learning (MARL) is a reasoning paradigm in which agents anticipate the learning steps of other agents in order to improve cooperation. Because MARL relies on gradient-based optimization, learning anticipation requires Higher-Order Gradients (HOG); methods that use them are called HOG methods. Existing HOG methods are based on policy parameter anticipation, i.e., agents anticipate the changes in the policy parameters of other agents. So far, however, these methods have been applied only to differentiable games or to games with small state spaces. In this work, we demonstrate that in non-differentiable games with large state spaces, existing HOG methods perform poorly and are inefficient because of limitations inherent to policy parameter anticipation and to their multiple sampling stages. To overcome these problems, we propose Off-Policy Action Anticipation (OffPA2), a novel framework that approaches learning anticipation through action anticipation, i.e., agents anticipate the changes in the actions of other agents, via off-policy sampling. We theoretically analyze OffPA2 and use it to develop multiple HOG methods that are applicable to non-differentiable games with large state spaces. Extensive experiments show that the proposed HOG methods outperform existing ones in both efficiency and performance.
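To make the idea of learning anticipation concrete, the following minimal sketch shows policy parameter anticipation (the approach of existing HOG methods) in a toy differentiable game. It is not OffPA2 and is not taken from the paper. The zero-sum bilinear game, the value functions `v1` and `v2`, the anticipation step size `eta`, and the learning rate `lr` are all assumptions chosen for brevity, and derivatives are approximated with finite differences to keep the example dependency-free. Each agent differentiates its own value through the other agent's anticipated gradient step, which is where the higher-order gradient term appears.

```python
# Toy two-agent differentiable game with scalar policy parameters a and b.
# Both value functions are illustrative assumptions, not from the paper.
def v1(a, b):
    return a * b


def v2(a, b):
    return -a * b


def d(f, x, h=1e-4):
    """Central finite-difference derivative of a scalar function f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


def naive_grads(a, b):
    """Each agent ascends its own value and treats the other agent as fixed."""
    return d(lambda x: v1(x, b), a), d(lambda y: v2(a, y), b)


def anticipated_grads(a, b, eta=1.0):
    """Each agent differentiates its value through the other's anticipated step."""

    def v1_after_b_learns(x):
        # Agent 2's anticipated update depends on x, so differentiating this
        # function with respect to x involves a second-order derivative of v2.
        b_next = b + eta * d(lambda y: v2(x, y), b)
        return v1(x, b_next)

    def v2_after_a_learns(y):
        a_next = a + eta * d(lambda x: v1(x, y), a)
        return v2(a_next, y)

    return d(v1_after_b_learns, a), d(v2_after_a_learns, b)


def train(grad_fn, steps=200, lr=0.1, start=(1.0, 1.0)):
    """Run simultaneous gradient ascent for both agents."""
    a, b = start
    for _ in range(steps):
        ga, gb = grad_fn(a, b)
        a, b = a + lr * ga, b + lr * gb
    return a, b


def distance_to_equilibrium(point):
    """Euclidean distance to the equilibrium of this toy game at (0, 0)."""
    a, b = point
    return (a * a + b * b) ** 0.5


if __name__ == "__main__":
    print("naive:      ", distance_to_equilibrium(train(naive_grads)))
    print("anticipated:", distance_to_equilibrium(train(anticipated_grads)))
```

In this toy setting, naive simultaneous updates spiral away from the equilibrium, while the anticipated updates contract toward it. The sketch relies on the value functions being differentiable in the policy parameters, which is exactly the assumption that the abstract identifies as a limitation in non-differentiable games with large state spaces.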
