Centralised training with decentralised execution (CTDE) is an important learning paradigm in multi-agent reinforcement learning (MARL). To make progress in CTDE, we introduce Multi-Agent MuJoCo (MAMuJoCo), a novel benchmark suite that, unlike the predominant benchmark environment, the StarCraft Multi-Agent Challenge (SMAC), targets continuous robotic control tasks. To demonstrate the utility of MAMuJoCo, we present a range of benchmark results on this new suite, comparing the state-of-the-art actor-critic method MADDPG against two novel variants of existing methods; these variants outperform MADDPG on a number of MAMuJoCo tasks. In addition, we show that, in these continuous cooperative MAMuJoCo tasks, value factorisation plays a greater role in performance than the underlying algorithmic choices. This motivates extending the study of value factorisation from $Q$-learning to actor-critic algorithms.
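To make the closing claim concrete, the following is a sketch of value factorisation as studied in cooperative $Q$-learning (the notation, with $\tau_i$ and $u_i$ denoting agent $i$'s action-observation history and action, follows the VDN/QMIX literature rather than anything defined in this abstract). The joint action-value is built from per-agent utilities $Q_i$, either additively (VDN) or through a state-conditioned monotonic mixing function $f$ (QMIX), whose monotonicity ensures that each agent's greedy local action recovers the joint greedy action:

% Additive factorisation (VDN), then monotonic factorisation (QMIX)
\begin{align}
  Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{u}) &= \sum_{i=1}^{n} Q_i(\tau_i, u_i), \\
  Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{u}) &= f\bigl(Q_1(\tau_1, u_1), \ldots, Q_n(\tau_n, u_n);\, s\bigr),
  \qquad \frac{\partial f}{\partial Q_i} \ge 0.
\end{align}

Extending this idea to actor-critic methods, as the abstract advocates, amounts to factorising the centralised critic in the same way, so that decentralised actors are trained against a factored rather than a monolithic joint critic.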
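Since the abstract only names the suite, a minimal interaction sketch may also help. This assumes the interface of the publicly released multiagent_mujoco package; the class and argument names (MujocoMulti, scenario, agent_conf, agent_obsk) are taken from that package and may differ across versions.

# Minimal MAMuJoCo interaction loop: a 6-joint HalfCheetah split between
# two agents controlling 3 joints each ("2x3"); agent_obsk bounds how many
# joints away from its own each agent can observe.
# NOTE: class and argument names are assumptions based on the public
# multiagent_mujoco repository and may vary by version.
import numpy as np
from multiagent_mujoco.mujoco_multi import MujocoMulti

env = MujocoMulti(env_args={
    "scenario": "HalfCheetah-v2",
    "agent_conf": "2x3",
    "agent_obsk": 1,
    "episode_limit": 1000,
})
info = env.get_env_info()
n_agents, act_dim = info["n_agents"], info["n_actions"]

env.reset()
for _ in range(info["episode_limit"]):
    obs = env.get_obs()  # one local observation per agent
    # Decentralised execution: each agent acts on its own observation only;
    # random actions stand in for the learned decentralised policies.
    actions = [np.random.uniform(-1.0, 1.0, act_dim) for _ in range(n_agents)]
    reward, done, _ = env.step(actions)  # a single shared team reward
    if done:
        break
env.close()

The partitioning via agent_conf is what makes the suite a benchmark family rather than a single task: the same underlying robot yields different cooperative problems depending on how its joints are divided among agents and how far each agent can observe.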