We study the emergence of cooperative behaviors in reinforcement learning agents by introducing a challenging competitive multi-agent soccer environment with continuous simulated physics. We demonstrate that decentralized, population-based training with co-play can lead to a progression in agents' behaviors: from random, to simple ball chasing, and finally to showing evidence of cooperation. Our study highlights several of the challenges encountered in large-scale multi-agent training in continuous control. In particular, we demonstrate that the automatic optimization of simple shaping rewards, not themselves conducive to cooperative behavior, can lead to long-horizon team behavior. We further apply an evaluation scheme, grounded in game-theoretic principles, that can assess agent performance in the absence of pre-defined evaluation tasks or human baselines.
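To make the abstract's "evaluation scheme grounded in game-theoretic principles" concrete, here is a minimal sketch of one standard instantiation: ranking agents by their expected payoff against a Nash equilibrium of the zero-sum meta-game induced by pairwise match outcomes (as in Nash averaging). The `nash_ranking` helper and the toy win-rate matrix are illustrative assumptions, not the paper's implementation; the sketch finds a Nash mixture by linear programming rather than the maximum-entropy equilibrium.

```python
import numpy as np
from scipy.optimize import linprog

def nash_ranking(payoff):
    """Rank agents by payoff against a Nash mixture of the zero-sum
    meta-game defined by an antisymmetric payoff matrix.

    payoff[i, j] = expected score of agent i against agent j,
    centred so that payoff == -payoff.T (e.g. win rate minus 0.5).
    """
    n = payoff.shape[0]
    # Variables: mixture weights p_1..p_n and the game value v.
    # Maximise v subject to payoff.T @ p >= v (the mixture p does at
    # least v against every pure opponent strategy).
    c = np.zeros(n + 1)
    c[-1] = -1.0                                    # linprog minimises, so minimise -v
    A_ub = np.hstack([-payoff.T, np.ones((n, 1))])  # -payoff.T @ p + v <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])  # weights sum to 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]      # v is unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    p = res.x[:n]
    # Score each agent by its expected payoff against the Nash mixture.
    return payoff @ p

# Toy data (assumed): agent 2 beats both others, agent 1 beats agent 0.
wins = np.array([[0.5, 0.3, 0.2],
                 [0.7, 0.5, 0.4],
                 [0.8, 0.6, 0.5]])
print(nash_ranking(wins - 0.5))  # highest score -> strongest agent
```

The appeal of such a scheme is exactly what the abstract claims: it needs only agents' head-to-head results, so performance can be assessed without pre-defined evaluation tasks or human baselines.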