Beyond Rewards: a Hierarchical Perspective on Offline Multiagent Behavioral Analysis

Each year, expert-level performance is attained in increasingly complex multiagent domains, with notable examples including Go, Poker, and StarCraft II. This rapid progression is accompanied by a commensurate need to better understand how such agents attain this performance, to enable their safe deployment, identify limitations, and reveal potential means of improving them. In this paper we take a step back from performance-focused multiagent learning, and instead turn our attention towards agent behavior analysis. We introduce a model-agnostic method for discovery of behavior clusters in multiagent domains, using variational inference to learn a hierarchy of behaviors at the joint and local agent levels. Our framework makes no assumption about agents' underlying learning algorithms, does not require access to their latent states or models, and can be trained using entirely offline observational data. We illustrate the effectiveness of our method for enabling the coupled understanding of behaviors at the joint and local agent level, detection of behavior changepoints throughout training, and discovery of core behavioral concepts (e.g., those that facilitate higher returns), and we demonstrate the approach's scalability to a high-dimensional multiagent MuJoCo control domain.
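
To make the phrase "variational inference to learn a hierarchy of behaviors at the joint and local agent levels" concrete, the following is a generic sketch of the kind of objective such a model could optimize. It is not taken from the paper. All symbols are assumptions introduced here for illustration: $N$ agents, offline per-agent trajectories $\tau^i$ with joint trajectory $\tau = (\tau^1, \dots, \tau^N)$, a joint-level latent $z_\omega$, local latents $z_\alpha^i$, a generative model $p_\theta$, and an approximate posterior $q_\phi$.

```latex
% Assumed generative model: a joint behavior latent conditions each local latent.
p_\theta(\tau, z_\omega, z_\alpha^{1:N})
  = p(z_\omega) \prod_{i=1}^{N} p_\theta(z_\alpha^i \mid z_\omega)\, p_\theta(\tau^i \mid z_\alpha^i)

% Assumed posterior factorization.
q_\phi(z_\omega, z_\alpha^{1:N} \mid \tau)
  = q_\phi(z_\omega \mid \tau) \prod_{i=1}^{N} q_\phi(z_\alpha^i \mid \tau^i, z_\omega)

% Resulting evidence lower bound (ELBO) on \log p_\theta(\tau).
\log p_\theta(\tau) \ge
  \mathbb{E}_{q_\phi}\!\left[ \sum_{i=1}^{N} \log p_\theta(\tau^i \mid z_\alpha^i) \right]
  - \mathrm{KL}\!\left( q_\phi(z_\omega \mid \tau) \,\|\, p(z_\omega) \right)
  - \mathbb{E}_{q_\phi(z_\omega \mid \tau)}\!\left[ \sum_{i=1}^{N}
      \mathrm{KL}\!\left( q_\phi(z_\alpha^i \mid \tau^i, z_\omega) \,\|\, p_\theta(z_\alpha^i \mid z_\omega) \right) \right]
```

Under these assumptions, the first term rewards reconstructing each agent's observed trajectory from its local latent, while the two KL terms keep the joint and local latents close to their priors. Only observational trajectories appear in the bound, which is consistent with the abstract's claim that no access to agents' internal states or learning algorithms is needed.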
