Many recent breakthroughs in multi-agent reinforcement learning (MARL) require the use of deep neural networks, which are challenging for human experts to interpret and understand. On the other hand, existing work on interpretable RL has shown promise in extracting more interpretable decision tree-based policies, but only in the single-agent setting. To fill this gap, we propose the first set of interpretable MARL algorithms that extract decision-tree policies from neural networks trained with MARL. The first algorithm, IVIPER, extends VIPER, a recent method for single-agent interpretable RL, to the multi-agent setting. We demonstrate that IVIPER can learn high-quality decision-tree policies for each agent. To better capture coordination between agents, we propose a novel centralized decision-tree training algorithm, MAVIPER. MAVIPER jointly grows the trees of each agent by predicting the behavior of the other agents using their anticipated trees, and uses a resampling scheme to focus on states that are critical for each agent's interactions with the other agents. We show that both algorithms generally outperform the baselines and that MAVIPER-trained agents achieve better-coordinated performance than IVIPER-trained agents on three different multi-agent particle-world environments. A minimal sketch of this joint-training idea appears below.
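To make the joint tree-growing and resampling idea concrete, here is a minimal, illustrative sketch of a MAVIPER-style distillation loop in Python. It is not the authors' implementation: `env`, `expert_policies`, `criticality`, and all hyperparameters are hypothetical placeholders, and scikit-learn's `DecisionTreeClassifier` stands in for the per-agent tree learner. The sketch assumes a DAgger-style outer loop in which each agent's dataset is relabeled by its expert and resampled by an importance weight (e.g., a value-gap estimate from a centralized critic).

```python
# Hypothetical sketch of a MAVIPER-style joint decision-tree distillation loop.
# env, expert_policies, and criticality are assumed interfaces, not real APIs.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def maviper_sketch(env, expert_policies, criticality, n_agents,
                   n_iters=10, max_depth=4, seed=0):
    rng = np.random.default_rng(seed)
    datasets = [[] for _ in range(n_agents)]  # per-agent DAgger datasets
    trees = [None] * n_agents                 # current decision-tree policies

    for _ in range(n_iters):
        # Roll out the current trees jointly; fall back to the neural
        # experts on the first iteration, before any tree exists.
        obs, done = env.reset(), False
        while not done:
            acts = [trees[i].predict([obs[i]])[0] if trees[i] is not None
                    else expert_policies[i](obs[i])
                    for i in range(n_agents)]
            # Relabel each visited state with the expert action and an
            # importance weight reflecting how critical the state is for
            # coordination (hypothetical critic-based score).
            for i in range(n_agents):
                w = criticality(i, obs, acts)
                datasets[i].append((obs[i], expert_policies[i](obs[i]), w))
            obs, done = env.step(acts)

        # Jointly re-fit all trees, resampling states by importance so each
        # agent's tree focuses on interaction-critical states.
        for i in range(n_agents):
            states, labels, weights = map(np.array, zip(*datasets[i]))
            idx = rng.choice(len(states), size=len(states),
                             p=weights / weights.sum())
            trees[i] = DecisionTreeClassifier(max_depth=max_depth)
            trees[i].fit(states[idx], labels[idx])
    return trees
```

Because each agent's rollout actions come from the other agents' current (anticipated) trees rather than from their experts alone, the trees are grown jointly, which is the key difference from running single-agent VIPER independently per agent.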