We propose a unified mechanism for achieving coordination and communication in Multi-Agent Reinforcement Learning (MARL): rewarding agents for having causal influence over other agents' actions. Causal influence is assessed using counterfactual reasoning. At each timestep, an agent simulates alternate actions that it could have taken and computes their effect on the behavior of other agents. Actions that lead to bigger changes in other agents' behavior are considered influential and are rewarded. We show that this is equivalent to rewarding agents for having high mutual information between their actions and the actions of other agents. Empirical results demonstrate that influence leads to enhanced coordination and communication in challenging social dilemma environments, dramatically improving the learning curves of the deep RL agents and leading to more meaningful learned communication protocols. The influence rewards for all agents can be computed in a decentralized way by enabling agents to learn a model of other agents using deep neural networks. In contrast, key previous works on emergent communication in the MARL setting were unable to learn diverse policies in a decentralized manner and had to resort to centralized training. Consequently, the influence reward opens up a window of new opportunities for research in this area.
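To make the counterfactual procedure concrete, the following is a minimal sketch, not the paper's exact implementation. It assumes discrete actions and that agent A already has a learned model of agent B's conditional policy, here represented as a hypothetical probability table `cond_probs`; the uniform averaging over A's counterfactual actions to form the marginal is a simplifying assumption.

```python
import numpy as np

def influence_reward(cond_probs, actual_action):
    """Counterfactual influence reward for agent A on agent B.

    cond_probs[i] holds agent B's predicted action distribution given
    that agent A took action i (one row per counterfactual action of A).
    actual_action is the index of the action A actually took.

    Returns KL( p(a_B | a_A) || p(a_B) ), where p(a_B) is obtained by
    marginalizing over A's counterfactual actions (uniform weighting
    here, as a simplification). Averaged over states, this KL term is
    the mutual information between A's and B's actions.
    """
    cond = cond_probs[actual_action]
    marginal = cond_probs.mean(axis=0)  # marginalize over A's actions
    return float(np.sum(cond * np.log(cond / marginal)))

# Example: B's behavior shifts strongly with A's action -> high influence.
probs = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
r = influence_reward(probs, actual_action=0)

# If B ignores A (identical rows), the influence reward is zero.
r_zero = influence_reward(np.array([[0.5, 0.5],
                                    [0.5, 0.5]]), actual_action=0)
```

If A's action barely changes B's predicted distribution, the conditional matches the marginal and the reward vanishes, which is exactly why maximizing this reward pushes agents toward actions that carry information for others.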