In cooperative multi-agent reinforcement learning, a collection of agents learns to interact in a shared environment to achieve a common goal. We propose the use of reward machines (RMs) -- Mealy machines used as structured representations of reward functions -- to encode the team's task. The proposed novel interpretation of RMs in the multi-agent setting explicitly encodes required teammate interdependencies, allowing the team-level task to be decomposed into sub-tasks for individual agents. We define such a notion of RM decomposition and present algorithmically verifiable conditions guaranteeing that distributed completion of the sub-tasks leads to team behavior accomplishing the original task. This framework for task decomposition provides a natural approach to decentralized learning: agents may learn to accomplish their sub-tasks while observing only their local state and abstracted representations of their teammates. We accordingly propose a decentralized Q-learning algorithm. Furthermore, in the case of undiscounted rewards, we use local value functions to derive lower and upper bounds for the global value function corresponding to the team task. Experimental results in three discrete settings demonstrate the effectiveness of the proposed RM decomposition approach, which converges to a successful team policy an order of magnitude faster than a centralized learner and significantly outperforms hierarchical and independent Q-learning approaches.
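To make the reward machine formalism concrete, the following is a minimal sketch, not the paper's implementation: a reward machine as a Mealy machine whose abstract states track progress through the task, whose inputs are high-level events observed in the environment, and whose transitions emit scalar rewards. The `RewardMachine` class, the two-agent button-pressing task, and the event names are illustrative assumptions introduced here for exposition only.

```python
from dataclasses import dataclass, field

@dataclass
class RewardMachine:
    """Illustrative reward machine: a Mealy machine whose transitions emit rewards.

    States are abstract stages of the team task; inputs are high-level events
    (labels) observed in the environment; each transition outputs a scalar reward.
    """
    initial_state: str
    # (state, event) -> (next_state, reward); unlisted pairs self-loop with 0 reward
    transitions: dict = field(default_factory=dict)

    def step(self, state: str, event: str):
        return self.transitions.get((state, event), (state, 0.0))


# Hypothetical two-agent task: agent 1 presses its button, then agent 2 presses
# its button, then the team reaches the goal; only task completion is rewarded.
team_rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "a1_button"): ("u1", 0.0),    # agent 1 pressed its button
        ("u1", "a2_button"): ("u2", 0.0),    # agent 2 pressed its button
        ("u2", "goal"):      ("done", 1.0),  # team reaches the goal
    },
)

state = team_rm.initial_state
for event in ["a1_button", "a2_button", "goal"]:
    state, reward = team_rm.step(state, event)
    print(state, reward)  # u1 0.0 / u2 0.0 / done 1.0
```

In this toy encoding, the ordering of the machine's states captures the teammate interdependency (agent 2's button matters only after agent 1's), which is the kind of structure the proposed decomposition exploits to assign each agent its own sub-task.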