Models as Agents: Optimizing Multi-Step Predictions of Interactive Local Models in Model-Based Multi-Agent Reinforcement Learning

Research in model-based reinforcement learning has made significant progress in recent years. Compared to single-agent settings, the dimension of the joint state-action space in multi-agent systems grows exponentially with the number of agents, which dramatically increases the complexity of the environment dynamics. This makes it infeasible to learn an accurate global model and necessitates the use of agent-wise local models. However, during multi-step model rollouts, the prediction of one local model can affect the predictions of other local models at the next step. As a result, local prediction errors can propagate to other localities and eventually give rise to considerably large global errors. Furthermore, since the models are generally used to make multi-step predictions, simply minimizing one-step prediction errors without regard to their long-term effect on other models may further aggravate the propagation of local errors. To this end, we propose Models as AGents (MAG), a multi-agent model optimization framework that reverses the usual roles during the model rollout process: it treats the local models as multi-step decision-making agents and the current policies as the dynamics. In this way, the local models are able to consider their multi-step mutual influence on one another before making predictions. Theoretically, we show that the objective of MAG is approximately equivalent to maximizing a lower bound of the true environment return. Experiments on the challenging StarCraft II benchmark demonstrate the effectiveness of MAG.
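The error-propagation argument above can be illustrated with a minimal sketch. This is not the MAG algorithm and not taken from the paper: the two-agent linear dynamics, the coupling coefficients, the bias `eps`, and the function names `true_step`, `model_step`, and `rollout_errors` are all hypothetical choices made only for illustration, and policies are omitted entirely. The sketch assumes that agent 1's local model is exact while agent 2's local model has a small one-step bias, and then rolls both models out on their own predictions.

```python
def true_step(state):
    """Toy coupled dynamics (an assumption): each agent's next local state
    depends on both agents' current local states."""
    s1, s2 = state
    return (0.9 * s1 + 0.3 * s2, 0.3 * s1 + 0.9 * s2)


def model_step(state, eps=0.01):
    """Hypothetical agent-wise local models: model 1 matches the true
    dynamics exactly, model 2 adds a small one-step bias eps."""
    s1, s2 = state
    return (0.9 * s1 + 0.3 * s2, 0.3 * s1 + 0.9 * s2 + eps)


def rollout_errors(s0, horizon, eps=0.01):
    """Roll the true dynamics and the local models forward from s0.

    The models are fed their own previous predictions, as in a multi-step
    model rollout. Returns a list of (agent 1 error, agent 2 error) pairs,
    one per step, where each error is an absolute state difference.
    """
    s_true, s_model = s0, s0
    errors = []
    for _ in range(horizon):
        s_true = true_step(s_true)
        s_model = model_step(s_model, eps)
        errors.append((abs(s_model[0] - s_true[0]),
                       abs(s_model[1] - s_true[1])))
    return errors
```

Calling `rollout_errors((1.0, 1.0), horizon=10)` gives a zero error for agent 1 at the first step, since its model is exact and its inputs are still correct. From the second step onward agent 1's prediction is also wrong, because it consumes agent 2's biased prediction, and in this toy setup the summed error keeps growing with the horizon. A one-step loss would never expose this effect for agent 1, which is the motivation the abstract gives for optimizing multi-step predictions jointly.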
