In this paper, we propose a novel framework for designing a fast-converging multi-agent reinforcement learning (MARL)-based medium access control (MAC) protocol operating in a single-cell scenario. The user equipments (UEs) are cast as learning agents that must learn a proper signaling policy to coordinate the transmission of protocol data units (PDUs) to the base station (BS) over shared radio resources. Many MARL tasks adopt the conventional centralized training with decentralized execution (CTDE) paradigm, in which every agent receives the same global extrinsic reward from the environment. However, this approach typically requires long training times. To overcome this drawback, we adopt the concept of learning a per-agent intrinsic reward, whereby each agent learns a distinct intrinsic reward signal based solely on its individual behavior. Moreover, to obtain an intrinsic reward function that accounts for the long-term training history, we represent it as a long short-term memory (LSTM) network. As a result, each agent updates its policy network considering both the extrinsic reward, which characterizes the cooperative task, and the intrinsic reward, which reflects local dynamics. The proposed learning framework yields faster convergence and higher transmission performance than the baselines. Simulation results show that the proposed learning solution achieves a 75% improvement in convergence speed compared to the best-performing baseline.
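The reward combination described above can be sketched in code. This is a hypothetical minimal illustration, not the paper's implementation: the class names, fixed weights, and the single-unit recurrence standing in for the LSTM are all assumptions made for brevity. It shows the core idea that each agent summarizes its own local history with a recurrent model and adds the resulting intrinsic reward, scaled by a mixing coefficient, to the shared extrinsic reward before a policy update.

```python
import math

class IntrinsicRewardModel:
    """Toy stand-in for the per-agent LSTM intrinsic-reward network:
    a single recurrent unit whose hidden state carries long-term history.
    Weights are fixed illustrative values; in the paper they are learned."""
    def __init__(self):
        self.h = 0.0                     # hidden state (local history summary)
        self.w_in, self.w_h = 0.5, 0.9   # hypothetical input/recurrent weights

    def step(self, local_obs: float) -> float:
        # Update the hidden state from this agent's own observation only,
        # so the intrinsic signal depends solely on individual behavior.
        self.h = math.tanh(self.w_in * local_obs + self.w_h * self.h)
        return self.h                    # intrinsic reward for this step

def total_reward(extrinsic: float, intrinsic: float, beta: float = 0.1) -> float:
    # Policy updates are driven by the shared extrinsic (cooperative) reward
    # plus a weighted per-agent intrinsic term; beta is an assumed coefficient.
    return extrinsic + beta * intrinsic

# One step for a single agent: local observation in, combined reward out.
model = IntrinsicRewardModel()
r_int = model.step(local_obs=1.0)
r = total_reward(extrinsic=1.0, intrinsic=r_int)
```

In a full training loop each agent would hold its own `IntrinsicRewardModel`, and the intrinsic network's parameters would themselves be trained (e.g., by meta-gradients through the policy update), which is what lets the learned signal accelerate convergence.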