In this work, we investigate constrained multi-agent reinforcement learning (CMARL), where agents collaboratively maximize the sum of their local objectives while satisfying individual safety constraints. We propose a framework in which agents adopt coupled policies that depend on their own local states and parameters as well as those of their $κ_p$-hop neighbors, where $κ_p>0$ denotes the coupling distance. Under this framework, we develop a distributed primal-dual algorithm in which each agent has access only to state-action pairs within its $2κ_p$-hop neighborhood and to reward information within its $(κ+2κ_p)$-hop neighborhood, where $κ>0$ is the truncation distance. Moreover, agents are not permitted to share their true policy parameters or Lagrange multipliers directly. Instead, each agent constructs and maintains local estimates of these variables for the other agents and uses these estimates to execute its policy. These estimates are updated and exchanged exclusively over an independent, time-varying network, which enhances overall system security. We establish that, with high probability, the proposed algorithm converges to an $ε$-first-order stationary point with an approximation error of $\mathcal{O}(γ^{\frac{κ+1}{κ_{p}}})$, where $γ\in(0,1)$ is the discount factor. Finally, simulations in a GridWorld environment demonstrate the effectiveness of the proposed algorithm.
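The following is a minimal, illustrative sketch (not the paper's implementation) of the kind of distributed primal-dual update described above: each agent keeps local estimates of the other agents' policy parameters and Lagrange multipliers, mixes those estimates over a time-varying communication graph, and then performs a primal step on its own parameter estimate and a dual step on its own multiplier estimate using only locally available quantities. All names (`n_agents`, `dim`, `mixing_matrix`, `local_grad`, `constraint_violation`) and the update rules are assumptions for illustration, not definitions from the paper.

```python
import numpy as np

n_agents, dim = 5, 8
rng = np.random.default_rng(0)

# theta_est[i, j] : agent i's local estimate of agent j's policy parameters
theta_est = rng.normal(size=(n_agents, n_agents, dim))
# lam_est[i, j]   : agent i's local estimate of agent j's Lagrange multiplier
lam_est = np.abs(rng.normal(size=(n_agents, n_agents)))

def mixing_matrix(t: int) -> np.ndarray:
    """Doubly stochastic weights of a (toy, ring-shaped) time-varying communication graph."""
    W = np.eye(n_agents) * 0.5
    for i in range(n_agents):
        W[i, (i + 1) % n_agents] += 0.25
        W[i, (i - 1) % n_agents] += 0.25
    return W

def local_grad(i: int, theta_row: np.ndarray, lam_row: np.ndarray) -> np.ndarray:
    """Placeholder for agent i's truncated policy-gradient estimate of its local Lagrangian."""
    return -theta_row[i] + 0.1 * lam_row[i] * rng.normal(size=dim)

def constraint_violation(i: int, theta_row: np.ndarray) -> float:
    """Placeholder for agent i's estimate of its safety-constraint violation."""
    return float(np.tanh(theta_row[i].sum()))

eta_theta, eta_lam = 0.05, 0.02
for t in range(10):
    W = mixing_matrix(t)
    # Consensus step: agents exchange *estimates* only, never their true parameters.
    theta_est = np.einsum('ik,kjd->ijd', W, theta_est)
    lam_est = W @ lam_est
    # Primal-dual step: each agent updates only its own entry of its estimate tables.
    for i in range(n_agents):
        theta_est[i, i] += eta_theta * local_grad(i, theta_est[i], lam_est[i])
        lam_est[i, i] = max(0.0, lam_est[i, i] + eta_lam * constraint_violation(i, theta_est[i]))
```

The projection `max(0.0, ...)` keeps each multiplier estimate nonnegative, the standard dual-feasibility requirement in primal-dual schemes; the placeholder gradients would be replaced by the truncated estimators built from the $2κ_p$-hop state-action and $(κ+2κ_p)$-hop reward information described in the abstract.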