We study distributionally robust Markov games (DR-MGs) with the average-reward criterion, a framework for multi-agent decision-making under uncertainty over extended horizons. In average-reward DR-MGs, agents aim to maximize their worst-case infinite-horizon average reward, ensuring satisfactory performance under environment uncertainty and opponents' actions. We first establish a connection between best-response policies and the optimal policies of the induced single-agent problems. Under a standard irreducibility assumption, we derive a correspondence between the optimal policies and the solutions of the robust Bellman equation, and use these results to establish the existence of a stationary Nash equilibrium (NE). We further study DR-MGs in the weakly communicating setting, where we construct a set-valued map, show that it is upper hemicontinuous with convex values contained in the set of best-response policies, and again derive the existence of an NE. We then explore algorithmic solutions: we first propose a Robust Nash-Iteration algorithm and provide convergence guarantees under additional assumptions and access to an NE-computing oracle, and we further develop a temporal-difference-based algorithm for DR-MGs whose convergence guarantees require no additional oracle or assumptions. Finally, we connect average-reward robust NEs to their discounted counterparts, showing that average-reward robust NEs can be approximated by discounted ones when the discount factor is sufficiently large. Our study provides a comprehensive theoretical and algorithmic foundation for decision-making in complex, uncertain, and long-running multi-agent environments.
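For concreteness, here is a minimal sketch of the robust Bellman equation referenced above, written for an induced single-agent average-reward robust MDP; the notation ($g$ for the gain, $h$ for the bias, $\mathcal{P}(s,a)$ for the uncertainty set) is ours, and the exact multi-agent form used in the paper may differ:
\[
g + h(s) \;=\; \max_{a \in \mathcal{A}} \Big[\, r(s,a) \;+\; \min_{p \in \mathcal{P}(s,a)} \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, h(s') \,\Big], \qquad s \in \mathcal{S}.
\]
Under the irreducibility assumption, a stationary policy attaining the outer maximum at every state is optimal for the induced single-agent problem, which is the correspondence that the existence argument builds on.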
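The final approximation result can be read through the standard vanishing-discount relation between the two criteria; as a sketch under suitable ergodicity assumptions (not the paper's exact statement), the robust discounted value $V_\gamma$ and the robust average reward $g$ of a fixed policy satisfy
\[
\lim_{\gamma \to 1^-} (1-\gamma)\, V_\gamma(s) \;=\; g \qquad \text{for every } s \in \mathcal{S},
\]
so robust NEs of the discounted game approximate average-reward robust NEs as the discount factor $\gamma$ approaches $1$.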