兼并睡觉的强盗与公平制约 (Combinatorial Sleeping Bandits with Fairness Constraints)

The multi-armed bandit (MAB) model has been widely adopted for studying many practical optimization problems (resource allocation, ad placement, crowdsourcing, etc.) with unknown parameters. The goal of the player here is to maximize the cumulative reward in the face of uncertainty. However, the basic MAB model neglects several important factors of the system in many real-world applications, where multiple arms can be simultaneously played and an arm could sometimes be "sleeping". Besides, ensuring fairness is also a key design concern in practice. To that end, we propose a new Combinatorial Sleeping MAB model with Fairness constraints, called CSMAB-F, aiming to address the aforementioned crucial modeling issues. The objective is now to maximize the reward while satisfying the fairness requirement of a minimum selection fraction for each individual arm. To tackle this new problem, we extend an online learning algorithm, called UCB, to deal with a critical tradeoff between exploitation and exploration and employ the virtual queue technique to properly handle the fairness constraints. By carefully integrating these two techniques, we develop a new algorithm, called Learning with Fairness Guarantee (LFG), for the CSMAB-F problem. Further, we rigorously prove that not only LFG is feasibility-optimal, but it also has a time-average regret upper bounded by \frac{N}{2\eta}+\frac{\beta_1\sqrt{mNT\log{T}}+\beta_2 N}{T}, where N is the total number of arms, m is the maximum number of arms that can be simultaneously played, T is the time horizon, \beta_1 and \beta_2 are constants, and \eta is a design parameter that we can tune. Finally, we perform extensive simulations to corroborate the effectiveness of the proposed algorithm. Interestingly, the simulation results reveal an important tradeoff between the regret and the speed of convergence to a point satisfying the fairness constraints.

翻译：多臂强盗模式(MAB)已被广泛采用,用于研究许多实际优化问题(资源分配、广告放置、众包等),且其参数未知。这里玩家的目标是在面临不确定性的情况下最大限度地增加累积奖励。然而,基本的MAB模式在许多现实世界应用中忽略了系统的若干重要因素,在现实世界应用中,可以同时播放多臂,手臂有时可以“睡觉 ” 。此外,确保公平也是实践中一个关键的设计问题。为此,我们提出了一个新的组合式睡眠模型,该模型与公平制约有关,称为CSMAAB-F,旨在解决上述至关重要的模型问题。现在的目标是在满足每个手臂最低选择部分的公平要求的同时,最大限度地获得奖励。为了解决这个新问题,我们扩展了在线学习算法,称为UCB,处理开采和勘探之间的重大交易,使用虚拟排队技术来正确处理公平制约。通过仔细整合这两种技术,我们开发了一个新的算法,称为公平保证(LFG),这是CSMAAB-F的总时间值2, 高级武器设计是稳定的,这是我们所要严格证明的数值。