We extend the notion of regret with a welfarist perspective. Focusing on the classic multi-armed bandit (MAB) framework, the current work quantifies the performance of bandit algorithms by applying a fundamental welfare function, namely the Nash social welfare (NSW) function. This corresponds to equating an algorithm's performance with the geometric mean of its expected rewards and leads us to the study of Nash regret, defined as the difference between the a priori unknown optimal mean (among the arms) and the algorithm's performance. Since NSW is known to satisfy fairness axioms, our approach complements the utilitarian considerations of average (cumulative) regret, wherein the algorithm is evaluated via the arithmetic mean of its expected rewards. This work develops an algorithm that, given the horizon of play $T$, achieves a Nash regret of $O \left( \sqrt{\frac{k \log T}{T}} \right)$, where $k$ denotes the number of arms in the MAB instance. Since, for any algorithm, the Nash regret is at least as large as its average regret (by the AM-GM inequality), the known lower bound on average regret holds for Nash regret as well. Therefore, our Nash regret guarantee is essentially tight. In addition, we develop an anytime algorithm with a Nash regret guarantee of $O \left( \sqrt{\frac{k \log T}{T}} \log T \right)$.
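Concretely, writing $\mu_i$ for the mean reward of arm $i$, $\mu^* := \max_i \mu_i$, and $I_t$ for the arm pulled in round $t$ (this notation is chosen here for illustration and is not fixed by the abstract), the two notions of regret contrasted above can be sketched as
\[
  \mathrm{NR}_T \;=\; \mu^* - \left( \prod_{t=1}^{T} \mathbb{E}\left[\mu_{I_t}\right] \right)^{1/T}
  \qquad \text{and} \qquad
  \mathrm{AR}_T \;=\; \mu^* - \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[\mu_{I_t}\right].
\]
By the AM-GM inequality, the geometric mean of the expected rewards is at most their arithmetic mean, so $\mathrm{NR}_T \geq \mathrm{AR}_T$; this is the comparison that lets the known lower bound on average regret transfer to Nash regret.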