In this paper, we study the stochastic combinatorial multi-armed bandit problem under semi-bandit feedback. While much work has been done on algorithms that optimize the expected reward for linear and some general reward functions, we study a variant of the problem in which the objective is risk-aware. More specifically, we consider the problem of maximizing the Conditional Value-at-Risk (CVaR), a risk measure that accounts only for the worst-case tail of the reward distribution. We propose new algorithms that maximize the CVaR of the rewards obtained from the super arms of the combinatorial bandit in two cases: Gaussian arm rewards and bounded arm rewards. We further analyze these algorithms and provide regret bounds. We believe that our results provide the first theoretical insights into combinatorial semi-bandit problems in the risk-aware case.
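To make the CVaR objective concrete, the following minimal sketch estimates CVaR from observed rewards as the average of the lowest alpha-fraction of samples. The function name `empirical_cvar`, the level parameter `alpha`, and the sample values are illustrative assumptions and do not come from the paper; the paper's own definition and estimators may use a different convention.

```python
import math


def empirical_cvar(rewards, alpha):
    """Average of the worst (lowest) alpha-fraction of the observed rewards.

    Hypothetical helper for illustration only; not the paper's estimator.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not rewards:
        raise ValueError("rewards must be non-empty")
    ordered = sorted(rewards)
    # Number of tail samples; the small tolerance guards against
    # floating-point overshoot in alpha * n (e.g. 0.3 * 10).
    k = max(1, math.ceil(alpha * len(ordered) - 1e-9))
    tail = ordered[:k]
    return sum(tail) / k


# Two hypothetical super arms with the same mean reward (5.0)
# but very different lower tails.
safe_arm = [4, 5, 5, 6]
risky_arm = [0, 0, 10, 10]

safe_cvar = empirical_cvar(safe_arm, alpha=0.5)
risky_cvar = empirical_cvar(risky_arm, alpha=0.5)
```

Under an expected-reward objective the two hypothetical super arms are equivalent, whereas a CVaR-maximizing learner would prefer `safe_arm`, because its worst half of outcomes averages higher.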