Existing risk-aware multi-armed bandit models typically focus on risk measures of individual options, such as variance. As a result, they cannot be directly applied to important real-world online decision-making problems with correlated options. In this paper, we propose a novel Continuous Mean-Covariance Bandit (CMCB) model to explicitly take option correlation into account. Specifically, in CMCB, a learner sequentially chooses weight vectors over given options and observes random feedback according to these decisions. The learner's objective is to achieve the best trade-off between reward and risk, where risk is measured with the option covariance. To capture important reward observation scenarios in practice, we consider three feedback settings, i.e., full-information, semi-bandit and full-bandit feedback. We propose novel algorithms with optimal regrets (within logarithmic factors), and provide matching lower bounds to validate their optimality. Our experimental results also demonstrate the superiority of the proposed algorithms. To the best of our knowledge, this is the first work that considers option correlation in risk-aware bandits and explicitly quantifies how arbitrary covariance structures impact the learning performance.
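The mean-covariance trade-off at the heart of CMCB can be sketched in a few lines. Below is a minimal illustration assuming a *known* mean vector and covariance matrix with hypothetical values and a hypothetical risk-aversion parameter `beta`; in the actual bandit setting these quantities are unknown and must be estimated from the observed feedback.

```python
import numpy as np

def mean_covariance_objective(w, theta, Sigma, beta):
    """Expected reward w^T theta minus beta times the risk w^T Sigma w.

    Hypothetical objective form for illustration; the learner in CMCB
    does not observe theta or Sigma directly.
    """
    return w @ theta - beta * (w @ Sigma @ w)

# Hypothetical problem instance with 3 correlated options.
theta = np.array([0.5, 0.3, 0.8])            # mean rewards
Sigma = np.array([[ 0.2 , 0.05, -0.1],
                  [ 0.05, 0.1 ,  0.0],
                  [-0.1 , 0.0 ,  0.3]])      # option covariance (note correlations)
beta = 1.0                                   # risk-aversion weight

# Evaluate a few candidate weight vectors on the simplex.
candidates = [np.array([1.0, 0.0, 0.0]),     # all-in on option 0
              np.array([0.0, 0.0, 1.0]),     # all-in on option 2
              np.array([0.4, 0.2, 0.4])]     # a diversified mixture
scores = [mean_covariance_objective(w, theta, Sigma, beta) for w in candidates]
best = candidates[int(np.argmax(scores))]
```

Note that with correlated options the diversified mixture can outscore any single option, since negative correlations reduce the portfolio variance term; this is exactly the effect that per-option risk measures such as variance cannot capture.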