Linear bandits have a wide variety of applications, including recommendation systems, yet they make one strong assumption: the algorithm must know an upper bound $S$ on the norm of the unknown parameter $\theta^*$ that governs reward generation. Such an assumption forces the practitioner to guess the value of $S$ used in the confidence bound, leaving no choice but to hope that $\|\theta^*\|\le S$ holds so that the regret is guaranteed to be low. In this paper, we propose, for the first time, algorithms that do not require such knowledge. Specifically, we propose two algorithms and analyze their regret bounds: one for the changing arm set setting and the other for the fixed arm set setting. Our regret bound for the former shows that the price of not knowing $S$ does not affect the leading term of the regret bound and inflates only the lower-order term. For the latter, we pay no price in the regret for not knowing $S$. Our numerical experiments show that standard algorithms assuming knowledge of $S$ can fail catastrophically when $\|\theta^*\|\le S$ does not hold, whereas our algorithms enjoy low regret.
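For context, the following is a minimal sketch (not this paper's algorithm) of where the guessed bound $S$ typically enters a standard OFUL-style confidence set; the regularizer $\lambda$, noise level $\sigma$, and failure probability $\delta$ are assumed notation introduced only for this illustration.
% Sketch: a standard OFUL-style confidence set in which the guessed bound S appears.
% Here \hat\theta_t is the ridge regression estimate with regularizer \lambda,
% V_t = \lambda I + \sum_{s=1}^{t} x_s x_s^\top, \sigma is the sub-Gaussian noise level,
% and \delta is the failure probability (all assumed notation for illustration).
\[
  \mathcal{C}_t
  = \Bigl\{ \theta \in \mathbb{R}^d :
      \bigl\| \theta - \hat\theta_t \bigr\|_{V_t}
      \le \sigma \sqrt{2\log\tfrac{1}{\delta} + \log\tfrac{\det V_t}{\lambda^d}}
        + \sqrt{\lambda}\, S
    \Bigr\}
\]
% If the guess S is too small (\|\theta^*\| > S), \theta^* may fall outside \mathcal{C}_t,
% and the optimism argument behind the usual regret guarantee breaks down.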