We study the problem of combining multiple bandit algorithms (that is, online learning algorithms with partial feedback) with the goal of creating a master algorithm that performs almost as well as the best base algorithm would if run on its own. The main challenge is that when run under a master, base algorithms unavoidably receive much less feedback, so it is critical that the master not starve a base algorithm that might perform uncompetitively initially but would eventually outperform the others given enough feedback. We address this difficulty by devising a version of Online Mirror Descent with a special mirror map together with a sophisticated learning rate scheme. We show that this approach achieves a more delicate balance between exploiting and exploring base algorithms than previous works, yielding superior regret bounds. Our results are applicable to many settings, such as multi-armed bandits, contextual bandits, and convex bandits. As examples, we present two main applications. The first is an algorithm that enjoys worst-case robustness while performing much better when the environment is relatively easy. The second is an algorithm that works simultaneously under different assumptions about the environment, such as different priors or different loss structures.
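To make the Online-Mirror-Descent component concrete, here is a minimal sketch of a single OMD update over the distribution with which the master samples base algorithms. We assume a log-barrier mirror map ψ(p) = (1/η) Σᵢ ln(1/pᵢ) — the abstract says only "a special mirror map", so this particular choice, the function name `log_barrier_omd_step`, and the fixed scalar learning rate are our illustrative assumptions, and the paper's adaptive per-coordinate learning-rate scheme is omitted:

```python
import numpy as np

def log_barrier_omd_step(p, loss, eta):
    """One OMD step under the log-barrier mirror map
    psi(p) = (1/eta) * sum_i ln(1/p_i).  The closed-form update is
        p_new[i] = 1 / (1/p[i] + eta * (loss[i] - lam)),
    where the normalizer lam is chosen so that p_new sums to 1.
    (Illustrative sketch, not the paper's full algorithm.)
    """
    p = np.asarray(p, dtype=float)
    loss = np.asarray(loss, dtype=float)
    inv = 1.0 / p + eta * loss

    def mass(lam):
        # Total probability mass as a function of the normalizer;
        # strictly increasing in lam for lam < min(inv) / eta.
        return np.sum(1.0 / (inv - eta * lam))

    # Bracket the root: mass(lo) <= 1 (all denominators >= n),
    # and mass(lam) -> infinity as lam -> hi from below.
    lo = (inv.min() - len(p)) / eta
    hi = inv.min() / eta
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:
            break  # bracket exhausted at float precision
        if mass(mid) < 1.0:
            lo = mid
        else:
            hi = mid

    p_new = 1.0 / (inv - eta * lo)  # lo keeps every denominator positive
    return p_new / p_new.sum()      # clean up residual numerical drift
```

In a master/base setting, `loss` would be an (importance-weighted) loss estimate for each base algorithm; note that a base algorithm with a low loss gets its probability increased, but the log-barrier keeps every probability strictly positive, so no base algorithm is ever fully starved of feedback.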