This paper presents new deviation inequalities that are valid uniformly in time under adaptive sampling in a multi-armed bandit model. The deviations are measured using the Kullback-Leibler divergence in a given one-dimensional exponential family, and may take into account several arms at a time. They are obtained by constructing for each arm a mixture martingale based on a hierarchical prior, and by multiplying those martingales. Our deviation inequalities allow us to analyze stopping rules based on generalized likelihood ratios for a large class of sequential identification problems, and to construct tight confidence intervals for some functions of the means of the arms.
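As a rough illustration of the quantity these inequalities control, the sketch below computes a sum over several arms of the number of draws times the Kullback-Leibler divergence between the empirical mean and a candidate mean. It assumes Bernoulli arms, which is one example of a one-dimensional exponential family; the function names `kl_bernoulli` and `summed_deviation` and the sample data are hypothetical and not taken from the paper. The time-uniform threshold that this statistic would be compared against is not stated in the abstract and is deliberately left out.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q).

    q is clipped away from 0 and 1 so the logarithms stay finite
    (an implementation convenience, not part of the theory).
    """
    p = min(max(p, 0.0), 1.0)
    q = min(max(q, eps), 1.0 - eps)
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def summed_deviation(counts, sums, means, arms):
    """Sum over `arms` of N_a * kl(empirical mean of arm a, means[a]).

    counts[a]: number of draws of arm a so far
    sums[a]:   sum of the rewards observed from arm a
    means[a]:  candidate mean for arm a
    """
    total = 0.0
    for a in arms:
        if counts[a] == 0:
            continue  # an arm that was never drawn contributes nothing
        mu_hat = sums[a] / counts[a]
        total += counts[a] * kl_bernoulli(mu_hat, means[a])
    return total


# Hypothetical data: arm 0 drawn 10 times (5 successes),
# arm 1 drawn 20 times (15 successes), both tested against mean 0.5.
counts = [10, 20]
sums = [5.0, 15.0]
stat = summed_deviation(counts, sums, means=[0.5, 0.5], arms=[0, 1])
print(stat)
```

In a stopping rule based on generalized likelihood ratios, a statistic of this kind would be compared at each round with a threshold depending on time and on the risk level; choosing that threshold so that the guarantee holds uniformly in time is what the deviation inequalities are for.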