To balance exploration and exploitation, multi-armed bandit algorithms need to conduct inference on the true mean reward of each arm at every time step using the data collected so far. However, the history of arms and rewards observed up to that time step is adaptively collected, and there are known challenges in conducting inference with non-i.i.d. data. In particular, sample averages, which play a prominent role in traditional upper confidence bound algorithms and traditional Thompson sampling algorithms, are neither unbiased nor asymptotically normal. We propose a variant of a Thompson sampling based algorithm that leverages recent advances in the causal inference literature and adaptively re-weights the terms of a doubly robust estimator of the true mean reward of each arm; hence its name, doubly-adaptive Thompson sampling. The regret of the proposed algorithm matches the optimal (minimax) regret rate, and we evaluate it empirically in a semi-synthetic experiment based on data from a randomized controlled trial of a web service: the proposed doubly-adaptive Thompson sampling outperforms existing baselines in terms of both cumulative regret and statistical power in identifying the best arm. Further, we extend this approach to contextual bandits, where sources of bias beyond the adaptive data collection are present, such as the mismatch between the true data-generating process and the reward model assumptions, or the unequal representation of certain regions of the context space in the initial stages of learning, and propose the linear contextual doubly-adaptive Thompson sampling and the non-parametric contextual doubly-adaptive Thompson sampling extensions of our approach.
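To make the core idea concrete, the following is a minimal sketch (not the authors' reference implementation) of Thompson sampling whose arm-mean estimates are adaptively weighted doubly robust (AIPW) scores rather than sample averages. It assumes Gaussian rewards with unit variance, Gaussian posteriors, Monte-Carlo-estimated assignment propensities, and variance-stabilizing weights h_t(a) = sqrt(e_t(a)), a common choice in the adaptive-weighting literature; the paper's exact model and weighting scheme may differ, and all names and constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, N_MC = 3, 2000, 500          # arms, horizon, Monte Carlo draws (illustrative)
true_means = np.array([0.1, 0.3, 0.5])

# Sufficient statistics for a unit-variance Gaussian model per arm.
n = np.zeros(K)
reward_sum = np.zeros(K)

# Running numerator / denominator of the weighted doubly robust estimate.
num = np.zeros(K)
den = np.zeros(K)

for t in range(T):
    post_mean = reward_sum / np.maximum(n, 1)
    post_sd = 1.0 / np.sqrt(np.maximum(n, 1))

    # Monte Carlo estimate of the Thompson sampling propensities e_t(a),
    # i.e. the probability each arm maximizes a posterior draw.
    draws = rng.normal(post_mean, post_sd, size=(N_MC, K))
    e = np.bincount(draws.argmax(axis=1), minlength=K) / N_MC
    e = np.clip(e, 1e-3, 1.0)      # keep propensities bounded away from zero

    # Select an arm according to the (renormalized) propensities and observe a reward.
    a = rng.choice(K, p=e / e.sum())
    r = rng.normal(true_means[a], 1.0)

    # Doubly robust (AIPW) score for every arm at this step: the plug-in
    # posterior mean, plus an inverse-propensity correction for the pulled arm.
    gamma = post_mean.copy()
    gamma[a] += (r - post_mean[a]) / e[a]

    # Adaptive weights h_t(a) = sqrt(e_t(a)) stabilize the variance of the scores.
    h = np.sqrt(e)
    num += h * gamma
    den += h

    n[a] += 1
    reward_sum[a] += r

print("weighted DR estimates:", num / den)
print("sample means:         ", reward_sum / np.maximum(n, 1))
```

In this sketch the inverse-propensity correction removes the bias that adaptive data collection induces in plain sample averages, while the sqrt-propensity weights damp the high-variance scores from rarely pulled arms, which is the "doubly adaptive" ingredient the abstract refers to.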