Contextual bandit algorithms are increasingly replacing non-adaptive A/B tests in e-commerce, healthcare, and policymaking because they can both improve outcomes for study participants and increase the chance of identifying good or even best policies. To support credible inference on novel interventions at the end of the study, we nonetheless still want to construct valid confidence intervals on average treatment effects, subgroup effects, or the value of new policies. The adaptive nature of the data collected by contextual bandit algorithms makes this difficult: standard estimators are no longer asymptotically normally distributed, and classic confidence intervals fail to provide correct coverage. While this has been addressed in non-contextual settings by using stabilized estimators, the contextual setting poses unique challenges that we tackle for the first time in this paper. We propose the Contextual Adaptive Doubly Robust (CADR) estimator, the first estimator for policy value that is asymptotically normal under contextual adaptive data collection. The main technical challenge in constructing CADR is designing adaptive and consistent conditional standard deviation estimators for stabilization. Extensive numerical experiments using 57 OpenML datasets demonstrate that confidence intervals based on CADR uniquely provide correct coverage.
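To make the idea of stabilization concrete, the following is a minimal sketch of a generic stabilized estimator, not the CADR construction itself. It assumes we already have per-round unbiased scores (for example, doubly robust scores) and per-round conditional standard deviation estimates, each computed only from data observed before that round. The function name `stabilized_estimate` and its arguments are hypothetical. Each score is weighted by the inverse of its standard deviation estimate, and a normal confidence interval is formed from the summed weights.

```python
from math import sqrt
from statistics import NormalDist


def stabilized_estimate(scores, sigmas, level=0.95):
    """Inverse-standard-deviation weighted average of per-round scores, with a normal CI.

    scores: per-round unbiased scores (e.g. doubly robust scores); hypothetical input.
    sigmas: per-round conditional standard deviation estimates, each assumed to be
            built only from data observed before that round.
    """
    if len(scores) != len(sigmas) or not scores:
        raise ValueError("scores and sigmas must be non-empty and of equal length")
    if any(s <= 0 for s in sigmas):
        raise ValueError("standard deviation estimates must be positive")

    # Stabilization weights: noisy rounds are down-weighted.
    inv = [1.0 / s for s in sigmas]
    total_weight = sum(inv)
    estimate = sum(w * d for w, d in zip(inv, scores)) / total_weight

    # After rescaling by 1/sigma_t, each term has (approximately) unit conditional
    # variance, so the half-width is z * sqrt(T) / sum_t(1 / sigma_t).
    t = len(scores)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sqrt(t) / total_weight
    return estimate, estimate - half_width, estimate + half_width
```

When all standard deviation estimates are equal, this reduces to the sample mean with the usual interval of half-width z·σ/√T. The hard part in the contextual setting, as the abstract notes, is producing the standard deviation estimates themselves, which this sketch takes as given.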