Distributionally Robust Batch Contextual Bandits

Policy learning using historical observational data is an important problem with widespread applications. Examples include selecting offers, prices, or advertisements to send to customers, as well as selecting which medication to prescribe to a patient. However, existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment that generated the data, an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data. We first present a policy evaluation procedure that allows us to assess how well a policy performs under the worst-case environment shift. We then establish a central-limit-theorem-type guarantee for this proposed policy evaluation scheme. Leveraging this evaluation scheme, we further propose a novel learning algorithm that learns a policy robust to adversarial perturbations and unknown covariate shifts, with a performance guarantee based on the theory of uniform convergence. Finally, we empirically test the effectiveness of our proposed algorithm on synthetic datasets and demonstrate that it provides the robustness that standard policy learning algorithms lack. We conclude the paper by providing a comprehensive application of our methods in the context of a real-world voting dataset.
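To make the idea of worst-case policy evaluation concrete, here is a minimal sketch, not the paper's procedure. The abstract does not say how the environment shift is measured. The sketch assumes the shifted distribution lies in a Kullback-Leibler ball of radius `delta` around the data-generating one, and it uses a self-normalised inverse-propensity estimate of the policy's rewards. Under those assumptions the worst-case expected reward has the well-known dual form sup over alpha > 0 of -alpha * log E[exp(-Y / alpha)] - alpha * delta. The function name and all parameter names are hypothetical.

```python
import math


def robust_policy_value(rewards, actions, propensities, policy_actions, delta,
                        alpha_lo=1e-3, alpha_hi=1e3, iters=200):
    """Worst-case policy value over a KL ball of radius delta (illustrative only).

    Assumption: the uncertainty set is a KL ball; the source does not specify it.
    """
    # Inverse-propensity weights: keep only samples where the logged action
    # matches the action the evaluated policy would have taken.
    w = [(1.0 / p) if a == pa else 0.0
         for a, p, pa in zip(actions, propensities, policy_actions)]
    total = sum(w)
    if total == 0.0:
        raise ValueError("no logged sample matches the policy's actions")
    w = [wi / total for wi in w]  # self-normalise so the weights sum to 1

    kept = [(wi, y) for wi, y in zip(w, rewards) if wi > 0.0]
    y_min = min(y for _, y in kept)

    def dual(alpha):
        # -alpha * log E[exp(-Y / alpha)] - alpha * delta, shifted by y_min
        # so that every exponent is <= 0 and cannot overflow.
        s = sum(wi * math.exp(-(y - y_min) / alpha) for wi, y in kept)
        return y_min - alpha * math.log(s) - alpha * delta

    # The dual objective is concave in alpha, so ternary search over the
    # bracket [alpha_lo, alpha_hi] locates its maximum.
    lo, hi = alpha_lo, alpha_hi
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual(m1) < dual(m2):
            lo = m1
        else:
            hi = m2
    return dual(0.5 * (lo + hi))


# Toy logged data (made up for illustration): two actions, uniform logging policy.
rewards = [1.0, 0.0, 1.0, 0.5, 0.2, 0.9]
actions = [0, 1, 0, 1, 0, 1]
propensities = [0.5] * 6
policy_actions = [0, 0, 0, 1, 1, 1]  # what the evaluated policy would choose

nominal = robust_policy_value(rewards, actions, propensities, policy_actions, delta=0.0)
robust = robust_policy_value(rewards, actions, propensities, policy_actions, delta=0.1)
print(f"nominal value: {nominal:.4f}, worst-case value: {robust:.4f}")
```

With `delta=0.0` the estimate approaches the ordinary weighted mean reward, limited only by the finite `alpha_hi` bracket. Increasing `delta` lowers the value toward the smallest matched reward, which is the sense in which the evaluation is pessimistic about environment shift.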
