Many applied decision-making problems have a dynamic component: the policymaker needs not only to choose whom to treat, but also when to start which treatment. For example, a medical doctor may see a patient many times and, at each visit, need to choose among prescribing an invasive procedure, prescribing a non-invasive procedure, or postponing the decision until the next visit. In this paper, we develop an \say{advantage doubly robust} estimator for learning such dynamic treatment rules from observational data under sequential ignorability. We prove welfare regret bounds that generalize results for doubly robust learning in the single-step setting, and show promising empirical performance in several different contexts. Our approach is practical for policy optimization and requires no structural (e.g., Markovian) assumptions.
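To make the single-step baseline concrete, the following is a minimal sketch of a standard doubly robust (AIPW) estimator of a fixed policy's value, the kind of estimator the abstract's regret bounds generalize to the dynamic setting. This is an illustration, not the paper's advantage doubly robust estimator: it assumes a simulated dataset where the propensity score and the conditional outcome means are known and plugged in directly, whereas in practice both would be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated observational data: covariate X, binary treatment A, outcome Y.
# (Hypothetical data-generating process, chosen only for illustration.)
n = 5000
X = rng.normal(size=n)
propensity = 1 / (1 + np.exp(-X))          # P(A=1 | X), assumed known here
A = rng.binomial(1, propensity)
Y = X * A + rng.normal(scale=0.5, size=n)  # treatment helps when X > 0

# Candidate policy to evaluate: treat exactly when X > 0.
pi = (X > 0).astype(int)

# Outcome-model predictions mu_a(X) = E[Y | X, A=a]; we plug in the true
# conditional means to keep the sketch short.
mu1 = X
mu0 = np.zeros(n)

# AIPW score: model prediction under the policy's action, plus an inverse-
# propensity-weighted residual correction for units the policy agrees with.
mu_pi = np.where(pi == 1, mu1, mu0)
p_pi = np.where(pi == 1, propensity, 1 - propensity)  # P(A = pi(X) | X)
score = mu_pi + (A == pi) / p_pi * (Y - mu_pi)

print(f"DR estimate of policy value: {score.mean():.3f}")
```

The estimator is "doubly robust" because the score has mean equal to the true policy value if either the outcome model or the propensity model is correct; here both are correct, so the average should be close to the truth (about 0.399 for this simulation).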