On Instrumental Variable Regression for Deep Offline Policy Evaluation

We show that the popular reinforcement learning (RL) strategy of estimating the state-action value (Q-function) by minimizing the mean squared Bellman error leads to a regression problem with confounding, the inputs and output noise being correlated. Hence, direct minimization of the Bellman error can result in significantly biased Q-function estimates. We explain why fixing the target Q-network in Deep Q-Networks and Fitted Q Evaluation provides a way of overcoming this confounding, thus shedding new light on this popular but not well understood trick in the deep RL literature. An alternative approach to address confounding is to leverage techniques developed in the causality literature, notably instrumental variables (IV). We bring together here the literature on IV and RL by investigating whether IV approaches can lead to improved Q-function estimates. This paper analyzes and compares a wide range of recent IV methods in the context of offline policy evaluation (OPE), where the goal is to estimate the value of a policy using logged data only. By applying different IV techniques to OPE, we are not only able to recover previously proposed OPE methods such as model-based techniques but also to obtain competitive new techniques. We find empirically that state-of-the-art OPE methods are closely matched in performance by some IV methods such as AGMM, which were not developed for OPE. We open-source all our code and datasets at https://github.com/liyuan9988/IVOPEwithACME.
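The confounding argument above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's code: it uses a hypothetical two-element Markov chain over state-action pairs (as induced by a fixed target policy), one-hot features, and invented names such as `w_brm`, `w_iv`, and `w_fqe`. It compares direct least-squares minimization of the Bellman error, a just-identified two-stage least squares estimator that uses the current state-action features as the instrument, and Fitted Q Evaluation with the target held fixed within each iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical toy chain over two state-action pairs under the target policy.
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])           # transition probabilities between pairs
r_mean = np.array([0.0, 1.0])        # expected reward of each pair
n_pairs = len(r_mean)

# Ground truth from the Bellman equation: Q = r + gamma * P @ Q.
q_true = np.linalg.solve(np.eye(n_pairs) - gamma * P, r_mean)

# Logged transitions (sa, r, sa_next) with noisy rewards.
n_samples = 50_000
sa = rng.integers(n_pairs, size=n_samples)
sa_next = np.empty(n_samples, dtype=int)
for i in range(n_pairs):
    mask = sa == i
    sa_next[mask] = rng.choice(n_pairs, size=int(mask.sum()), p=P[i])
r = r_mean[sa] + 0.1 * rng.standard_normal(n_samples)

phi = np.eye(n_pairs)[sa]            # one-hot features of the current pair
phi_next = np.eye(n_pairs)[sa_next]  # one-hot features of the next pair
X = phi - gamma * phi_next           # regressor in r = X @ w + noise

# 1) Direct minimization of the mean squared Bellman error.
#    X depends on the sampled next pair, and so does the noise: confounded.
w_brm = np.linalg.lstsq(X, r, rcond=None)[0]

# 2) Instrumental variable estimate with instrument Z = phi.
#    With as many instruments as regressors, 2SLS is (Z^T X)^{-1} Z^T r.
w_iv = np.linalg.solve(phi.T @ X, phi.T @ r)

# 3) Fitted Q Evaluation: regress a fixed target on phi, then refresh it.
w_fqe = np.zeros(n_pairs)
for _ in range(300):
    target = r + gamma * phi_next @ w_fqe   # target held fixed in this step
    w_fqe = np.linalg.lstsq(phi, target, rcond=None)[0]


def max_abs_error(w):
    """Largest absolute deviation from the true Q-values."""
    return float(np.max(np.abs(w - q_true)))


print("true Q          :", q_true)
print("Bellman residual:", w_brm, "error", max_abs_error(w_brm))
print("IV (2SLS)       :", w_iv, "error", max_abs_error(w_iv))
print("FQE             :", w_fqe, "error", max_abs_error(w_fqe))
```

In this toy setting the direct Bellman error estimate shrinks the gap between the two Q-values no matter how much data is logged, while the IV and FQE estimates agree with each other and approach the true values. The tabular, just-identified case is chosen only to keep the sketch short; the deep IV methods compared in the paper are not reproduced here.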
