Learning Bellman Complete Representations for Offline Policy Evaluation

We study representation learning for Offline Reinforcement Learning (RL), focusing on the important task of Offline Policy Evaluation (OPE). Recent work shows that, in contrast to supervised learning, realizability of the Q-function is not enough for learning it. Two sufficient conditions for sample-efficient OPE are Bellman completeness and coverage. Prior work often assumes that representations satisfying these conditions are given, and its results are mostly theoretical. In this work, we propose BCRL, which directly learns from data an approximately linear Bellman complete representation with good coverage. With this learned representation, we perform OPE using Least Squares Policy Evaluation (LSPE) with functions that are linear in the learned representation. We present an end-to-end theoretical analysis, showing that our two-stage algorithm enjoys polynomial sample complexity provided some representation in the rich class considered is linear Bellman complete. Empirically, we extensively evaluate our algorithm on challenging, image-based continuous control tasks from the DeepMind Control Suite. We show that our representation enables better OPE than previous representation learning methods developed for off-policy RL (e.g., CURL, SPR). BCRL achieves OPE error competitive with the state-of-the-art method Fitted Q-Evaluation (FQE), and beats FQE when evaluating beyond the initial state distribution. Our ablations show that both the linear Bellman completeness and coverage components of our method are crucial.
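To make the second stage concrete, the following is a minimal sketch of LSPE with linear functions over a fixed feature map. It is an illustration under stated assumptions, not the authors' implementation: the function name `lspe`, its parameters, the ridge term, and the toy data are all hypothetical, and the representation-learning stage of BCRL is not shown (the features are simply taken as given). Each iteration regresses the one-step Bellman target onto the features; if the features are linear Bellman complete, every such target is itself linear in the features, which is why the iteration is well behaved.

```python
import numpy as np


def lspe(phi, phi_next, rewards, gamma=0.99, n_iters=200, ridge=1e-6):
    """Least Squares Policy Evaluation with linear features (illustrative sketch).

    phi:      (n, d) features of the logged (state, action) pairs
    phi_next: (n, d) features of (next_state, target_policy_action) pairs
    rewards:  (n,) observed rewards
    Returns theta such that Q(s, a) is approximated by phi(s, a) @ theta.
    """
    n, d = phi.shape
    # The regression design is the same at every iteration, so build the
    # (ridge-regularized) empirical feature covariance once.
    cov = phi.T @ phi / n + ridge * np.eye(d)
    theta = np.zeros(d)
    for _ in range(n_iters):
        # One-step Bellman target under the current linear estimate.
        targets = rewards + gamma * (phi_next @ theta)
        # Least-squares fit of the targets onto the features.
        theta = np.linalg.solve(cov, phi.T @ targets / n)
    return theta


# Hypothetical toy problem: two states with one-hot features.
# State 0 moves to state 1 with reward 0; state 1 loops on itself with reward 1.
phi = np.array([[1.0, 0.0], [0.0, 1.0]])
phi_next = np.array([[0.0, 1.0], [0.0, 1.0]])
rewards = np.array([0.0, 1.0])
theta = lspe(phi, phi_next, rewards, gamma=0.5)
print(theta)
```

With discount 0.5 the exact values in this toy problem are 1 for state 0 and 2 for state 1, and the estimate should land very close to them (the small ridge term introduces a slight bias). The invertibility of `cov` is where coverage enters: when the data do not excite some feature direction, the regression in that direction is determined only by the regularizer.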
