Fitted Q-evaluation (FQE) is a central method for off-policy evaluation in reinforcement learning, but it generally requires Bellman completeness: the hypothesis class must be closed under the policy evaluation Bellman operator. This requirement is difficult to satisfy because completeness is not monotone in the hypothesis class: enlarging the class can break it. We show that the need for this assumption stems from a fundamental norm mismatch: the Bellman operator is a γ-contraction in the norm weighted by the target policy's stationary distribution, whereas FQE minimizes Bellman error in the norm weighted by the behavior distribution. We propose a simple fix: reweight each regression step by an estimate of the stationary density ratio, thereby aligning FQE with the norm in which the Bellman operator contracts. This yields strong evaluation guarantees in the absence of realizability or Bellman completeness, avoiding the geometric error blow-up that standard FQE suffers in this setting while retaining the practicality of regression-based evaluation.
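To make the reweighting concrete, the following is a minimal sketch of one way a density-ratio-weighted FQE loop could look. It is not the paper's implementation: it assumes linear function approximation, a finite action space, and a precomputed estimate of the stationary density ratio, and the names `weighted_fqe`, `phi`, and the data layout are illustrative.

```python
import numpy as np

def weighted_fqe(phi, data, pi_probs_next, weights, gamma=0.99, n_iters=100, reg=1e-3):
    """Density-ratio-weighted fitted Q-evaluation with linear features (illustrative sketch).

    phi           : feature map, phi(states, actions) -> (n, d) array (hypothetical helper)
    data          : dict with arrays 's', 'a', 'r', 's_next' of length n, logged under the behavior policy
    pi_probs_next : (n, n_actions) array of target-policy probabilities pi(a' | s'_i)
    weights       : (n,) array of estimated stationary density ratios w_i ~ d^pi(s_i, a_i) / d^mu(s_i, a_i)
    """
    s, a, r, s_next = data['s'], data['a'], data['r'], data['s_next']
    n = len(r)
    n_actions = pi_probs_next.shape[1]

    X = phi(s, a)                 # features at the logged state-action pairs
    d = X.shape[1]
    theta = np.zeros(d)

    # Features of (s'_i, a') for every candidate next action a'.
    X_next = [phi(s_next, np.full(n, ap)) for ap in range(n_actions)]

    for _ in range(n_iters):
        # Regression targets: y_i = r_i + gamma * E_{a' ~ pi(.|s'_i)}[ Q_theta(s'_i, a') ]
        q_next = np.stack([Xa @ theta for Xa in X_next], axis=1)       # (n, n_actions)
        y = r + gamma * np.sum(pi_probs_next * q_next, axis=1)

        # Weighted least squares: minimize sum_i w_i (phi(s_i, a_i)^T theta - y_i)^2 + reg * ||theta||^2.
        # The weights re-align each regression step with the target policy's stationary
        # distribution, the norm in which the evaluation Bellman operator contracts.
        WX = X * weights[:, None]
        theta = np.linalg.solve(X.T @ WX + reg * np.eye(d), WX.T @ y)

    return theta                  # Q_hat(s, a) is approximated by phi(s, a) @ theta
```

Setting `weights` to all ones recovers standard FQE, which regresses under the behavior distribution; the weighted variant instead performs each least-squares step under the estimated stationary distribution of the target policy.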