We introduce Iterated Bellman Calibration, a simple, model-agnostic, post-hoc procedure for calibrating off-policy value predictions in infinite-horizon Markov decision processes. Bellman calibration requires that states with similar predicted long-term returns exhibit one-step returns consistent with the Bellman equation under the target policy. We adapt classical histogram and isotonic calibration to the dynamic, counterfactual setting by repeatedly regressing fitted Bellman targets onto a model's predictions, using a doubly robust pseudo-outcome to handle off-policy data. This yields a one-dimensional fitted value iteration scheme that can be applied to any value estimator. Our analysis provides finite-sample guarantees for both calibration and prediction under weak assumptions and, critically, without requiring Bellman completeness or realizability.
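A minimal sketch of the iteration, assuming logged transition arrays `s, a, r, s_next`, a callable `v_hat` returning the model's value predictions, and precomputed importance ratios `rho` (all hypothetical names); it substitutes a plain importance-weighted Bellman target for the paper's doubly robust pseudo-outcome and uses scikit-learn's `IsotonicRegression` as the one-dimensional regressor.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def iterated_bellman_calibration(v_hat, s, r, s_next, rho, gamma=0.99, n_iters=50):
    """Illustrative sketch: calibrate value predictions by repeatedly regressing
    one-step Bellman targets onto the model's own (scalar) predictions.

    v_hat  : callable mapping a batch of states to value predictions (the model to calibrate)
    s      : array of observed states from the behavior policy
    r      : array of observed one-step rewards
    s_next : array of successor states
    rho    : array of importance ratios pi(a|s) / mu(a|s) for the target policy
    """
    f = v_hat(s)            # raw predictions at visited states (the 1-D covariate)
    f_next = v_hat(s_next)  # raw predictions at successor states
    g = lambda x: x         # current calibration map; starts as the identity

    for _ in range(n_iters):
        # Importance-weighted Bellman target built from the *calibrated* next-state
        # values. (The paper uses a doubly robust pseudo-outcome; plain importance
        # weighting is used here purely for illustration.)
        target = rho * (r + gamma * g(f_next))

        # One-dimensional regression of targets onto the model's predictions.
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(f, target)
        g = iso.predict      # updated calibration map

    # Calibrated value predictor: compose the learned 1-D map with the base model.
    return lambda states: g(v_hat(states))
```

Because each step is a one-dimensional regression of Bellman targets onto the model's own predictions, the procedure is cheap, preserves the ordering of predictions when isotonic regression is used, and is agnostic to how `v_hat` was trained.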