Marginalized importance sampling (MIS), which measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution, is a promising approach for off-policy evaluation. However, current state-of-the-art MIS methods rely on complex optimization tricks and succeed mostly on simple toy problems. We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep reinforcement learning methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains. We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
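The connection between the successor representation and the occupancy density ratio can be illustrated in the tabular case. The sketch below is purely illustrative and assumes a small MDP with known dynamics under the target policy; the paper's method instead learns these quantities with deep RL in high-dimensional domains.

```python
import numpy as np

np.random.seed(0)
n_states, gamma = 5, 0.9

# Assumed transition matrix under the target policy pi (illustrative only).
P_pi = np.random.rand(n_states, n_states)
P_pi /= P_pi.sum(axis=1, keepdims=True)

# Successor representation of pi: M[s, s'] = expected discounted visits to s'.
M = np.linalg.inv(np.eye(n_states) - gamma * P_pi)

# Discounted state occupancy of pi from an initial distribution d0.
d0 = np.ones(n_states) / n_states
d_pi = (1 - gamma) * d0 @ M  # a valid distribution (sums to 1)

# MIS density ratio w = d_pi / d_D against a sampling distribution d_D.
d_D = np.random.dirichlet(np.ones(n_states))
w = d_pi / d_D

# Off-policy evaluation: reweighting samples from d_D by w recovers
# the expected reward under d_pi.
r = np.random.rand(n_states)
assert np.isclose(d_D @ (w * r), d_pi @ r)
```

The key point is that the occupancy `d_pi`, and hence the ratio `w`, falls out of the successor representation `M` without any optimization over the ratio itself, which is what decouples the estimate from the environment dynamics.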