We present an algorithm, HOMER, for exploration and reinforcement learning in rich observation environments that are summarizable by an unknown latent state space. The algorithm interleaves representation learning to identify a new notion of kinematic state abstraction with strategic exploration to reach new states using the learned abstraction. The algorithm provably explores the environment with sample complexity scaling polynomially in the number of latent states and the time horizon, and, crucially, with no dependence on the size of the observation space, which could be infinitely large. This exploration guarantee further enables sample-efficient global policy optimization for any reward function. On the computational side, we show that the algorithm can be implemented efficiently whenever certain supervised learning problems are tractable. Empirically, we evaluate HOMER on a challenging exploration problem, where we show that the algorithm is exponentially more sample efficient than standard reinforcement learning baselines.
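The interleaving described above — learning a state abstraction from observations, then using it to train policies that reach new latent states — can be sketched in miniature. This is an illustrative toy only, not the paper's actual algorithm: the environment (`ToyEnv`), the `abstract` function (a hand-coded projection standing in for the learned kinematic abstraction), and the open-loop "policy cover" are all simplifying assumptions made for this sketch.

```python
import random

class ToyEnv:
    """Combination-lock style environment: rich observations (latent state
    plus an irrelevant noise coordinate) generated from few latent states."""
    def __init__(self, horizon=3, seed=0):
        self.horizon = horizon
        self.noise = random.Random(seed)

    def reset(self):
        self.t, self.latent = 0, 0
        return self._obs()

    def step(self, action):
        # The latent state advances only on the "correct" action; any
        # other action resets progress, so random search rarely gets deep.
        self.latent = self.latent + 1 if action == self.latent % 2 else 0
        self.t += 1
        return self._obs(), self.t >= self.horizon

    def _obs(self):
        return (self.latent, self.noise.random())

def abstract(obs):
    # Stand-in for the learned abstraction: project out the noise
    # coordinate.  HOMER instead *learns* such a map by solving a
    # supervised (contrastive) learning problem over observations.
    return obs[0]

def homer_style_loop(env, actions=(0, 1), episodes=200):
    """Return, per time step t, a map from each reachable abstract state
    to an action sequence reaching it (a crude 'policy cover')."""
    rng = random.Random(1)
    cover = {0: {abstract(env.reset()): []}}
    for t in range(env.horizon):
        reached = {}
        for _ in range(episodes):
            # Replay a plan to a known abstract state at step t, then
            # take one random exploratory action to discover new states.
            plan = rng.choice(list(cover[t].values()))
            env.reset()
            for a in plan:
                env.step(a)
            a = rng.choice(actions)
            obs, _ = env.step(a)
            reached.setdefault(abstract(obs), plan + [a])
        cover[t + 1] = reached
    return cover
```

Because the cover at step t seeds exploration at step t+1, the loop reaches latent states at depth H using samples polynomial in the number of latent states and horizon, whereas uniformly random action sequences would need exponentially many episodes to survive the reset dynamics — the same phenomenon, in toy form, behind the exponential separation reported in the abstract's experiments.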